The Open University
Mathematics/Social Sciences/Science/Technology
An inter-faculty second level course
Unit C2 Scientific experiments
STATISTICS
The Open University
Mathematics/Social Sciences/Science/Technology
An inter-faculty second level course
MDST242 Statistics in Society
Block C Data in context
Unit C2 Scientific experiments
Prepared by the course team
The Open University
The Open University
Walton Hall, Milton Keynes
MK7 6AA
First published 1983. Reprinted 1986, 1989, 1993, 1997, 1998
Copyright © 1983 The Open University
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other
means, without permission in writing from the publisher.
Designed by the Graphic Design Group of the Open University.
Printed in Great Britain at The Alden Press, Oxford.
This text forms part of an Open University course. The complete list of units in the course appears at
the end of this text.
For general availability of supporting material referred to in this text, please write to Open University
Educational Enterprises Limited, 12 Cofferidge Close, Stony Stratford, Milton Keynes MK11 1BY,
Great Britain.
Further information on Open University courses may be obtained from the Admissions Office, The
Open University, P.O. Box 48, Walton Hall, Milton Keynes MK7 6AB.
Contents
Introduction
Scientific experiments
What are experiments?
Different kinds of experiment
Experiments and statistics
Carrying out an experiment
Purpose of the experiment
Setting up the experiment
Maintaining the experiment
Cutting the stems
Completing the experiment
Variability and systematic errors
The t-test
Which test?
The z-test reconsidered
The two-sample t-test
Analysis of mustard seedling data
The t-test continued
Why the t-test?
Checking the conditions
The one-sample t-test
The matched-pairs t-test
Other matched-pairs tests
Confidence intervals from t-tests
Analysis of variance
Solutions to exercises
Index
Acknowledgements
Introduction
Unit C1 introduced just one particular area, drug testing, in which experimentation
plays an important role. Although drug testing is undoubtedly an extremely important
activity, it is by no means the only one in which experimentation figures prominently.
Much of present-day technological knowledge in areas as diverse as agriculture and
electronics depends to a greater or lesser extent on knowledge which has been collected
from scientific experiments. This unit discusses more fully than was possible in the last
exactly what is involved in a scientific experiment and what are its key features.
In Section 2 of this unit, you are asked to set up an experiment on the growth of plants,
using mustard (or cress) seeds. This should give you an idea of some of the problems
that are involved in even the simplest of experiments. The timing of the experiment is
important and you therefore need to plan your schedule accordingly.
It will probably take you about one-and-a-half hours to set up the experiment,
assuming that you have already collected the items in the list which we sent you some
weeks ago. You will have to spend a few minutes on your experiment for each one of the
4 or 5 days after setting it up and then about one-and-a-half hours taking
measurements from your plants.
The analysis of the data from this experiment uses a new test which is similar to the
z-test. This test is called the t-test and it is the major topic of Sections 3 and 4.
1 Scientific experiments
In this section we shall first describe those forms of inquiry for which we use the term
experiment. We shall then describe three different kinds of experiment common in
scientific inquiry before looking at one of these kinds in more detail because it is the one
which most commonly involves statistical analysis.
1.1 What are experiments?
The term experiment means a variety of things to a variety of people. To many it
conjures up a vision of a white-coated individual (usually a balding man with National
Health spectacles) surrounded by dials, flashing lights and vaporous fluids bubbling
sullenly in mysteriously coiled vessels. To others an experiment involves no more than
adding a new ingredient to a tried and trusted flan recipe to see if the taste is improved,
or planting the bulbs earlier than in previous years to see if they do better. These uses
are all perfectly proper: it is not what you do that qualifies an activity as an experiment,
it is the way you do it. In other words, the area in which you carry out an experiment
might be nuclear physics or it might be cookery, neither lies outside the province of
experimentation, but if other people are going to accept that you have carried out an
experiment then you must use certain methods and procedures.
One fundamental feature of any experiment is that it is a public activity. This does not
mean that an experiment has to be performed in front of a crowd. It means that
everything that an experimenter does should be open to inspection by other people. If
person (A) carries out an experiment then he or she should be able to explain everything
that took place during the experiment in such a way that another person (B) could, if
necessary, go through exactly the same procedure. The results of this experiment should,
again, be open to public scrutiny so that B's results can be compared with A's. This idea
is already familiar from Unit C1. If a drug company carries out a clinical trial on a new
beta-blocker then they must be able to describe every part of their procedure to the
outside world: including how the experiment was designed; how many subjects were
involved; how the drug was administered and in what quantities; how its effects were
measured etc. In fact, the Committee on the Safety of Medicines requires such
information before they grant a product licence. In just the same way, a cook who
experiments by altering an ingredient in a recipe should be able to explain every part of
the revised recipe so that other people could repeat the revised procedure exactly.
A second fundamental feature of an experiment is that it sets out to answer a specific
question. Does this drug alter blood pressure? Does this ingredient improve the taste of
the flan? Does planting the bulbs earlier produce better flowers? Notice that each of
these questions is framed in such a way as to demand that something be measured if the
question is to be answered: namely blood pressure, taste or flower quality.
Activity 1.1 How should a cook carry out an experiment to discover whether the
addition of an extra ingredient improves the taste of a flan?
Solution The procedure could be exactly the same as that which a drug company uses
to discover whether a new drug reduces headaches. The cook prepares two flans, one
with and one without the new ingredient. A panel of tasters is recruited and each taster
tries each flan in a double-blind experiment. A crossover design would probably be
most suitable so that the tasters act as their own controls, i.e. each tastes both flans. Half
of the tasters test the old recipe first and the other half test the new recipe first. Each
taster rates the taste of each flan on a 5-point scale and the ratings of the new and the old
recipes are analysed by using a hypothesis test.
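The final analysis step in this solution can be sketched in code. The ratings below are invented for illustration. Because each taster rates both flans, a simple matched-pairs analysis is the sign test, which uses only the direction of each taster's preference:

```python
from math import comb

# Invented 5-point ratings: each of eight tasters rates both flans.
old_recipe = [3, 4, 2, 3, 4, 3, 2, 3]
new_recipe = [4, 4, 3, 4, 5, 3, 3, 4]

diffs = [new - old for new, old in zip(new_recipe, old_recipe)]
wins = sum(d > 0 for d in diffs)   # tasters preferring the new recipe
n = sum(d != 0 for d in diffs)     # ties carry no information

# Exact two-sided sign-test p-value: the probability, if preferences
# were really a 50:50 coin toss, of a split at least this extreme.
k = max(wins, n - wins)
p_value = min(2 * sum(comb(n, i) for i in range(k, n + 1)) / 2**n, 1.0)
print(f"{wins} of {n} non-tied tasters preferred the new recipe; p = {p_value:.4f}")
```

A small p-value would lead us to reject the null hypothesis that the new ingredient makes no difference, in just the way the drug trials of Unit C1 were analysed.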
This activity shows that it is possible to carry out scientific experiments on all sorts of
things, not just on formally recognized scientific subjects. To illustrate the difference
between the scientific approach to a question and other forms of inquiry, it is instructive
to consider how a scientific experiment might be carried out on such an unlikely subject
as poetry appreciation. Take a poem such as Keats's ode To Autumn (Figure 1.1).
Figure 1.1
A literary critic might ask the question: 'Is this a great poem?'. Framed in this way
the question is not amenable to scientific experiment. People might vary in their opinion
of the poem and they would probably produce more or less convincing evidence to
support their viewpoint. One might point to the images which the poet uses and argue
that they successfully convey the mood of autumn, another might complain that the
overall effect is too languid for his taste etc. Although people might agree as to what are
good and bad criteria for judging the greatness of a poem, and although informed
opinion might generally agree as to the greatness of this particular poem, there is no
sense in which assessing the greatness of this poem is a scientific experiment. However,
it is possible to ask questions about the poem which are open to experimental
investigation.
Activity 1.2 Since the poem has three verses, someone might argue that the third verse
is inferior to the others, and that without it the poem is greatly improved. How might
you test this experimentally?
Solution One possibility is to use two groups of people, as similar as possible with
respect to those features which you think might be relevant to poetry appreciation, such
as age, sex, educational background, and ignorance of the poem in question. Each
person in one group would be asked to read the three-verse version of the poem and
rate how good it is on a 7-point scale. Each person in the other group would be asked to
read, and rate, the truncated two-verse version. The two batches of ratings would then
be analysed by applying the Mann-Whitney test. ■
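As a sketch of that analysis (the ratings below are invented, and SciPy's `mannwhitneyu` is assumed to be available):

```python
from scipy.stats import mannwhitneyu

# Invented 7-point ratings from two independent groups of readers.
three_verse = [6, 5, 7, 4, 6, 5, 6, 5]
two_verse = [4, 5, 3, 5, 4, 6, 3, 4]

# The Mann-Whitney test compares two independent batches of ordinal
# ratings without assuming they come from normal populations.
stat, p = mannwhitneyu(three_verse, two_verse, alternative="two-sided")
print(f"U = {stat}, p = {p:.3f}")
```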
Of course nobody would imagine that this particular experiment is of any literary
value. Our purpose is simply to show that experimental investigation need not be
limited to the physical world.
1.2 Different kinds of experiment
All experiments have in common these two important features:
they are public
they produce measurements or observations designed to answer specific questions.
Nonetheless there are different kinds of experiment: they differ in that they address
different kinds of question. This is not a case of, for example, some experiments being
about physics and others about chemistry; it is a case of the questions being different in
their purpose.
Some experiments answer questions of the form
What happens if I do this?
Some of the experiments that you met in Unit C1 answered questions of this sort. For
example, the question
What happens if I give this person this new drug?
might arise in the early stages of drug testing. Such experiments are exploratory in
nature, in the same way that toddlers' investigations of their surroundings are
exploratory.
A second kind of experiment has, as its primary purpose, not exploration but the
measurement of a particular attribute. Such experiments are designed to answer
questions such as
What is the velocity of light?
How heavy is the earth?
How old is this rock?
What is the population of the UK?
This kind of experiment is a very important part of scientific investigation but we shall
not discuss such experiments at length in this unit.
A third kind of experiment, however, is both important and relevant to this block of the
course. This is the kind that tests a specific hypothesis. Perhaps the best way to explain
this kind of experiment is to give some examples.
Francis Bacon, a sixteenth-century philosopher, encouraged his contemporaries to carry out experiments of this type and, for this reason, you may find such exploratory experiments referred to as Baconian.
Example 1.1
Hypothesis The baby is crying because it is cold.
Prediction based on hypothesis If the baby is made warmer then its crying will stop.
Test Make baby warmer but make sure that you do not change anything else in its
environment.
Possible results and conclusions The baby continues crying, therefore the hypothesis is
wrong. The baby stops crying, therefore the hypothesis is supported.
Example 1.2
Hypothesis The television is not working because the fuse in the plug is broken.
Prediction based on hypothesis If the fuse is replaced with one that does work then the
television will work again.
Test Replace the fuse but do not alter anything else.
This ensures that the experiment has proper controls.
Since you cannot discount the possibility that it would have stopped anyway, you cannot say that the hypothesis is correct.
Possible results and conclusions The television still does not work, therefore the
hypothesis is wrong. The television works, therefore the hypothesis is supported.
Example 1.3
Hypothesis Food putrefies if left for too long because of the action of microbes (small
organisms) that are present in the air and that come into contact with it.
Prediction based on hypothesis If the microbes are prevented from acting on the food
then it will not putrefy.
Test Prevent the microbes from acting by killing them or otherwise preventing them
from acting (for example, by deep-freezing).
Possible results and conclusions The food putrefies, therefore the hypothesis is wrong.
The food does not putrefy, therefore the hypothesis is supported. ■
Example 1.4
Hypothesis Sound is transmitted through air because sound is a form of mechanical
vibration and air contains particles (molecules) that can jostle each other and so pass
the vibration from one particle to the next.
Prediction based on hypothesis If all the air is removed from a container so that a
vacuum remains then it should not be possible to transmit sound through that vacuum.
For example, it should not be possible to hear an electric bell ringing inside a container
from which the air has been removed (Figure 1.2).
Test Remove air from a container and discover whether an electric bell inside it
becomes inaudible.
Possible results and conclusions The bell can be heard through the vacuum, therefore
the hypothesis is wrong. The bell cannot be heard, therefore the hypothesis is
supported.
The first two of these examples are homely examples of hypothesis-testing experiments
of a kind that people routinely carry out in their daily lives. The latter two examples are
well-known, formal, scientific experiments. Example 1.3 was first carried out by Louis
Pasteur (1822-95), who successfully prevented microbes from attacking food by
filtering the air in which the food was kept, and by taking food to the tops of mountains
where the combination of a low temperature and relatively microbe-free air prevented
the food from decaying.
All four of these examples have the following important features in common. In each
there is a specific hypothesis about the cause of a phenomenon. Hypothesis-testing
experiments try to explain how something works. The experimenter makes predictions
that flow directly from the hypothesis: if the hypothesis is true then certain things must
follow. For example, if it is true that microbes are the sole cause of food decay then it
automatically follows that food should not decay if microbes are absent.
The experiment itself consists of testing whether the things that the hypothesis predicts
actually happen. If the result of the experiment is not what the hypothesis predicted
then the experimenter has to accept that the hypothesis is wrong. For example, if food
were to continue to decay, even when it was absolutely certain that no microbes were
present, then the hypothesis that microbes are the only cause of food decay would have
to be rejected.
Figure 1.2 Electric bell
It is important to note, however, that even if the prediction that the hypothesis makes
turns out to be correct then it may still be wrong to assume that the hypothesis itself is
perfectly correct. Take, for example, the hypothesis that malaria is caused by a
mosquito (more specifically, by a particular type of mosquito called Anopheles). A
prediction which follows from this hypothesis is that malaria should cease to occur in a
district from which Anopheles mosquitos have been eradicated. In fact, this is what
normally happens in practice, so it would seem reasonable to conclude that the
hypothesis is correct. However, although it is correct in one sense, it is not in another. In
a district from which the Anopheles mosquito has been eradicated, some people may
still continue to contract malaria, apparently spontaneously, long after the mosquitos
have gone. If you were ignorant of these cases of malaria then it would be legitimate to
believe in the original hypothesis that Anopheles mosquitos cause malaria. As soon as
these few instances come to light, the original hypothesis must be discarded and a new
one sought. In the case of malaria, the truth is that the mosquito is only a carrier for a
parasite, called Plasmodium, which is the direct cause of malaria. Plasmodium can
survive for long periods in the human body without giving rise to malaria. From time to
time, however, it invades the blood and malaria then develops.
To summarize, if the predictions that follow from a hypothesis turn out to be incorrect
then the hypothesis has to be abandoned or, at least, modified. If the predictions turn
out to be correct then the hypothesis is supported in the sense that it can be accepted
provisionally that the hypothesis is correct. There is always the possibility that, one
day, somebody may find that under certain conditions the predictions are incorrect.
When that happens, the hypothesis has to be replaced by a new one.
1.3 Experiments and statistics
Before going any further, it is important to consider how this description of scientific
hypothesis-testing experiments fits in with the ideas of statistical hypothesis-testing in
Block B. From the statistician's point of view, the above four examples of hypothesis-testing
experiments are framed in rather unconventional terms, because each
concentrates on a hypothesis of the form something affects something and tests
predictions flowing from that hypothesis. Nevertheless, it is perfectly possible to
formulate each of the above experiments in statistical terms with a null and an
alternative hypothesis and it is essential to do so if you want to apply statistical
hypothesis tests to data arising from such experiments.
The null hypothesis, as its name implies, very often postulates the absence of a given effect
or relationship. For example, in the clinical trial of a drug, the null hypothesis postulates
that its effect does not differ from that of the placebo. The alternative hypothesis proposes
that its effect does differ from that of the placebo. When we carry out a statistical
hypothesis test, we concentrate on the null hypothesis. We say
What is the probability of obtaining a result at least as extreme as that which
we have obtained, if we assume that the null hypothesis is true?
[Flowchart: HYPOTHESIS → TEST STATISTIC → CRITICAL VALUE → COMPARE]
See Section 2.2 of Unit B2.
If this probability is too low then we reject the null hypothesis in favour of the
alternative.
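For the z-test met in Block B, that probability can be computed directly from the standard normal distribution. The sketch below is illustrative; the value 1.96 is just an example test statistic:

```python
from math import erf, sqrt

def two_sided_p_value(z):
    """Probability, under the null hypothesis, of a standard normal
    test statistic at least as extreme as the observed z."""
    # Standard normal cumulative distribution via the error function
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - phi)

print(two_sided_p_value(1.96))  # close to the conventional 5% level
```

If this probability falls below the chosen significance level (5%, say), the null hypothesis is rejected in favour of the alternative.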
We shall now devise a null and an alternative hypothesis for two of the four
experiments. In each case we shall assume that the tests are the same as those described
earlier and state what conclusions we draw from the different results that could be
obtained.
Example 1.1 (continued)
Null hypothesis Cold has no effect on the baby's crying.
Alternative hypothesis Cold makes the baby cry.
Test Alter the temperature of the baby's surroundings and set up suitable controls.
Possible results and conclusions The baby's crying stops, therefore reject the null
hypothesis. The baby's crying continues, therefore the null hypothesis is supported.
Often, as with these examples, the null hypothesis in the statistical hypothesis test is the converse of the hypothesis in the scientific experiment; as a result, the reformulation is often not logically equivalent to the original experiment.
Example 1.2 (continued)
Null hypothesis The present condition of the fuse is not responsible for the television's
refusal to work.
Alternative hypothesis The television is not working because the fuse is broken.
Test Replace the fuse but do not alter anything else.
Possible results and conclusions The television works, therefore reject the null
hypothesis. The television still does not work, therefore the null hypothesis is
supported. ■
Activity 1.3 Express the experiments in Examples 1.3 and 1.4 as statistical hypothesis
tests.
Solution
Example 1.3
Null hypothesis Microbes do not make food putrefy.
Alternative hypothesis Microbes do make food putrefy.
Test Prevent microbes from acting on the food.
Possible results and conclusions The food does not putrefy, therefore reject the null
hypothesis. The food putrefies, therefore the null hypothesis is supported.
Example 1.4
Null hypothesis Sound is not transmitted by the jostling of air molecules.
Alternative hypothesis Sound is transmitted by the jostling of air molecules.
Test Ring a bell inside a vessel with all the air pumped out.
Possible results and conclusions The bell is inaudible, therefore reject the null
hypothesis. The bell can be heard, therefore the null hypothesis is supported.
One final point needs to be made about the relationship between hypothesis-testing
experiments and statistical hypothesis tests. A great deal of scientific experimentation
consists of testing specific hypotheses of the sort just described. If the experiments
require statistical analysis then the experimenter should use statistical hypothesis tests.
This will be the case if the experiments involve things that are intrinsically variable:
such as people, plants or animals. Sometimes, however, a scientist may not be so much
interested in testing whether or not a given treatment has an effect (maybe somebody
has already done an experiment which shows that it does) but rather in investigating
how big that effect is. An agriculturalist might want to know, for example, by how much
a new fertilizer increases the yield of a crop. In such instances the scientist conducting
the research project would use statistical estimation procedures (for example, finding a
confidence interval) rather than carrying out a hypothesis test.
Exercise 1.1
State whether each of the following is a scientific experiment.
(a) Measurement of the distance between the earth and the sun.
(b) Accelerating and braking less fiercely in a car to see whether petrol consumption is reduced.
(c) Reading a tea-taster's report on a brand of tea in order to decide whether to buy it or not.
(d) Investigating whether overweight is caused by overeating.
Solution on p. 43
Exercise 1.2
For each of the scientific experiments in Exercise 1.1, state which of the three kinds it is:
exploratory, measurement or hypothesis-testing.
Solution on p. 43
Exercise 1.3
For those scientific experiments in Exercise 1.1 which are hypothesis-testing experiments,
formulate them as statistical hypothesis tests.
Solution on p. 43
Hypothesis-testing experiments involve testing the deductions that can be made from a
hypothesis. For this reason, scientific experimentation is sometimes said to proceed by
the hypothetico-deductive method. There is currently considerable debate about
whether science does proceed by this method under all circumstances but this debate
lies outside the scope of this course. Many scientists feel that carrying out an
experiment is rather like riding a bicycle: if you think too hard about what you are
doing, you fall off! Accordingly, it is probably most profitable to leave such discussions
of the nature of scientific experimentation and to get down to the practicalities of
carrying out an experiment.
This description is frequently associated with the philosopher Karl Popper.
Summary
Scientific experiments are distinguished by two main features.
All stages of an experiment are open to public scrutiny and verification.
Experiments are designed to answer specific questions by making specific
measurements.
At least three kinds of experiment can be recognized: they are distinguished by the kind
of questions they attempt to answer. These are
exploratory (Baconian) experiments
measurement experiments
hypothesis-testing (hypothetico-deductive) experiments.
After completing your work on this section you should be able to
distinguish between experimental and non-experimental forms of inquiry
distinguish between the three kinds of experiments described above.
2 Carrying out an experiment
Now that you have read about some of the basic principles of experimentation, it is
worth trying to put these principles into practice by setting up and carrying out your
own experiment. You should read the whole of this section before you start the
experiment. The experiment is concerned with the growth of mustard (or cress)
seedlings.
You should set up the experiment and then leave the seedlings to grow for 4 or 5 days,
checking them from time to time and pruning some of them after 2 or 3 days. You will
then need to take measurements from them; by this time you should have read Section
3 and so be ready to analyse your results statistically. Provided that you have already
collected all the necessary items it should not take you more than about one-and-a-half
hours to set up the experiment. Here is the list of items that you need.
Items needed for the experiment
Two identical plastic containers such as small flower pots or empty cartons, e.g. of
margarine. These need to be at least 2⅜ inches (6 cm) in diameter at the open end or,
if rectangular, need to have sides at least 2 inches (5 cm) long.
Two large containers (ice cream tubs, sandwich boxes, small buckets or basins)
which will hold a pint of water without leaking, and which can each hold one of the
plastic containers mentioned above. These larger containers will need to stand
side-by-side in a well-illuminated position, e.g. on a window ledge.
Mustard is better because it
grows faster but cress will do
perfectly well.
The details of these activities
are in Sections 2.2-2.5.
Figure 2.1 Pint of water in each large container; holes in the base of each small container
One small packet of mustard seeds. (If these are difficult to obtain then use cress
seeds.)
One piece of aluminium kitchen foil about 30 cm × 30 cm (12 inches × 12 inches).
One piece of clear kitchen foil (not cling-film) about 30 cm × 30 cm (12 inches × 12
inches).
Some Andrex, Dixcel or other superior toilet tissue.
Two sheets of high quality (typing or writing) paper, preferably coloured so that
the seedlings show up against the paper.
Four elastic bands to go round the tops of the flower pots or margarine or yoghurt
cartons.
A milk bottle or other pint container.
A pair of dividers or a piece of plasticine (or a similar material) plus two pins.
A ruler marked in millimetres.
Two teaspoons.
A pair of fairly small, sharp-pointed scissors.
A magnifying glass would be helpful but is not essential.
2.1 Purpose of the experiment
The purpose of the experiment is to discover whether light affects the growth of plant
roots. Most people are familiar with the fact that light is important to plants and realize
that it affects the growth of the stems and leaves. If you keep a potted plant on a window
sill, for example, and never turn it round then it is likely to grow quite noticeably
towards the light. But what effect, if any, does light have on the growth of the roots?
Normally, of course, the roots are underground and so are not exposed to the light but
it is quite conceivable that the roots could become exposed as they grew; for example, if
the plant was growing on an irregular surface or if rain washed some of the soil away.
Under such circumstances the plant would probably not benefit from the root
continuing to grow out into the open. Roots give plants mechanical support and they
absorb water and nutrients from the surrounding soil. They can do none of these things
if they are growing in the open air. It seems a plausible hypothesis, therefore, that light
suppresses root growth so that the roots will tend not to grow in the open air.
It is also possible to argue
plausibly that the roots will
tend to grow longer in the
open air.
The experiment which you are asked to carry out is on mustard seedlings. Thus the
precise question to be answered is
Does the light affect the root growth of mustard seedlings?
The principle of the experiment is simple. You are asked to grow two groups of mustard
seedlings, one entirely in the dark and the other entirely in the light, but otherwise in as
near identical conditions as possible. After some time has elapsed you should measure
the lengths of the roots of the seedlings and compare the two groups. As in any
experiment, it is essential to control various factors.
Activity 2.1 List some factors which need to be controlled and suggest how such
control might in each case be achieved.
Solution The seedlings need to grow under conditions that are as similar as possible
except for the presence or absence of light. It is especially important that the seedlings in
the two groups are grown at the same temperature and humidity, and that they are
equally spaced from each other, so that they do not suffer from unequal crowding
effects. By growing the two sets of seedlings side-by-side in adjacent pots it should be
possible to control for temperature. By carefully spacing the seeds in the pots it should
be possible to avoid unequal crowding effects.
Darkness will be achieved for one set of seedlings by covering the pot in which they are
growing with aluminium kitchen foil. As well as cutting out the light, this is likely to
have the effect of raising the humidity in the enclosed air space. It is important that the
seedlings grown in the light should experience similar humidity and this can be
achieved by covering their pot as well. The simplest solution is to cover the pot with
transparent foil. However, aluminium might conduct heat differently from transparent
foil so this might introduce an unwanted difference into the experiment. The difference
in temperature conditions induced by the two different coverings will probably be
small, especially if the room temperature is fairly constant during the experiment. It is
up to you whether you try to devise ingenious ways of reducing this source of error but
it is not worth spending too long looking for ways of improving this particular aspect of
the experiment.
There is another, quite subtle, factor that needs to be controlled. Light will affect the
stems and leaves of the seedlings and the stems and leaves of the two groups of seedlings
will differ quite strikingly in their appearance as a result. It is possible that these
changes in the leaves and stems themselves affect the roots. Perhaps a big stem
stimulates the growth of a big root, whereas a small stem does not. If this is so then it
will not be certain whether any differences which emerge are directly due to the effect of
light on the roots, or whether they are due to the effect of the light on the leaves and stems,
which in turn affect the growth of the roots. To control for this you should cut off the
growing shoots from the seedlings. ■
This experiment should provide an impressive set of data. Before going on to analyse
your measurements you will need to consider the hypothesis being tested (as in Section
1).
Activity 2.2 State the null hypothesis that is being tested by this experiment. State the
prediction that follows if the null hypothesis is correct. State the conclusions that you
can draw from the different results which you might get.
Solution
Null hypothesis Light has no effect on root growth.
Prediction based on null hypothesis The seedlings grown in the light will not differ in
average root length from the seedlings grown in the dark.
Possible results and conclusions If there is a difference in average root lengths between
the two groups of seedlings then the null hypothesis must be rejected and the alternative
hypothesis, that light does influence root growth, should be accepted. If there is no
difference then the null hypothesis is supported.
We suggest that you do not
treat all the seedlings in this
way. This will allow you to
see how long the roots grow
on the seedlings which are not
cut (details in Section 2.4).
If you were to discover, however, that the difference appears only in seedlings whose
stems have not been cut then you should suspect that light does not directly affect
root growth but does so indirectly through its effect on the leaves and stem.
2.2 Setting up the experiment
Here are the instructions for setting up the experiment.
1 Empty all the seeds (i.e. at least 40) into a basin containing cold tap water and leave
them to soak for an hour.
Seeds remain dormant while
they are dry; soaking them
ensures that all the seeds start
germinating at the same
moment and that they all get
off to a good start.
2 Make sure that the two large and the two small containers which you are going to
use are thoroughly clean, and thoroughly rinsed free of detergent, soap or any other
contaminant.
3 Put a pint of tap water into each of the large containers.
4 Take the two small containers and if they do not already have holes in the bottom
then cut one or two so that the water can easily seep up through the holes (see Figure
2.1).
5 Crumple several lengths of tissue paper and press them firmly into each small
container. Continue until both containers are just over-full. The paper should be firm
but not jammed solid.
6 Fold several lengths of tissue paper to make a smooth platform, then lay a sheet of
high quality (typing or writing) paper on top. Ensure that this platform touches the
crumpled tissue paper. Fix this platform of paper over the top of one of the small
containers with an elastic band (see Figure 2.2). Repeat for the other small container.
Figure 2.2
Typing or writing paper
Folded
tissue
Paper
\
Elastic band
/
Crumpled tissue paper
(absorbs water)
7
Stand each small container in a large container (see Figure 2.1).
8 By the time that the seeds have been soaked for one hour (see Instruction 1) the
paper in each small container should be thoroughly damp. If it is not then gently spoon
over the paper surface some of the water from the big container in which the small
container stands.
9 Reject any seeds that seem unusual: for example, those that are a different colour
from the rest and those that float rather than sink in the water.
10 Mix the seeds well and then allocate them alternately to two groups until 20 have
been allocated to each.
11 Arrange the two groups of 20 seeds as follows: one group on the writing paper
surface over each small container. A teaspoon may help you to manoeuvre the seeds
into position. Arrange each group of seeds in four rows of five, each row being at least
1 cm (half an inch) away from the next (see Figure 2.3).
The further apart you can put
them, provided that they are
spaced regularly, the better.
Figure 2.3
12 Toss a coin to select randomly the pot whose seeds will grow in the dark; then
make a hood out of the aluminium foil and wrap it over the top of the chosen pot,
leaving a space of at least 5 cm (2 inches) between the seeds and the top of the foil (see
Figure 2.4). Secure the foil with an elastic band. Repeat the hood-making procedure,
using the clear foil, for the second pot.
Figure 2.4
13 Place the two sets of containers side-by-side in a well-lit spot safe from disturbance
by children, pets or curious passers-by. Try to ensure that the temperatures of the two
groups are the same.
This completes the setting-up procedure.
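The allocation in Instructions 10 and 12 (mixing the seeds, allocating them alternately, and the coin toss) can be sketched in code. This is purely an illustration of the randomization idea, not part of the unit; the function name and the use of Python's random module are my own choices.

```python
import random

def allocate(seeds, group_size=20, rng_seed=None):
    """Mix the seeds, allocate them alternately to two groups,
    then 'toss a coin' to choose which group grows in the dark."""
    rng = random.Random(rng_seed)
    seeds = list(seeds)
    rng.shuffle(seeds)                      # Instruction 10: mix the seeds well
    group_a = seeds[0:2 * group_size:2]     # alternate allocation ...
    group_b = seeds[1:2 * group_size:2]     # ... to two groups of 20
    dark_group = rng.choice(["A", "B"])     # Instruction 12: the coin toss
    return group_a, group_b, dark_group
```

Alternation after thorough mixing, plus the final coin toss, guards against any systematic difference creeping in between the two pots.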
2.3 Maintaining the experiment
You also have to undertake a few tasks during the days that follow. Exactly how fast the
seedlings develop will depend on the temperature but you can expect to cut the stems
off some of the seedlings (see below) after about 2 to 5 days and to measure the root
lengths after 3 to 7 days. The maintenance tasks are as follows.
1 To control for any dissimilarity in the temperature of the two groups of seedlings,
swap the positions of the two large containers each day.
2 Make sure that there is still plenty of water in the large containers. If they begin to
dry out, add another pint of water to both of the large containers.
3 Check the seeds which you can see (i.e. those growing in the light) once a day to see
how rapidly the stems and roots are developing.
2.4 Cutting the stems
When most of the seedlings growing in the light have grown root hairs and have stems
which are a little over 1 cm (about half an inch) long, cut the stems off 10 seedlings in
each pot (i.e. 20 seedlings in all). Any seedling which has not grown sufficiently for its
stem to be cut should be left uncut.
Figure 2.5 shows you where to cut. You will see on most of the roots a white, fluffy area
which is due to the presence of a multitude of very fine hairs, called root hairs. It is
primarily through these hairs that the root absorbs water and nutrients. Use the top of
the root, where the root hairs stop, as a reference point and cut across the stem about
3 mm (⅛ inch) above this point. Cuts should be made, without moving the seedlings,
using small, sharp-pointed scissors. Any seedling which has not produced any root
hairs should be left uncut.
When you have finished cutting the stems, replace the aluminium and clear foils on the
pots they were previously on, taking care not to squash the seedlings, and return them
to their positions (in the large containers) by the window. If the cut stems start to grow
rapidly then repeat the cutting process a few days later.
Figure 2.5 (labels: Developing leaves; Root hairs; Cut here with small scissors)
2.5 Completing the experiment
Leave the pots until the roots of most of the seedlings in the light have grown to at least
1 cm (about half an inch) long.
To measure the root lengths you will need either a pair of dividers, a pair of compasses,
or a piece of plasticine and two pins, or needles, assembled as shown in Figure 2.6. Start
with one pot of seedlings and working along each row in turn, carefully lift each seedling
off the paper, taking great care to ensure that the tip of the root is not left behind. Lay
the plant down on a flat and, preferably, dark-coloured surface and straighten the root
as far as possible. Measure from the tip of the root to the position near the top of the
root where the root hairs stop (see Figure 2.7). You should do this by putting one point
of the dividers etc. against the root tip and then moving the other until it lies opposite
the position where the root hairs stop. Then measure the distance between the two
points of the dividers etc. with the ruler. Measure it in millimetres (to the nearest whole
millimetre).
Some seeds may not
germinate at all and these
should be ignored at this
stage.
This is the trickiest part of the
experiment.
You may find it difficult to
measure the root lengths in
this way: if you do then try
measuring the roots directly
with the ruler instead.
Figure 2.6 Embed heads of pins, or
needles, in plasticine about 1 cm apart
Figure 2.7 Measure this length
Write the measurement down in the appropriate place in Figure 2.8. For each seedling,
indicate very clearly whether or not you have previously cut its stem. Indicate also any
seeds that failed to germinate by putting a cross in the appropriate box. Measure all the
seedlings in one pot in this way and then repeat the procedure for the second pot.
Figure 2.8 Lengths of roots (mm)
Light
Dark
2.6 Variability and systematic errors
Whatever differences might exist between your two groups of seedlings, it is likely that
they will not be very large. If they were large then there would be no need to use a
hypothesis test to analyse the data. One of the reasons why any differences which might
exist may not be obvious, despite all the controls which you have used, is that the
seedlings themselves are variable. You may try very hard to ensure that all the seedlings
grow in the same conditions but nevertheless there will be some small variations which
affect the plants. More importantly, even if they are grown under identical conditions,
different plants will still grow differently.
Variability was discussed in
Section 2.3 of Unit C1 - it is
similar to sampling error (see
Section 4.1 of Unit A3).
We have explained that you might find some difficulty in measuring accurately the
lengths of the roots. Because you cannot straighten the roots perfectly, it is possible that
you will consistently underestimate their length. This kind of error, which is consistent
in its direction and approximately constant in its magnitude, is called systematic error.
Most scientific experiments are subject to both variability and systematic errors. The
task of the scientist (i.e. you) in carrying out an experiment is to discover, despite the
presence of variability, bias and systematic errors, whether genuine differences exist
between the two groups.
One major source of variability which, as we have explained, you are very likely to
experience is that some of your seeds may not germinate at all. When, later in this unit,
you calculate the means of the root lengths for your two groups of seedlings, should
these non-starters be included? There is no hard-and-fast rule as to how to deal with a
problem like this but it would seem sensible in this experiment to exclude them. Thus
you will need to answer the following, even more precise, question.
Does light affect the root growth of those mustard seedlings which germinate?
Summary
In this section we have described how you should set up, maintain and complete a
scientific experiment to investigate the growth of mustard seedlings. This led to a
discussion of the problems of variability and systematic error which the experiment
illustrates. We also considered how you should analyse the data which you will get from
this experiment and formulated a statistical hypothesis which you will be able to
test using this data. This leads to the question
Which hypothesis test should be used to test this hypothesis?
We shall answer this question in the next section but now you should set up the
experiment as described in Section 2.2. While you are maintaining it as described in
Section 2.3, you should work through Section 3 to learn how to analyse the data which
you will collect when the experiment is completed (see Section 2.5).
3 The t-test
You now need to decide how to analyse the data from the mustard-seed experiment. In
this section we shall first consider this decision in the context of the hypothesis tests
introduced in Block B, and then we shall introduce a further test which is particularly
useful for data like this.
3.1 Which test?
In Block B we introduced several hypothesis tests, including the χ² test, the Mann-Whitney test, the Wilcoxon matched-pairs test and the z-test, and in Unit B5 we
summarized how to decide which test is suitable for analysing a particular sample of
data. We shall now apply these principles to the data which you will obtain from your
mustard seedlings.
Activity 3.1
(a) Will the data from your seedlings be from categorical, ordinal or interval scale
measurements, i.e. will it be categorical, ordinal or interval data?
(b) Are the seeds in the two samples matched in pairs or are they unmatched?
(c) Of the tests which you have already met in the course, which would be appropriate
for analysing your seedling data?
Solution
(a) They will be on an interval scale. Not only is it possible to say that a length of, say,
30 mm is longer than a length of 20 mm: you can also say how much longer it is.
(b) There is no obvious method for matching the seeds from the two samples in pairs
so they are not matched. Thus the samples are unrelated.
(c) You have come across several different tests which can be applied to two unrelated
samples of data measured on an interval scale. Remember that tests which apply to
categorical and ordinal data can also be applied to interval data. Thus the χ² test or
Mood's median test could be applied, if you were to construct a contingency table from
the data by dividing the measurements into categories. However, both these tests need
fairly large samples and, unless almost all your seeds germinate, it is likely that your
samples will not be large enough.
The Mann-Whitney test can also be applied to your data. Although it needs only
ordinal data (because it takes account only of the order of the sample values), it can of
course be applied to interval data as well. It also works well on small samples.
The two-sample z-test can be used on unrelated samples of data measured on an
interval scale but, like the χ² test, it requires samples larger than those which you have.
Thus the Mann-Whitney test is the most appropriate one for your data.
You could go ahead and carry out a Mann-Whitney test but think for a moment about
what it involves. The test statistic is found by taking each value in one sample and
counting up how many values in the other sample are less than it. It does not matter how
much less they are: that is precisely why it can be applied to ordinal measurements.
Now, you will have interval data: in comparing two sample values, you know not only
which one is the larger but also how much larger it is than the other. Yet the Mann-Whitney test does not use this information. This raises two questions.
Does this matter?
Is there a test which can be used on small samples and does use this
information?
The answer to the first question is somewhat complicated (we shall return to it in
Section 4) but the second is easier. There is such a test but it can be applied only to data
from populations satisfying certain distributional conditions.
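The counting procedure just described is simple to write down directly. A minimal sketch in Python (the function and the example lengths are hypothetical; a full Mann-Whitney test would also need a table of critical values):

```python
def mann_whitney_u(xs, ys):
    """For each value in one sample, count how many values in the
    other sample are less than it; ties count one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if y < x:
                u += 1.0
            elif y == x:
                u += 0.5
    return u

# Hypothetical root lengths (mm): only the ordering of the values matters
print(mann_whitney_u([12, 15, 20], [10, 14, 16]))   # -> 6.0
```

Notice that replacing the 20 by 200 would leave the count unchanged: that is exactly the interval-scale information which the Mann-Whitney test ignores.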
See Section 4.5 of Unit C1.
Amongst the tests introduced in Block B, the one which comes closest to what is needed
is probably the two-sample z-test. It can be used to compare two unrelated samples of
interval data. Thus the only limitation which prevents you using it for your seedling
data is that the z-test can be used only for large samples. It is worth looking again at the
rationale behind the z-test to see how this limitation arises. We shall then show how,
and under what conditions, it can be modified to give the test which you require.
3.2 The z-test reconsidered
Suppose that you have unrelated samples of data from two populations, whose
population means are μA and μB, and that you want to use the two-sample z-test to test
the following null hypothesis H0 against the two-sided alternative hypothesis H1.
H0: μA − μB = 0
H1: μA − μB ≠ 0.
The rationale of the two-sided z-test is as follows.
1 If the null hypothesis were true then the means μA and μB of the two populations
would be identical so, on average, you would expect the means from two samples, one
taken from each population, to be the same. In other words, on average you would
expect the difference between the means of the two samples to be 0. If the means of the
two samples are denoted by mA and mB, respectively, then, on average, you would
expect that mA − mB = 0.
2 If the null hypothesis were true and you took repeated pairs of samples from two
populations like this then you would find that sometimes mA was a bit bigger than mB
and that sometimes mB was a bit bigger than mA. Thus mA − mB would sometimes be a
bit bigger than 0 and sometimes be a bit smaller. In other words, mA − mB has a
sampling distribution. In Unit B4 we explained that, provided the samples are large
enough, this sampling distribution of mA − mB is approximately Normal with a mean
value of 0. The standard deviation of this sampling distribution is called the standard
error (SE). The value of this standard error depends on the two population standard
deviations and also on the two sample sizes.
3 The test statistic z is calculated as follows
z = (mA − mB)/SE.
If the null hypothesis were true then the sampling distribution of the test statistic z
would be the standard Normal distribution.
4 Unfortunately the standard error is not usually known because the populations
from which the samples are taken are inaccessible and so the population standard
deviations are unknown. In Unit B4 we calculated the following estimate of the
standard error from the data being analysed
estimated SE = √(sA²/nA + sB²/nB)
For example, neither you nor
anyone else has access to the
entire population of mustard
seedlings. We therefore have
to estimate the standard error
(SE).
where sA² and sB² are the variances of the two samples and nA and nB are their sizes.
5 So the test statistic z is calculated as follows
z = (mA − mB)/√(sA²/nA + sB²/nB).
6 The null hypothesis is rejected whenever z is far enough away from zero. For
example, using a two-sided test at the 5% significance level, the null hypothesis is
rejected if
either z > 1.96 or z < −1.96.
The figure 1.96 is the critical
value.
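Stages 1 to 6 can be collected into one short routine. A sketch only (assuming, as the unit does, sample variances computed with divisor n; like the z-test itself, it is only suitable for large samples):

```python
from math import sqrt

def two_sample_z(xs, ys):
    """Two-sample z-test statistic with the estimated standard error;
    returns z and the two-sided decision at the 5% level."""
    n_a, n_b = len(xs), len(ys)
    m_a, m_b = sum(xs) / n_a, sum(ys) / n_b
    var_a = sum((x - m_a) ** 2 for x in xs) / n_a   # sample variance, divisor n
    var_b = sum((y - m_b) ** 2 for y in ys) / n_b
    est_se = sqrt(var_a / n_a + var_b / n_b)        # estimated standard error
    z = (m_a - m_b) / est_se
    return z, abs(z) > 1.96                          # 1.96 is the critical value
```

Identical samples give z = 0 and no rejection, as the rationale in Stage 1 would lead you to expect.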
For large samples the procedure in Stages 5 and 6 above works perfectly well but for
small samples, particularly those involving fewer than 25 values, two problems arise.
As we mentioned in Unit B4, the sampling distribution of mA − mB is Normal only
for reasonably large sample sizes (at least 25). Thus the z-test does not necessarily
work for small samples because the sampling distribution of mA − mB may not then
be Normal. This problem does not arise if the two populations in question
themselves have Normal distributions: in this case the sampling distribution of
mA − mB is Normal for all sample sizes, however large or small. Thus, if it is
reasonable to assume that the populations have Normal distributions then this
difficulty is removed.
The second problem is in the estimate of the standard error, which is
√(sA²/nA + sB²/nB).
To be more precise, this is the sample estimate of the standard deviation of the
sampling distribution of the difference between the sample means: mA − mB. It is
used because the actual value of this standard deviation (i.e. the standard error) is
not known. For large sample sizes, this estimate is likely to be very close to the true
standard error but, even for large samples, it is a number calculated from a sample
so it is not the same for all possible samples. Because of this variability between
samples, the test statistic
(mA − mB)/√(sA²/nA + sB²/nB)
has a distribution different from the standard Normal distribution which you
would get if you divided mA − mB by its actual standard deviation (i.e. the actual
standard error) rather than this sample estimate thereof.
If the sample sizes nA and nB are large then
√(sA²/nA + sB²/nB)
will vary very little from one sample to another. Its distribution will have a very
small spread and so
(mA − mB)/√(sA²/nA + sB²/nB)
will have a distribution which is very close to the standard Normal distribution. If
either nA or nB is small then this distribution will not be close to the standard
Normal distribution: it will be more spread out than the standard Normal
distribution, i.e. it will tend to have fewer values close to zero and more values away
from zero.
This has the following consequence: if you wish to compare small samples then,
even if you think that the populations have Normal distributions, you cannot use
the z-test.
However, under certain circumstances another test is available: it is called Student's t-test or, simply, the t-test.
We are still assuming that the
null hypothesis (μA − μB = 0)
is true.
Student investigated what distribution of values you would get for an expression like
(mA − mB)/√(sA²/nA + sB²/nB)
if you took repeated pairs of small samples from two populations whose distributions
are Normal with identical means and variances.
Since the distribution is not the standard Normal distribution, the letter z can no longer
be used to represent such a test statistic when it is calculated from small samples, so
Student used the letter t.
It is called Student's t-test
after a famous statistician, W.
S. Gosset (1876-1937), who
published the results of his
mathematical research in this
area in 1908 under the
pen-name Student. He worked for
the brewing company Guinness
and was required by the
conditions of his employment
to remain anonymous.
Using the results of his work it is possible to carry out a hypothesis test very similar to
the z-test provided certain quite stringent conditions are satisfied.
3.3 The two-sample t-test
You might be wondering why we have chosen to introduce yet another hypothesis test
for comparing two unrelated samples. There are two reasons.
The test in question is very widely used in all kinds of applications of statistics.
After you finish this course you will probably meet many statistical techniques
which have not been described in the course. So we want to show you how a
hypothesis test (which you might well find in a textbook) can be related to the
principles of hypothesis testing which you have met in the course, and also to
illustrate how to do this.
If you were to look in a statistics reference book to find a hypothesis test to compare
two small unrelated (i.e. unpaired) samples of measurements on an interval scale, then
you might well find something like this.
Two-sample t-test: summary
Applies to unrelated samples of data measured on an interval scale from two
populations, each of which is assumed to have a Normal distribution. Also, the
standard deviations of the two populations are assumed to be equal.
Denoting the means of the two populations by μX and μY, the null and alternative
hypotheses for the two-sided test are:
H0: μX = μY
H1: μX ≠ μY.
The test is carried out as follows.
1 Calculate the sample means mX and mY of the two samples.
2 Calculate an estimate S² of the common population variance as follows.
Denote the data in the two samples by x and y respectively, and denote the
corresponding sample sizes by nX and nY; then
S² = (Σ(x − mX)² + Σ(y − mY)²)/(nX − 1 + nY − 1).
3 Calculate the test statistic
t = (mX − mY)/√(S²/nX + S²/nY).
4 Look up the critical value tc for sample sizes nX and nY: this is the value
corresponding to nX − 1 + nY − 1 (the number of degrees of freedom).
5 For the two-sided test, reject H0 in favour of H1 if
either t > tc or t < −tc.
Otherwise the conclusion is that there is insufficient evidence to reject H0.
The table in Figure 3.1 lists
the critical values for this test
at the 5% significance level
when the number of degrees of
freedom, i.e. nX − 1 + nY − 1,
is 1, 2, 3, ... up to 40.
Figure 3.1 Critical values for Student's t-test (two-sided)
Significance level: 5%
(table columns: Degrees of freedom; Critical value (tc) — listed for 1 to 40 degrees of freedom)
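Stages 1 to 3 of the boxed procedure translate directly into code. A sketch of the calculation only (the function name is my own; the comparison with the critical value tc from Figure 3.1 is still done by hand):

```python
from math import sqrt

def two_sample_t(xs, ys):
    """Pooled two-sample t statistic and its degrees of freedom
    (Stages 1-3 of the summary box)."""
    n_x, n_y = len(xs), len(ys)
    m_x, m_y = sum(xs) / n_x, sum(ys) / n_y        # Stage 1: sample means
    ss_x = sum((x - m_x) ** 2 for x in xs)         # sums of squared residuals
    ss_y = sum((y - m_y) ** 2 for y in ys)
    df = n_x - 1 + n_y - 1                         # degrees of freedom
    s2 = (ss_x + ss_y) / df                        # Stage 2: common variance
    t = (m_x - m_y) / sqrt(s2 / n_x + s2 / n_y)    # Stage 3: test statistic
    return t, df
```

For hypothetical samples [1, 2, 3] and [4, 5, 6] this gives t ≈ −3.674 with 4 degrees of freedom.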
This procedure is fairly similar to that for the two-sample z-test but there are several
important differences.
The critical value for the t-test depends on the sizes of the samples: more precisely it
depends on the number of degrees of freedom.
The t-test can be applied only if it is reasonable to assume that the populations
involved have Normal distributions.
The t-test can be applied only if it is reasonable to assume that the population
standard deviations are equal. If the population standard deviations are, as we
have assumed, equal then the population variances are also equal (i.e. there is a
common population variance); this is why, in Stage 2, S² is an estimate of this
common population variance.
This last difference is perhaps the most important so, before looking at some examples
of the use of this t-test, we shall show you how to calculate S².
The formula for calculating S² in Stage 2 tells you to
• add up the squared residuals of the measurements in the first sample (x) from their
mean mX to get Σ(x − mX)²
• do the same for the second sample to get Σ(y − mY)²
• add these two sums together and divide by the sum of the sample sizes each minus
one, i.e. by the number of degrees of freedom: nX − 1 + nY − 1.
Example 3.1
In an agricultural experiment to investigate the effect of different diets
on the weights of calves, eight calves were allocated to two groups which were fed
different diets, X and Y. Both diets consisted of milk, hay and manufactured
concentrates; the difference between them was that the concentrates in diet Y were
different from those in diet X.
Unfortunately, one of the calves (on diet Y) suffered from a disease which prevented
proper digestion so did not eat very much. That calf was therefore excluded from
consideration when assessing the effects of the two diets.
The calves were kept in similar conditions and the food intake and weight of each calf
was monitored from birth. The allocation of the calves to the two groups was designed
to control for birth-weight but was otherwise random. Figure 3.2 contains some of the
data collected in this experiment.
This procedure is somewhat
similar to, but more
complicated than, that used to
calculate the sample variance
of a single sample, which is:
variance = s² = Σ(x − m)²/n
(see Section 4.2 of Unit A2).
Figure 3.2 Average daily weight-gain over five
weeks from birth-date (kg per day)
We shall use the two-sample t-test to investigate whether there is a difference between
the population means, μX and μY, of the daily weight-gains of calves fed on the two diets.
For measurements such as these it is reasonable to assume that the populations have
Normal distributions and that their standard deviations (and hence variances) are
equal.
To use the t-test we must first calculate the estimate S² of this common population
variance. Here is one way of going about this calculation: we first calculate
Σ(x − mX)² and Σ(y − mY)².
For the sample fed on diet X
Rather than continuing with this calculation, let us pause for a moment and consider
the method of calculation. As with the calculation of the standard deviation, the
method we have just used is not the best method to calculate Σ(x − mX)². It is much
quicker to start by calculating Σx² and Σx (and hence mX), as we recommended for the
standard deviation in Section 4.2 of Unit A2. Then, to calculate Σ(x − mX)² we can use
the fact that
Σ(x − mX)² = Σx² − nX·mX².
This is because
sX² = Σx²/nX − mX²
and
Σ(x − mX)² = nX·sX².
Thus
Σ(x − mX)² = nX(Σx²/nX − mX²)
= Σx² − nX·mX².
Thus it is this second formula which is the most convenient to use. Using this method, as
in the similar method for the calculation of the standard deviation, only two columns
are required and the mean is calculated at the same time.
Do not worry if you cannot
follow the algebraic proof: it
is closely related to Section 4.4
of Unit A2.
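The identity can also be verified numerically for any sample; a quick check in Python (the data values here are hypothetical):

```python
def ss_direct(xs):
    """Sum of squared residuals computed straight from the definition."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def ss_shortcut(xs):
    """The same quantity via the shortcut: sum of x^2 minus n times m^2."""
    n = len(xs)
    m = sum(xs) / n
    return sum(x * x for x in xs) - n * m * m

data = [0.48, 0.59, 0.41, 0.57]   # hypothetical daily weight-gains (kg)
print(abs(ss_direct(data) - ss_shortcut(data)) < 1e-9)   # -> True
```

The two routes agree up to rounding error, which is why the shortcut can safely replace the direct calculation.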
So we shall now do the whole calculation for S² using this method.
Thus mX = 2.05/4 = 0.5125.
Thus mY = 2.03/3 = 0.6766667.
Thus Σ(x − mX)² = Σx² − nX·mX²
= 1.0625 − 4 × (0.5125)²
= 0.011875.
Thus Σ(y − mY)² = Σy² − nY·mY²
= 1.3769 − 3 × (0.6766667)²
= 0.0032667.
Thus S² = (Σ(x − mX)² + Σ(y − mY)²)/(nX − 1 + nY − 1)
= 0.0151417/5
= 0.0030283.
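The arithmetic just carried out can be checked by machine from the summary figures the example uses (Σx = 2.05, Σx² = 1.0625, nX = 4; Σy = 2.03, Σy² = 1.3769, nY = 3):

```python
sum_x, sum_x2, n_x = 2.05, 1.0625, 4
sum_y, sum_y2, n_y = 2.03, 1.3769, 3

m_x = sum_x / n_x                          # 0.5125
m_y = sum_y / n_y                          # 0.6766667 (to 7 decimal places)
ss_x = sum_x2 - n_x * m_x ** 2             # 0.011875
ss_y = sum_y2 - n_y * m_y ** 2             # 0.0032667 (to 7 decimal places)
s2 = (ss_x + ss_y) / (n_x - 1 + n_y - 1)   # divide by the degrees of freedom
print(round(s2, 7))                        # -> 0.0030283
```

This reproduces the value of S² obtained by hand above.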
Activity 3.2 Here are some pairs of samples (labelled X and Y). In each case the
samples come from two populations (also labelled X and Y) whose means μX and μY are
to be compared. The populations can all be assumed to have Normal distributions and
each pair of populations can be assumed to have a common standard deviation. For
each pair, calculate from the samples the estimate S² of the common variance of the two
populations.
(a) Two groups of children are asked to solve a simple puzzle in which they have to
manoeuvre a ball around an obstacle course and into a hole. One group, X, of children
have seen the obstacle course before but have not been told how to negotiate it. The other
group, Y, of children have not seen the obstacle course but have been told in advance how
to negotiate it. The length of time (in seconds) taken by each child to manoeuvre the ball
round the course is shown in Figure 3.3.
Figure 3.3
Group X
Group Y
(b) An agricultural experiment to determine the best method of applying fertilizer to
fields growing winter wheat produced the data in Figure 3.4. The fields in Group X had
some fertilizer applied at one stage of growth of the wheat and some more later whereas
the fields in Group Y had the same total amount of fertilizer all applied at the later
stage. The yield of grain in kg per hectare was recorded for each field in the experiment.
Figure 3.4
Group X
This method of calculation is
summarized on p. 29.
Group Y
It is important not to cut this
figure as it will be used in
further calculations.
Solution
(a)
Thus mX = 25/5 = 5.
Thus Σ(x − mX)² = Σx² − nX·mX²
= 151 − 5 × (5)²
= 151 − 125
= 26.
Thus mY = 35/5 = 7.
Thus Σ(y − mY)² = Σy² − nY·mY²
= 283 − 5 × (7)²
= 283 − 245
= 38.
Thus S² = (Σ(x − mX)² + Σ(y − mY)²)/(nX − 1 + nY − 1)
= (26 + 38)/(5 − 1 + 5 − 1)
= 64/8
= 8.
(b)
Thus mX = 21.60/3 = 7.2.
Thus mY = 31.49/5 = 6.298.
Thus Σ(x − mX)² = Σx² − nX·mX² = 0.3912.
Thus Σ(y − mY)² = Σy² − nY·mY² = 0.91468.
Thus S² = (0.3912 + 0.91468)/(3 − 1 + 5 − 1)
= 1.30588/6
= 0.2176467.
The next stage (Stage 3) in the t-test procedure is exactly the same as in the z-test: the
test statistic is
t = (mX − mY)/√(S²/nX + S²/nY).
Example 3.1 (continued) For the calves experiment
t = (0.5125 − 0.6766667)/√(0.0030283/4 + 0.0030283/3)
= −0.1641667/0.0420298.
We cut this to 3 decimal places to get t = −3.905.
Now, if the null hypothesis (H0: μX − μY = 0) is true then we would expect t to be
close to zero. Is −3.905 close enough to zero, or should the null hypothesis be rejected
in favour of the alternative hypothesis, that the mean weight-gains differ? This stage of
the t-test is much more similar to the Mann-Whitney test than it is to the z-test, in that
the answer depends on the sample sizes. This is because the shape of the sampling
distribution of the test statistic t varies with the sample sizes. It is therefore necessary to
consult a table of critical values.
The table of critical values is in Figure 3.1. One noticeable feature of this table is that the
first column is headed Degrees of freedom. You have to use the number of degrees of
freedom rather than the sample sizes themselves when looking up the critical value. For a
two-sample t-test with sample sizes nX and nY the number of degrees of freedom is
nX − 1 + nY − 1 (the same as the denominator in the formula for S²).
If you look at Figure 3.6 you will see that when the number of degrees of freedom is small,
the corresponding curve is low in the middle and high (i.e. further from the horizontal
axis) at the extremes; as the number of degrees of freedom increases, the curves get nearer
the axis at the extremes. Thus the probability of getting a large (i.e. extreme) value of t
(either negative or positive) is bigger when the number of degrees of freedom is small than
when it is large.
Example 3.1 (continued) Here nX = 4 and nY = 3. Therefore the number of degrees of
freedom is 4 − 1 + 3 − 1. This equals 5, thus the row of the table to consult is the one
with 5 in the Degrees of freedom column. This gives the critical value tc = 2.571.
This means that if the null hypothesis is true then the probability of obtaining a value of
t greater than 2.571 is 0.025, or 2½%, and the probability of obtaining a value of t less
than −2.571 is also 0.025. So, if the null hypothesis is true then the probability of
obtaining a value of t less than −2.571 or greater than 2.571 is 0.05, or 5% (see Figure 3.5).
In our example, the value of t was calculated to be −3.905.
Figure 3.5 Sampling distribution of the test statistic t under the null hypothesis: 5 degrees of
freedom
You met this term in Section
3.4, Frame 10 of Unit B3 in
connection with the χ² test.
Activity 3.3 Does this mean that the null hypothesis of no difference between the diets
should be rejected at the 5% significance level?
Solution Yes. The value −3.905 calculated for the test statistic t is less than −2.571
(i.e. it is more extreme than the critical value) so the null hypothesis should be rejected
at the 5% significance level.
In general, the null hypothesis should be rejected whenever the calculated value of the
test statistic t is further from zero than the critical value, i.e. whenever the value of t,
ignoring any minus sign, is greater than or equal to the critical value tc given in Figure
3.1 for the appropriate number of degrees of freedom.
Figure 3.6 Sampling distribution of the test statistic t, for various numbers of degrees of freedom,
compared with the standard Normal distribution
(curves shown: standard Normal distribution; 16 degrees of freedom; 5 degrees of freedom;
3 degrees of freedom; 1 degree of freedom; horizontal axis runs from −3 to 3)
If you look at the table of critical values in Figure 3.1 more closely then you may notice
that, as the number of degrees of freedom increases, the critical value decreases. For
example, for 10 degrees of freedom the critical value is 2.228, whereas for 40 degrees of
freedom it is 2.021. The reasons for this are illustrated in Figure 3.6: for small numbers
of degrees of freedom the tails of the distribution of the test statistic t die away less
rapidly than for large numbers, so the critical value must be further from zero. This
difference can be seen quite clearly (from the curves in the figure) for 1 or 3 degrees of
freedom but the curve for 16 degrees of freedom looks very similar to that of the
standard Normal distribution.
For a larger number of degrees of freedom, the curve (drawn at the scale of the figure) would be indistinguishable from that of the standard Normal distribution. However, even these small differences in the curves produce noticeable differences in the sizes of the tails. The tails of these distributions of t all represent a larger proportion of the distribution than do the tails of the standard Normal distribution. This larger proportion makes the critical value noticeably larger than 1.96: for 16 degrees of freedom it is 2.120 and for 40 degrees of freedom it is 2.021 (this is smaller than 2.120 but still larger than 1.96). Thus, as the number of degrees of freedom increases, the corresponding critical value decreases: the critical values get closer and closer to 1.96 but they never become smaller than 1.96.
This rejection rule is for the two-sided test; we do not describe the one-sided t-test in this course.
Activity 3.4 For each of the pairs of samples in Activity 3.2, carry out a two-sided t-test at the 5% significance level to test the null hypothesis
H0: The population means are equal: μX = μY
against the two-sided alternative hypothesis
H1: The population means are not equal: μX ≠ μY.
Solution
(a) In this case
t = -1.118 (cut to 3 decimal places).
The number of degrees of freedom is 8 so the critical value tc = 2.306. The value of the test statistic -1.118 is nearer to zero than the critical value 2.306, so we cannot reject the null hypothesis (of no difference between the population means) at the 5% significance level.
This critical value comes from
the row of Figure 3.1 with 8
in the Degrees of freedom
column.
(b) In this case
t = 2.647 (cut to 3 decimal places).
The number of degrees of freedom is 6 so, from Figure 3.1, the critical value tc = 2.447. The value of the test statistic 2.647 is further from zero than the critical value 2.447, so we reject the null hypothesis (of no difference between the population means) at the 5% significance level.
Often the data from the samples is not presented in detail but the summary measures, mean and standard deviation, are given, as in some of the examples in Unit B4. If the data is presented in this way, we can still carry out a t-test. To do this we need a formula for calculating S² from sX and sY. The formula is

S² = (nX sX² + nY sY²)/(nX - 1 + nY - 1).

This is because

Σ(x - mX)² = nX sX² (and similarly for y).
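As a check on this formula, here is a short Python sketch that carries out the whole calculation from summary measures alone. The sample sizes, means and standard deviations are made-up numbers for illustration, not data from the course:

```python
import math

# Hypothetical summary measures for samples X and Y (sizes, means and
# divide-by-n standard deviations, as used in this unit)
nx, mx, sx = 6, 10.0, 2.0
ny, my, sy = 8, 12.5, 2.5

# Pooled estimate of the common variance, using sum((x - mx)^2) = nx*sx^2:
# S^2 = (nx*sx^2 + ny*sy^2) / (nx - 1 + ny - 1)
S2 = (nx * sx**2 + ny * sy**2) / (nx - 1 + ny - 1)

# Test statistic t = (mx - my) / sqrt(S^2/nx + S^2/ny)
t = (mx - my) / math.sqrt(S2 / nx + S2 / ny)
df = nx - 1 + ny - 1
```

Note that sX and sY here are the divide-by-n standard deviations of this unit; if a calculator or library reports the divide-by-(n - 1) form instead, use Σ(x - mX)² = (nX - 1)SX².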
Activity 3.5 This concerns an experiment to investigate two different varieties of lupin: Lupinus arboreus and Lupinus hartwegii.
Here are the summaries of the data from two samples: one of each of these varieties. All the plants were grown at the same time in similar conditions in a nursery and the height of each (in metres) was measured on the same day.
arboreus
sample size nA = 4; mean mA = 1.452; standard deviation sA = 0.061.
hartwegii
sample size nH = 5; mean mH = 1.023; standard deviation sH = 0.023.
Carry out a t-test on these samples to investigate whether there is a significant difference between the heights of these samples of the two varieties.
Solution First we calculate the estimate S² of the common population variance using the above formula as follows.

S² = (4 × 0.061² + 5 × 0.023²)/(4 - 1 + 5 - 1) = 0.017529/7 = 0.0025041…

Now,

t = (1.452 - 1.023)/√(S²(1/4 + 1/5)) = 12.779 (cut to 3 decimal places).

The number of degrees of freedom is 7, thus the critical value tc = 2.365. The value of the test statistic 12.779 is further from zero than the critical value 2.365, so the varieties differ significantly in height. ■
Before going on to apply the t-test to the mustard seed data, which you should now
collect following the instructions in Section 2.4, here are some exercises.
Exercise 3.1
Solution on p. 43
Exercise 3.2
A biscuit manufacturer wishes to compare the performance of two biscuit production lines, A and B. The lines produce packets of biscuits with a nominal weight of 300 grams. Two random samples of 15 packets, one sample from each of the two lines, are weighed (in grams). The sample data is summarized as follows.
Solution on p. 44
Line A
mean mA = 309.8; standard deviation sA = 3.5.
Line B
mean mB = 305.2; standard deviation sB = 7.0.
Carry out a t-test on this sample data to test whether the two production lines produce packets of different weight.
3.4 Analysis of mustard seedling data
When you have collected the data from your mustard seedlings experiment, you will be able to test whether light affects root growth. You should have 10 measurements from seedlings grown in the light and 10 from seedlings grown in the dark: all from seedlings whose stems you cut.
Before carrying out the t-test on this data, you should consider whether you can assume that the populations satisfy the necessary distributional conditions. The only method which you have available to do this is to examine the data which you have collected. We shall now demonstrate how to do this using the following data, which we hope is not too different from yours.
Even if you have not got 10 measurements from each group, you can still carry out the test, provided you have at least … altogether, including at least 1 from each group!
Figure 3.7 Lengths of roots (in mm) obtained in a mustard seedlings experiment: one grid of seedlings grown in light and one of seedlings grown in dark.
Measurements which are underlined were obtained from seedlings whose stems were cut during their growth. A cross indicates a seed which did not germinate.
Activity 3.6 Prepare a back-to-back stemplot of the lengths of the roots of the two samples of 10 cut seedlings from Figure 3.7 (i.e. of the values which are underlined).
Solution
For the light sample the extremes are EU = 45 and EL = 13; for the dark sample EU = 41 and EL = 14. This suggests the following stemplot.

    Light         Dark
      7 3 | 1 | 4 6 7
      9 1 | 2 | 0 2 2 8
    9 9 1 | 3 | 2 6
    5 2 0 | 4 | 1

    nL = 10      1 | 4 represents 14 mm      nD = 10
Both of these samples of data look as if they could well have come from populations
with a Normal distribution. Also, their spreads are very similar so the population
standard deviations (and hence also variances) could well be equal. With such small
samples it is not possible to be more precise than this but there is certainly no strong
evidence that the distributional conditions required for the t-test are not satisfied. Nor,
again because they are so small, would there be any such evidence even if the samples
were less symmetric or had less similar spreads. Thus such a stemplot, together with a
large amount of similar data about living things collected by scientists, suggests that we
can apply the t-test to these samples of data.
The analysis of your mustard seedling data will form part of the tutor-marked
assignment covering this unit.
Summary
This section explained why a two-sample z-test cannot he used when the sample sizes
are small, and introduced the t-test which can be used with small samples under certain,
quite stringent, distributional conditions. Full details ofthese conditions, together with
a summary of the whole test procedure, are on p. 19. Here we shall summarize the
calculation of the test statistic.
Procedure To calculate the test statistic for the two-sample t-test
Denote the two samples by x and y and their sizes by nX and nY.
1 Calculate the sample means mX and mY and the sums of squares Σx² and Σy².
2 Calculate Σx² - nXmX² and Σy² - nYmY².
3 Calculate S² by summing Σx² - nXmX² and Σy² - nYmY² and dividing by nX - 1 + nY - 1 (i.e. the number of degrees of freedom).
4 Calculate mX - mY.
5 Calculate √(S²/nX + S²/nY).
6 Calculate the test statistic t by dividing mX - mY by √(S²/nX + S²/nY). Cut this to 3 decimal places.
Full accuracy should be kept for all these calculations but the value of the test statistic t should be cut to 3 decimal places.
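The six steps of this procedure translate directly into code. The following Python function is a minimal sketch; the two samples passed to it at the end are invented purely to exercise it:

```python
import math

def two_sample_t(x, y):
    """Steps 1-6 of the two-sample t-test statistic calculation."""
    nx, ny = len(x), len(y)
    # Step 1: sample means and sums of squares
    mx, my = sum(x) / nx, sum(y) / ny
    sumx2, sumy2 = sum(v * v for v in x), sum(v * v for v in y)
    # Steps 2-3: pooled variance estimate S^2
    df = nx - 1 + ny - 1
    S2 = ((sumx2 - nx * mx**2) + (sumy2 - ny * my**2)) / df
    # Steps 4-6: test statistic (full accuracy kept internally)
    t = (mx - my) / math.sqrt(S2 / nx + S2 / ny)
    return t, df

t, df = two_sample_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

The reported value of t would then be cut to 3 decimal places and compared with the critical value for df degrees of freedom from Figure 3.1.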
After completing this section you should be able to
carry out a two-sample t-test for unrelated samples
understand the stringent distributional conditions required for the use of the t-test.
4 The t-test continued
In this section we shall introduce a version of the t-test which can be used to analyse a
single sample of suitable data and then show how this test can be adapted to produce a
useful test for data from experiments involving matched pairs. We shall then show that,
like other hypothesis tests, these t-tests can be used to find confidence intervals. The
section ends by briefly considering more complicated designs of experiment and how
these might be analysed but, first, we return to a question which was raised briefly in
Section 3.1.
4.1 Why the t-test?
You may have wondered why people ever bother to use the t-test. After all, the Mann-Whitney test can almost always be used instead and the t-test seems to involve calculations which are comparatively lengthy. Moreover, the t-test requires quite demanding conditions: the variances of the two populations from which the samples are drawn must be equal and these two populations must have Normal distributions. In fact, this Normal distributions condition is the key to why the t-test is useful in many circumstances. The t-test is based on the individual data values, rather than just on their order, and hence appears to be using more information from the data than does the Mann-Whitney test. However, the extra information can be used only if a condition on the exact shape of the population distributions, the Normality condition, can be assumed.
Tests like this, which involve specific distributions, are called parametric. In contrast, non-parametric tests like the Mann-Whitney test and the Wilcoxon matched-pairs test do not require conditions on the exact shape of the population distributions, though they may require some limitations on the distributions. The name parametric arises because, for example, the t-test assumes that only the means and variances of the two population distributions are unknown: the shape must be that of a Normal distribution. (The mean and variance of a Normal distribution are called its parameters.)
Also, tests like the Mann-Whitney test can be used for data from ordinal measurements, whilst the t-test can be used only for data from interval scale measurements.
For example, the Mann-Whitney test assumes that the two population distributions must be the same shape as each other, but it does not matter what shape this is.
When the specific conditions for using the t-test are satisfied, the test certainly uses more of the information in the data than does the Mann-Whitney test. This shows up in the fact that the t-test is more powerful than the Mann-Whitney test. When statisticians talk of the power of a test, in a context like this, they mean the ability to pick out a true difference between the means of the two populations. So, judged by this criterion, the t-test is more likely than the Mann-Whitney test to detect a difference between population means when such a difference really exists.
Accepting the null hypothesis when it is really false is a type 2 error so the power of a test
is a measure of its ability to avoid a type 2 error. To be precise it is the probability that,
when using the test, a type 2 error will not occur. Other things being equal, the t-test is
less likely to result in a type 2 error than is the Mann-Whitney test. However, the
difference between the powers of these two tests is not very great.
In general, the power of a test
is a measure of that test's
ability to reject the null
hypothesis when it is really
false.
The slightly lower power of the Mann-Whitney test seems a small price to pay for the advantage of its simplicity and lack of restrictive conditions. For this reason, many statisticians prefer, whenever possible, to use it, and other non-parametric tests, rather than the t-test. However, the t-test and related tests do have advantages over non-parametric tests, particularly in analysing data from experiments with complicated designs.
One other difference between the t-test and the Mann-Whitney test is that the former is
a test for differences between means and the latter for differences between medians. If the
conditions required by the t-test are satisfied then the population distributions are
symmetric because they are Normal. Therefore, in this case the population mean and
median are equal. However, in other cases the mean and median may not be equal and
it could be important to use the median rather than the mean.
4.2 Checking the conditions
There are statistical techniques which help to decide whether it is reasonable to assume that a sample comes from a Normally distributed population. These are often called goodness-of-fit tests because they test how well the distribution of the data from a sample fits a specified distribution such as the Normal distribution. Other techniques help to assess whether it is reasonable to assume that two samples of data come from populations whose variances are equal. It is, however, always a good idea to produce stemplots of the measurements from the two samples and use these to check whether it looks as if the samples come from populations which are Normally distributed with equal variances. The problem with all these methods, however, is that with small samples it is very difficult to detect any departure from Normal distributions or from equal variances, even if these really exist. The distributions of the two populations need to have very different variances, or need to be very far from Normal, before these features show up in small samples. Usually, therefore, you must rely on knowledge of similar experiments, or of the data sources, to decide whether the distributional conditions required by the t-test are likely to be satisfied. For example, if the data was people's pay, or earnings, then it would not be reasonable to assume that the populations were Normally distributed. As was illustrated by Section 2.1 of Unit A2, such data often has a very right-skew distribution.
A Normally distributed population is a population with a Normal distribution.
In Section 3.4 we did this for the data from the mustard seedlings experiment.
A skew distribution is one which is not symmetric.
Fortunately, even if the conditions are not completely satisfied the t-test does not usually give wildly misleading results: the analysis is not usually very much more likely to lead to errors (of either type 1 or type 2) than if the conditions are met. For example, as was discovered by the statistician Boneau, if the two populations from which the samples are drawn have markedly skew distributions then the value of the test statistic t is likely to be slightly smaller than it would be if the populations were Normally distributed. Thus, it is slightly more likely that the null hypothesis will be accepted when it is actually false (i.e. that a type 2 error will occur) if the population distributions are skew than if they are Normal. Most statisticians would probably be happy to accept this risk.
Here, smaller means closer to zero.
Boneau also found that, provided the samples are of equal size, even if the variance of one of the Normally distributed populations were four times that of the other, the values of the test statistic t obtained from samples drawn from those populations would be only slightly different from those obtained from two (Normally distributed) populations with equal variances. If, however, the sample sizes are markedly different (e.g. one sample of size 5 and the other of size 15) then the values of the test statistic can behave very erratically. Thus it is advisable to have samples of the same size, or of as similar size as possible, when using a t-test.
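Robustness of this kind can be illustrated by simulation (this is our own rough sketch, not Boneau's study, and the settings are assumptions). The code below repeatedly draws two equal-sized samples from the same markedly right-skew (exponential) population, so the null hypothesis of equal means is true, and counts how often a nominal 5% two-sided t-test rejects it; the critical value 2.101 for 18 degrees of freedom comes from a standard t table:

```python
import math
import random

def t_statistic(x, y):
    # Two-sample t statistic with the pooled variance estimate S^2
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    S2 = ss / (nx - 1 + ny - 1)
    return (mx - my) / math.sqrt(S2 / nx + S2 / ny)

random.seed(1)
TRIALS = 2000
TC = 2.101          # 5% two-sided critical value for 18 degrees of freedom
rejections = 0
for _ in range(TRIALS):
    # Both samples come from the same right-skew population (mean 1)
    x = [random.expovariate(1.0) for _ in range(10)]
    y = [random.expovariate(1.0) for _ in range(10)]
    if abs(t_statistic(x, y)) >= TC:
        rejections += 1

rate = rejections / TRIALS   # compare with the nominal 0.05
```

With equal sample sizes the observed rejection rate stays close to the nominal 5% despite the skewness, which is the point of Boneau's findings.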
4.3 The one-sample t-test
The t-test described in Section 3.2 is used to compare two unrelated (i.e. unpaired) samples. We pointed out there that in many ways the procedure is similar to the z-test for the difference between two population means. You may have wondered whether a t-test like the one-sample z-test could be used to investigate whether the mean of a population, from which a single small sample of data is available, is equal to some particular value: if the sample were large then the one-sample z-test could be used. A version of the t-test can indeed be used for this purpose and the procedure is again very similar to that for the related z-test.
For example, in Section 3.1 of Unit B4 the one-sample z-test was used to investigate whether a sample of examination marks could come from a population with mean 58.
Example 4.1 A tomato grower decides to try out a new variety of outdoor bush
tomato. The seed merchant claims that this variety produces an average yield of 4 kg of
tomatoes per plant. The grower wants to investigate whether the claim is true because if
it is then it would be worth his growing this new variety. He has room to experiment
with only 5 plants of the new variety. The yields of tomatoes from each of his five plants,
in kg, are
3.6 3.2 3.1 2.6 3.9.
If the population mean of the yield (in kg per plant) of tomatoes from plants of this variety is denoted by μ, then the grower's null hypothesis is that μ is as the seed merchant claims. In symbols
H0: μ = 4.
Now, the grower needs to see whether there is a significant difference between the mean yield of his 5 plants and the seed supplier's claim. Let us assume for the moment that he has no views about the direction of any difference, thus his alternative hypothesis is two-sided. In symbols
H1: μ ≠ 4.
If the sample size had been large then the grower could have used the one-sample z-test
but the sample is certainly not large enough. He could use the sign test provided he
changed his hypothesis to refer to the population median rather than the mean.
However, this data is from interval scale measurements and the sign test takes account
only of whether each data value is above or below a particular number. Thus the sign
test does not use all the information in the data. It is possible to use much more of this
information by using a t-test but this can be done only if it is reasonable to assume that
the population distribution is Normal. Our tomato grower might well feel that this
assumption is justified.
Activity 4.1
(a) What test statistic would you use if you were trying to apply a z-test to this data?
(b) Why is the z-test not appropriate here?
Solution
(a) The test statistic would be

z = (m - 4)/SE,

where m is the mean of the sample data and SE is the sample estimate of the standard error of the sample mean. You would calculate this estimate from the formula

SE = s/√n,

where s is the sample standard deviation and n is the sample size.
See Section 3.1 of Unit B4.
See Section 4.1 of Unit B1.
Many measurements of living things are Normally distributed; for example, the heights of adult men. See Section 1.3 of Unit B4.
(b) The z-test is not appropriate here for the same reason that it was not appropriate in Section 3. The sample size is so small that s/√n may not be a good estimate of the standard error. Consequently, for such a small sample size, this test statistic has not got a standard Normal distribution.
Even though the z-test is not appropriate, the test statistic still seems reasonable. If the null hypothesis were true then the sample mean m would tend to be close to 4, so it makes sense to look at m - 4. If it is close to zero then H0 may well be true. If it is far away from zero then perhaps H0 is false. Also, it still makes sense to divide m - 4 by an estimate of its standard error, even if the small sample size means that this estimate may not be accurate. For these reasons the test statistic used for this t-test is almost the same as the z-test statistic. The only difference lies in the estimate of the standard error SE.
Activity 4.2 Calculate the sample mean m and sample standard deviation s for the tomato grower's data.
Solution The sample mean

m = Σx/n.

The calculation of the sample standard deviation by Method 2 uses the following formula

variance = Σx²/n - m² and s = √variance.

Laying the calculations out in a table gives the following (the sample size n = 5).

x:  3.6    3.2    3.1   2.6   3.9     Σx = 16.4
x²: 12.96  10.24  9.61  6.76  15.21   Σx² = 54.78

Thus the sample mean is m = 16.4/5 = 3.28.
Thus the sample variance is 10.956 - (3.28)², which is 0.1976.
Finally, the standard deviation s = √0.1976 = 0.4445222…
In calculating the test statistic for the t-test the only change from the z-test is that instead of using the sample standard deviation s, you should use

S = √(Σ(x - m)²/(n - 1)).

Remember that

s = √(Σ(x - m)²/n).

Thus the only difference between s and S is that to calculate S you should divide by n - 1 instead of by n.
If you were to use Method 1 to calculate s, by adding up the values (x - m)² directly, then you could calculate S by dividing by n - 1 instead of by n. If, as we advise, you use Method 2 then, instead of calculating

Σx²/n - m²,

you should calculate

(Σx² - nm²)/(n - 1),

and take the square root as before.
Example 4.1 (continued) For the tomato data, where n = 5,

S = √((54.78 - 5 × (3.28)²)/(5 - 1)) = √(0.988/4) = √0.247 = 0.4969909.

It is important not to cut this figure as it will be used in further calculations.
The reason for using S instead of s is that, for some purposes, S is a better estimate of the population standard deviation than is s. It has particular advantages in the t-test and related tests. For large samples, i.e. when n is large, it makes very little difference whether s or S is used. For example, if Σ(x - m)² were 200 and n were 100 then

s = √(200/100) = 1.4142…, whilst S = √(200/99) = 1.4213….

However, for small samples the difference is more marked.
Some books refer to S rather than to s as the sample standard deviation. Also, some calculators which have a key to calculate standard deviations automatically calculate S rather than s. If you have one of these calculators then you can check whether it calculates s or S by using it to find the standard deviation of the sample of size 2 consisting of the two numbers 0 and 2. For this sample, s = 1 and S = 1.414….
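The same check can be made in Python: in the standard statistics module, pstdev divides by n (our s) and stdev divides by n - 1 (our S):

```python
import statistics

data = [0, 2]
s = statistics.pstdev(data)   # divide-by-n form: s
S = statistics.stdev(data)    # divide-by-(n - 1) form: S

print(s, round(S, 3))   # 1.0 1.414
```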
Example 4.1 (continued) For the t-test the sample estimate of the standard error is S/√n. Thus the test statistic for the tomato grower is

t = (m - 4)/(S/√n),

i.e. m - 4 divided by this sample estimate of the standard error. This is

t = (3.28 - 4)/(0.4969909/√5) = -3.239 (cut to 3 decimal places).
If H0 were true then the population mean would be 4, so the sample mean m would probably be close to 4. Hence m - 4 would be near 0 and so t would be near 0. Thus, as with the z-test, values of t sufficiently far from 0 (positive or negative) result in rejection of the null hypothesis. To know whether to reject the null hypothesis you need a critical value and probably you will not be surprised to learn that this comes from the table of critical values in Figure 3.1. To use this table you need to know what number of degrees of freedom to look up. For the one-sample t-test the rule is:
for a sample of size n, the number of degrees of freedom is n - 1.
Activity 4.3 Find the critical value at the 5% significance level for the tomato grower's test.
Solution The sample size n is 5, so the number of degrees of freedom is 4. Thus from Figure 3.1 the critical value is 2.776.
As with the two-sample t-test, we cut the value of the test statistic t here to 3 decimal places.
Example 4.1 (continued) The rejection rule for the tomato grower is therefore: reject H0 in favour of H1 if the sample value of the test statistic t is greater than or equal to 2.776 or less than or equal to -2.776. Since t = -3.239, which is less than -2.776, the tomato grower should reject H0 and conclude that, on the basis of his sample, the seed merchant's claim that the new variety yields 4 kg per plant is false.
Of course he may have some reservations about this conclusion. Perhaps the seed merchant's claim referred to a growing season with average weather, and the weather had been bad this year, or perhaps the growing conditions or the amount of fertilizer used were not optimal.
Here is a summary of the procedure for the one-sample t-test: the test described here is two-sided.
One-sided t-tests can be used but they need a different table of critical values and we shall not describe them in this course.
Procedure One-sample t-test (two-sided)
The test applies to a sample of data measured on an interval scale when the population from which the data comes can be assumed to have a Normal distribution.
1 Denoting the population mean by μ, the null and alternative hypotheses are
H0: μ = A
H1: μ ≠ A.
2 Calculate the sample mean m and calculate

S = √(Σ(x - m)²/(n - 1)) = √((Σx² - nm²)/(n - 1)),

where n is the sample size.
The second formula is usually easier to calculate.
3 The test statistic is

t = (m - A)/(S/√n).

4 The critical value at the 5% significance level is the value of tc for n - 1 degrees of freedom in Figure 3.1.
5 Reject H0 in favour of H1 at the 5% significance level if t ≥ tc or t ≤ -tc.
Otherwise the conclusion is that there is insufficient evidence to reject H0.
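The whole procedure can be tried out in a few lines of Python on the tomato grower's data from Example 4.1:

```python
import math

def one_sample_t(data, A):
    """One-sample t test statistic and degrees of freedom."""
    n = len(data)
    m = sum(data) / n
    # S uses the divide-by-(n - 1) form: S^2 = (sum(x^2) - n*m^2) / (n - 1)
    S = math.sqrt((sum(x * x for x in data) - n * m * m) / (n - 1))
    t = (m - A) / (S / math.sqrt(n))
    return t, n - 1

yields = [3.6, 3.2, 3.1, 2.6, 3.9]   # kg per plant, from Example 4.1
t, df = one_sample_t(yields, 4)      # H0: mu = 4
# t is about -3.239 (cut to 3 decimal places) with df = 4;
# |t| >= 2.776, the critical value from Figure 3.1, so H0 is rejected
```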
In scientific experimentation of the kind described in Units C1 and C2, the one-sample
t-test is not often appropriate. What makes this test important in experimentation is
that it can easily be extended to deal with paired data, e.g. data from experiments using
crossover or matched-pairs designs.
4.4 The matched-pairs t-test
To illustrate this use of the one-sample t-test we shall look at some data which comes
from a paper by Cushny and Peebles published in 1904.
These researchers wanted to investigate whether two different forms, L and R, of a drug,
hyoscyamine hydrobromide, differ in their capacity to produce sleep. They conducted a
crossover trial on ten patients. Five of the patients received form R of the drug first, and
their gain in sleep was recorded. After a suitable time had elapsed they were given form
L instead and the same measurements recorded. The other five patients received the
two drugs in the opposite order. The results are in Figure 4.1.
This data was analysed by Gosset ('Student') in the 1908 paper in which he published the results which led to the t-test.
Hyoscyamine hydrobromide is not now commonly used as a sleep-inducing drug.
Figure 4.1 Sleep gained (hours) by the use of hyoscyamine hydrobromide

Patient   Form L   Form R
1         +1.9     +0.7
2         +0.8     -1.6
3         +1.1     -0.2
4         +0.1     -1.2
5         -0.1     -0.1
6         +4.4     +3.4
7         +5.5     +3.7
8         +1.6     +0.8
9         +4.6      0.0
10        +3.4     +2.0

A positive figure means that the patient got more sleep with the drug than without; a negative figure means that he or she got less sleep with the drug.
Note that patients vary considerably in their responsiveness to both drugs. For
example, patient 7 is very responsive to both drugs, whilst patient 5 is much less
affected.
It would not be appropriate to analyse this data using the two-sample t-test from
Section 3, or using the Mann-Whitney test, because the data is in matched pairs: there
is a pair of sleep gained measurements for each patient. You could therefore use the
Wilcoxon matched-pairs test.
Activity 4.4 If you were to use the Wilcoxon matched-pairs test on this data, what
would be the first calculations you would perform?
Solution You would find the difference, for each patient, between the hours of sleep
gained using form L and the hours gained using form R.
The t-test for matched-pairs starts in exactly the same way.
Activity 4.5 For each patient in Figure 4.1 calculate the difference in hours of sleep gained: L - R.
Solution
Figure 4.2 Sleep gained (hours)

Patient   Form L   Form R   Difference (L - R)
1         +1.9     +0.7     +1.2
2         +0.8     -1.6     +2.4
3         +1.1     -0.2     +1.3
4         +0.1     -1.2     +1.3
5         -0.1     -0.1      0.0
6         +4.4     +3.4     +1.0
7         +5.5     +3.7     +1.8
8         +1.6     +0.8     +0.8
9         +4.6      0.0     +4.6
10        +3.4     +2.0     +1.4
From this stage onwards the Wilcoxon test uses just the differences: the original data is not used again. So let us also concentrate on these differences. The t-test tests hypotheses about means rather than medians. The null hypothesis is that the two forms of the drug are equally effective: i.e. on average, it does not make any difference which form a patient receives. If we denote the population mean of the hours of sleep gained with form L by μL and the population mean for form R by μR then the null hypothesis is
H0: μL = μR
or
H0: μL - μR = 0.
If the population mean of the difference between the hours of sleep gained with form L and with form R is denoted by μD, then μD = μL - μR.
Now the null hypothesis is just that the population mean difference is zero:
H0: μD = 0.
Assuming that the researchers have no particular belief that the mean difference is likely to be positive rather than negative, so that a two-sided alternative is appropriate, then the alternative hypothesis is
H1: μD ≠ 0.
We now have a single sample of 10 numbers, the differences, and a null and an alternative hypothesis about the population mean difference of the population from which this sample of 10 differences comes. If it is reasonable to assume that the distribution of this population of differences is Normal then we can carry out a one-sample t-test on these differences.
Activity 4.6 Carry out the test. What do you conclude?
Solution The hypotheses are
H0: μD = 0
H1: μD ≠ 0.
First it is necessary to calculate S for the sample of differences (denoted here by x). We do this as follows.

x:  1.2   2.4   1.3   1.3   0.0   1.0   1.8   0.8   4.6    1.4    Σx = 15.8
x²: 1.44  5.76  1.69  1.69  0.00  1.00  3.24  0.64  21.16  1.96   Σx² = 38.58

Note that this t-test differs from the Wilcoxon test in that the zero difference is included.
Thus the mean of these differences is m = 15.8/10 = 1.58, and

S = √((Σx² - nm²)/(n - 1)) = √((38.58 - 10 × (1.58)²)/9) = 1.2299955.

Now, the test statistic is

t = (m - A)/(S/√n),

with A = 0 since the null hypothesis is H0: μD = 0.
Thus

t = (1.58 - 0)/(1.2299955/√10) = 4.062 (cut to 3 decimal places).

The critical value is obtained from Figure 3.1. The number of degrees of freedom is n - 1 = 10 - 1 = 9, so the critical value is 2.262. The value of t is 4.062, which is greater than 2.262, so H0 is rejected.
The conclusion is that the population mean difference is not zero. Thus, on the basis of this experiment it seems that the two forms of hyoscyamine hydrobromide do differ in their effectiveness in increasing sleep. ■
The procedure is on p. 34.
So, to carry out a matched-pairs t-test, the procedure is as follows.
Procedure Matched-pairs t-test (two-sided)
1 Calculate the differences between the two values in each pair.
2 The null and alternative hypotheses are
H0: μA = μB
H1: μA ≠ μB,
where μA and μB are the population means of the two populations involved. Replace these by the equivalent hypotheses
H0: μD = 0
H1: μD ≠ 0,
where μD is the population mean of the population of differences between the matched pairs.
3 If it can be assumed that this population of differences has a Normal distribution then apply the one-sample t-test with A = 0 to the sample of differences.
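Since the matched-pairs test is just a one-sample t-test on the differences, it takes only a few lines of code. This Python sketch applies it to the sleep data of Figure 4.1:

```python
import math

form_l = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]
form_r = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]

# Step 1: the differences L - R for each patient
d = [l - r for l, r in zip(form_l, form_r)]

# Steps 2-3: one-sample t-test of H0: mu_D = 0 on the differences
n = len(d)
m = sum(d) / n
S = math.sqrt((sum(x * x for x in d) - n * m * m) / (n - 1))
t = m / (S / math.sqrt(n))
# m = 1.58 and t is about 4.062 (cut to 3 decimal places), with
# n - 1 = 9 degrees of freedom; 4.062 > 2.262, so H0 is rejected
```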
4.5 Other matched-pairs tests
This procedure, which turns a one-sample test into a matched-pairs test, can be used on other tests. For example, if you had carried out an experiment which produced a large sample of matched pairs of data (on an interval scale) then you could calculate the difference between the two data values in each pair, and use the one-sample z-test to investigate whether the population mean of the differences is zero. If the samples were small then, in a similar way, the sign test could be used to investigate whether the population median of the differences is zero.
Working backwards, the Wilcoxon matched-pairs test can be used to test whether the median of a single sample is equal to a particular number (i.e. it can be used like the sign test in Unit B1). If the null and alternative hypotheses are
H0: M = 0
H1: M ≠ 0,
where M is the population median, then a sample of data from this population can be analysed in exactly the same way that the sample differences are analysed in the Wilcoxon matched-pairs test. A small adaptation of this technique produces a one-sample Wilcoxon test for any null hypothesis about the value of the population median.
So you have now met two different tests to use with small samples of matched pairs of data (on an interval scale): the matched-pairs t-test and the Wilcoxon matched-pairs test. Most of the remarks in Section 4.1, where we were comparing the two-sample t-test with the Mann-Whitney test, apply equally to any comparison between the matched-pairs t-test and the Wilcoxon matched-pairs test. For example, if the condition that the differences have a Normal distribution is satisfied then the t-test is more powerful than the Wilcoxon test but the difference in power is not great and in some ways the Wilcoxon test is easier to use.
4.6 Confidence intervals from t-tests
As with other hypothesis tests, such as the Mann-Whitney test and the z-tests, it is often
important to go beyond the test and obtain estimates in the form of confidence
intervals. For example, the tomato grower in Section 4.3 would probably like an
estimate of the yield of tomatoes which he is likely to obtain from the new variety using
his growing methods. This would help him to decide whether it was worth growing this
new variety even though it did not match up to the seed merchant's claim.
So the tomato grower wants a 95 % confidence interval for μ, the mean yield (in kg per
plant) of tomatoes which he will obtain from this new variety. If he were to calculate a
confidence interval based on the z-test then he would calculate the sample mean m and
the sample standard deviation s. Then the 95 % confidence interval for μ would be
[m − 1.96 s/√n, m + 1.96 s/√n].
Recall that 1.96 is the critical value for the z-test and that s/√n is the sample estimate of
the standard error. However, because his samples are small he had to use a t-test rather
than a z-test, and for this same reason he must also calculate his confidence interval
differently: it must be based on the t-test rather than the z-test.
The calculation is almost exactly the same as for the interval based on the z-test, but
there are two modifications.
He must use the critical value t_c from Figure 3.1. This is the same critical value as
the one which he used in the hypothesis test in Example 4.1; that for n − 1, i.e. 4,
degrees of freedom.
He must use the modified estimate S/√n of the standard error, just as in Section 4.3.
Thus the 95 % confidence interval for μ is
[m − t_c S/√n, m + t_c S/√n].
From Example 4.1, m = 3.28, S = 0.4969909, n = 5 and t_c = 2.776. Thus
t_c S/√n = 2.776 × 0.4969909/√5 = 0.61 (cut to the same level of accuracy as the sample mean),
so the tomato grower's 95 % confidence interval is [3.28 − 0.61, 3.28 + 0.61].
This is [2.67, 3.89].
Figure 4.3
We can summarize this method as follows.
Procedure To find a 95 % confidence interval for the population mean of a
Normally distributed population
The confidence interval is
[m − t_c S/√n, m + t_c S/√n],
where n, m, t_c and S are as in the procedure for the one-sample t-test. Thus t_c is the
critical value from Figure 3.1 for n − 1 degrees of freedom.
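This procedure can be sketched in code. The following Python sketch is illustrative (not part of the course): the critical value t_c is taken as given, read from a t-table such as the course's Figure 3.1, and the half-width is cut (truncated) to the same number of decimal places as the mean, following the convention used in the text.

```python
from math import sqrt

def t_confidence_interval(m, S, n, t_c, decimals):
    """95% CI [m - t_c*S/sqrt(n), m + t_c*S/sqrt(n)] for a population mean.

    t_c is the two-sided 5% critical value for n - 1 degrees of freedom,
    read from a t-table. The half-width is truncated to the same number
    of decimal places as the sample mean.
    """
    half = t_c * S / sqrt(n)
    half = int(half * 10 ** decimals) / 10 ** decimals  # cut, don't round
    return (round(m - half, decimals), round(m + half, decimals))

# The tomato grower's data from Example 4.1:
interval = t_confidence_interval(m=3.28, S=0.4969909, n=5, t_c=2.776, decimals=2)
# → (2.67, 3.89), as in the text
```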
In Section 4.4 we showed that a hypothesis test for a single sample can give a test for the
difference between samples of matched pairs of data by applying it to the sample of
differences. Similarly, a confidence interval for the difference between the samples can
also be found.
Activity 4.7 Use the solution to Activity 4.6 to find a 95 % confidence interval for μ_D.
Solution Here we have n = 10, m = 1.58, S = 1.2299955 and the critical value
t_c = 2.262. From this we calculate
t_c S/√n = 2.262 × 1.2299955/√10 = 0.87 (cut to the same level of accuracy as the sample mean).
Thus a 95 % confidence interval is
[1.58 − 0.87, 1.58 + 0.87].
This is [0.71, 2.45]. ■
Figure 4.4
Now μ_D = μ_L − μ_R, so you have calculated a confidence interval for μ_L − μ_R. In words,
μ_D is the population mean of the difference between the hours of sleep gained by the
same patient using form L and using form R of the drug hyoscyamine hydrobromide.
Thus you have calculated a 95 % confidence interval for the mean difference between
the hours of sleep gained using the two forms of the drug.
The two-sample t-test (for unrelated samples) from Section 3 also gives a confidence
interval for the difference between the population means. Again the form is similar to
that given by the two-sample z-test for the difference between two population means.
The two-sample z-test gives this confidence interval:
[(m_A − m_B) − 1.96 √(s_A²/n_A + s_B²/n_B), (m_A − m_B) + 1.96 √(s_A²/n_A + s_B²/n_B)].
The modifications required to apply this to small samples are
replace s_A² and s_B² by S_c² (calculated as described in the summary of Section 3);
replace 1.96 by the critical value t_c from Figure 3.1 for n_A − 1 + n_B − 1 degrees of
freedom.
The first replacement can be done only if it can be assumed that the two population
standard deviations are equal. This is the condition which we required for the two-sample
t-test in Section 3. Also, as with all t-tests and related confidence intervals, the
populations must have Normal distributions. These replacements result in the
following.
Procedure To find a 95 % confidence interval for the difference between the
population means of two unrelated Normally distributed populations with equal
standard deviations
The confidence interval is
[(m_X − m_Y) − t_c √(S_c²/n_X + S_c²/n_Y), (m_X − m_Y) + t_c √(S_c²/n_X + S_c²/n_Y)],
where n_X, n_Y, m_X, m_Y, t_c and S_c² are as in the summary of Section 3.
Thus t_c is the critical value from Figure 3.1 for n_X − 1 + n_Y − 1 degrees of
freedom.
Example 4.2 We can use this procedure and the calculations in Example 3.1 to
calculate a confidence interval for the difference in average weight-gain (kg per day)
between the calves fed on diets X and Y.
n_X = 4, n_Y = 3
m_X = 0.512, m_Y = 0.676
Thus
t_c √(S_c²/n_X + S_c²/n_Y) = 2.571 × √(0.0030283/4 + 0.0030283/3)
= 2.571 × √0.0017665
= 0.108 (cut to the same level of accuracy as the sample means).
Also m_X − m_Y = 0.512 − 0.676 = −0.164.
Thus the confidence interval is
[−0.164 − 0.108, −0.164 + 0.108],
which is [−0.272, −0.056].
Figure 4.5
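The procedure used in Example 4.2 can be sketched in code. This Python sketch is illustrative (not part of the course); the common variance estimate S_c² and the critical value t_c are taken as given, as in the example, and the half-width is cut to the same accuracy as the sample means.

```python
from math import sqrt

def pooled_t_confidence_interval(m_x, m_y, s_c_sq, n_x, n_y, t_c, decimals):
    """95% CI for the difference of two population means (equal variances).

    The interval is
      (m_x - m_y) -/+ t_c * sqrt(S_c^2/n_x + S_c^2/n_y),
    where S_c^2 is the common variance estimate and t_c is the critical
    value for n_x - 1 + n_y - 1 degrees of freedom (from a t-table).
    """
    half = t_c * sqrt(s_c_sq / n_x + s_c_sq / n_y)
    half = int(half * 10 ** decimals) / 10 ** decimals  # cut, as in the text
    diff = m_x - m_y
    return (round(diff - half, decimals), round(diff + half, decimals))

# The numbers from Example 4.2 (calves on diets X and Y):
interval = pooled_t_confidence_interval(0.512, 0.676, 0.0030283, 4, 3, 2.571, 3)
# → (-0.272, -0.056), as in the text
```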
Before we describe briefly the analysis of more complicated experiments, here are some
exercises to test your understanding of the techniques in this section.
Exercise 4.1
Solution on p. 44
Use the matched-pairs t-test to analyse the data from Section 6.1 of Unit C1 on the effect of
urastarvin on male patients. The data on the 5 matched pairs of males is in Figure 4.6.
Figure 4.6 [Table giving, for each of the 5 matched pairs: the patient number and
weight change (D) in the experimental group (drug taken), the patient number and
weight change (P) in the control group (placebo taken), and the difference for each
pair (D − P).]
Is there a significant difference between the effect of the drug and that of the placebo?
Exercise 4.2
Solution on p. 45
Calculate the following confidence intervals based on the data in Exercises 3.1 and 3.2.
(a) For the difference μ_B − μ_H, where μ_B and μ_H are the population mean weight-gains (kg) of
calves fed on the diets B and H respectively.
(b) For the difference μ₁ − μ₂, where μ₁ and μ₂ are the mean weights of packets of biscuits from
the two production lines.
4.7 Analysis of variance
So far, most of the experiments discussed in Units C1 and C2 have been relatively
simple in design. Each has involved a comparison of two groups, whether they be of
people, plants or calves, that are as nearly identical as possible except for one factor of
interest, such as a drug treatment, light or diet. However, experiments need not be as
simple as this in their design. It is often convenient to compare, not just one new drug
or one new diet with a control, but a whole range of them. Also in Section 6.2 of
Unit C1, we described a complicated version of the clinical trial from Section 6.1, in
which each patient received both drug and placebo one after the other, in a crossover
design. In Activity 6.1 of that unit you saw that the data from this trial could not be
analysed properly by using simple comparisons between two groups of measurements.
Let us look at another, simpler, example in which such multiple comparisons are
required. Suppose that agricultural scientists want to try out 6 different diets to see
which produces the best development in chickens. It would be cumbersome to carry out
6 separate experiments, each with one of the new diets tested against an existing
standard diet. Why not instead test all 6 new diets simultaneously against a single
control group fed on the standard diet? This would seem to make sense in terms of
efficiency: for instance, only one control group is needed rather than six.
Suppose also that the scientists wished to try out these 6 new diets on 4 separate breeds
of poultry. If they fed 10 chickens of each breed on each diet then the design of the
experiment would be as in Figure 4.9.
Figure 4.9
          Diet 1   Diet 2   Diet 3   Diet 4   Diet 5   Diet 6   Control diet
Breed 1   10       10       10       10       10       10       10
Breed 2   10       10       10       10       10       10       10
Breed 3   10       10       10       10       10       10       10
Breed 4   10       10       10       10       10       10       10
(each entry is the number of chickens in that group)
At the end of the experiment, the experimenters would weigh the chickens and
altogether they would have measurements on 280 birds, in 28 groups of 10.
Where does one begin to analyse data of this sort? One way might be to carry out a
hypothesis test of each experimental group of a given breed against the control group of
that breed. That would, however, involve 24 hypothesis tests, and this would be both
tedious and undesirable. It would be undesirable because each test would be conducted
on a relatively small number of chickens: those in only two of the positions in Figure
4.9, i.e. just 20 out of the 280 chickens. Thus each test would not be very likely to identify
a difference as significant unless it was fairly large.
It would be undesirable also because, with 24 hypothesis tests and a significance level of
5 %, there is a large chance of wrongly rejecting the null hypothesis in at least one of the
tests. The reason for this is as follows. Suppose that you were to carry out a hypothesis
test at the 5 % significance level on data from two samples. If the null hypothesis of no
difference between the two populations is true then the probability of obtaining a value
of the test statistic which led you to reject the null hypothesis is 5 %. If you carried out
only one such significance test that leads to rejection of the null hypothesis then you
would probably be quite happy to accept this small risk that you have made the wrong
decision (i.e. a type 1 error). However, still assuming that the null hypothesis is true, if you
were to carry out 100 hypothesis tests, on 100 completely independent pairs of samples
drawn from two populations, each at the 5 % significance level, then, on average, 5 % of
the 100 tests (i.e. 5 of them) would lead you to reject the null hypothesis although it is
true. In summary, if you carry out a large number of hypothesis tests at the 5 %
significance level then you are very likely to sometimes reject the null hypothesis when
it is true.
In the chicken experiment, if it happened that none of the new diets really made a
difference to any of the breeds (i.e. all 24 null hypotheses were really true) then it is more
likely than not that at least one of the null hypotheses would be wrongly rejected (i.e.
that a type 1 error would be made somewhere).
It would seem to be much better to use an analysis which can deal with the complete set
of data, using information about all 280 chickens. The typical weight of a chicken of one
breed could then be based on a larger rather than a smaller sample (size 70 rather than
10). Similarly, the typical effect of each diet could be based on a larger number of birds
(40 rather than 10). Moreover, the overall pattern of the results would show whether the
effect of the diets varied with the breeds to which they were fed.
The sort of experiment just described is very common in agriculture and animal
husbandry, and it is therefore not surprising that methods of analysing such data were
first developed by scientists working in this area. The technique that is used is called
analysis of variance. It would take more space than is available in this course to describe
the analysis of variance in detail, although the principles are quite straightforward. It
analyses the reasons why the measurements differ from one to another.
In the chicken example, the mean weight of all 280 chickens could be calculated. The
reasons why the weights of individual chickens differ from this overall mean could then be
investigated. Here are some examples of possible reasons.
Chickens of one breed are heavier than average, so individuals of that breed tend to
weigh more than the overall mean.
One diet is particularly poor, so individuals fed on that diet tend to weigh less than
the overall mean.
Individual chickens, even of the same breed fed on the same diet, will vary in weight.
To be more precise, the possible sources of these differences are as follows:
the diets
the breeds
the variability of individual chickens.
The analysis of variance makes it possible to estimate the extent to which the differences
between the measurements are due to each of these sources and, most important, to
assess whether any of the differences observed are due to the diets or breeds having a
real effect rather than being due to the random variability of individual chickens.
(These differences are sometimes called sources of variability, in which case the last
is called random variability.) If this
random variability is rather small but the differences due to the diets are large then the
analysis would probably lead to the conclusion that the diets are having a real effect. If,
however, the random variability is large and the differences due to both diets and breeds
are small then the analysis would probably lead to the conclusion that, on the basis of
this experiment, neither diets nor breeds make any difference, i.e. that the small
differences which were observed could be due to random variation alone. It might even
turn out that one diet is particularly suitable for one breed whilst another diet is
particularly suitable for some other breed: in this case further analysis will be required,
again using analysis of variance.
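The principle of partitioning variability can be illustrated in code. The sketch below (illustrative, not part of the course) computes the F statistic for a one-way analysis of variance, i.e. with a single source of systematic variation rather than the two (diets and breeds) in the chicken experiment; the data are invented.

```python
def one_way_anova_F(groups):
    """F statistic for a one-way analysis of variance.

    Partitions the total variability of the measurements into the part
    due to differences between the group means and the part due to the
    random variability of individuals within groups, then compares the
    two parts per degree of freedom.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # between-groups and within-groups sums of squares
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Three invented treatment groups:
F = one_way_anova_F([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

A large value of F indicates that the differences between the group means are too big to be explained by random variability alone; it is compared with a critical value from tables of the F distribution.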
Analysis of variance can be applied to many kinds of experiment. For example, analysis
of variance could be used to investigate the complicated version of the drug trial from
Section 6.2 of Unit C1. A slightly more complicated version of the mustard seed
experiment from Section 2 of this unit could also be analysed this way. In the
experiment which we asked you to carry out, you concentrated on comparing seedlings
whose stems had been cut (although you were asked not to cut the stems on a few
seedlings just to see what would happen). Another approach would have been to
compare the root lengths of seedlings grown in the light and dark, both with the stems
on and with the stems cut off. The design for such an experiment would have been as in
Figure 4.10.
Figure 4.10
                 Stem not cut    Stem cut
Grown in light   10 seedlings    10 seedlings
Grown in dark    10 seedlings    10 seedlings
There are two reasons why the analysis of variance has been introduced here, at the end
of a unit about experiments and t-tests.
The statistical techniques used in the analysis of variance are closely related to, but
rather more systematic than, those used in the t-test.
In addition to an analysis of variance, an experimenter may well carry out one or
more t-tests. For example, the analysis of variance might show that the diets vary
significantly. A t-test might then be used to analyse the difference between two
particular diets in which the experimenter was especially interested. There are
considerable difficulties in interpreting the results of such t-tests, especially when
there are several possible comparisons which could be made. Nonetheless, the t-test is
widely used in conjunction with complicated experiments involving the
analysis of variance. However, we do not recommend such use.
Summary
In this section we stated that the two-sample t-test is not markedly affected by quite
wide deviations from Normality in the population under investigation and that it is not
greatly affected by unequal variances in the two populations provided that the sizes of
the samples involved are nearly equal. We then introduced a one-sample t-test and
showed how this and other tests, such as the z-test, give tests for matched pairs of data.
Confidence intervals based on t-tests were also described and the principles of an
important statistical technique, called analysis of variance, were briefly illustrated.
After completing your work on this section you should be able to
recognize both samples and areas of investigation in which it would be unwise to
use the t-test
carry out a one-sample t-test
carry out a matched-pairs t-test
understand that other one-sample tests, e.g. the sign test and the z-test, can be used
as matched-pairs tests
find a confidence interval from small sample(s) of data from population(s) satisfying
the distributional conditions required by the t-tests.
(Note on Figure 4.10: the number of root measurements in each group is shown as 10.
Although of course other numbers could be used, it is important that there is the same
number of seedlings in each group.)
Solutions to exercises
Exercise 1.1 (p. 9)
(a), (b) and (d) All three of these activities are designed to answer specific questions by pursuing
specific investigations. Thus they are all scientific experiments provided they are properly
conducted and all stages are open to public scrutiny.
(c) This is not a scientific experiment since the decision is to be based on a single individual's
opinion. It is an expert opinion but it is nonetheless personal and private.
Exercise 1.2 (p. 9)
(a) This is a measurement experiment.
(b) This is an exploratory experiment. You could, however, rephrase it as a hypothesis-testing
experiment, as follows.
Hypothesis Rapid acceleration and braking causes high petrol consumption.
Prediction Accelerating and braking less fiercely will decrease the petrol consumption.
Test See what happens to petrol consumption when braking and accelerating are less fierce.
Possible results and conclusions Petrol consumption unchanged (or goes up), therefore
hypothesis is false. Petrol consumption goes down, therefore hypothesis is supported.
(d) This is a hypothesis-testing experiment. It seeks to test a hypothesis about the cause of a
phenomenon, as follows.
Hypothesis Overweight is caused by overeating.
Prediction Eating a lot will cause people to be overweight.
Test Measure the weights of people who eat a lot and people who do not and compare these
with their ideal weights.
Possible results and conclusions People who eat a lot are no more overweight than people who
do not, therefore hypothesis is false. People who eat a lot are more overweight than people who
do not, therefore hypothesis is supported.
Exercise 1.3 (p. 9)
(b) (Treating this as a hypothesis-testing experiment.)
Null hypothesis Rapid acceleration and braking does not cause high petrol consumption.
Alternative hypothesis Rapid acceleration and braking causes high petrol consumption.
Test See what happens to petrol consumption when braking and accelerating are less fierce.
Possible results and conclusions Petrol consumption goes down, therefore reject the null
hypothesis. Petrol consumption unchanged (or goes up), therefore the null hypothesis is
supported.
(d) Null hypothesis Overweight is not caused by overeating.
Alternative hypothesis Overweight is caused by overeating.
Test Measure the weights of people who eat a lot and people who do not and compare these
with their ideal weights.
Possible results and conclusions People who eat a lot are more overweight than people who do
not, therefore reject the null hypothesis. People who eat a lot are no more overweight than people
who do not, therefore the null hypothesis is supported.
Exercise 3.1 (p. 27)
Exercise 3.2 (p. 27)
Thus the value of the test statistic
The number of degrees of freedom is 28, so the critical value t, is 2.048.
The value of the test statistic is further from zero than the critical value so we should reject the null
hypothesis. Thus the two production lines do appear to produce packets of different weight.
Exercise 4.1 (p. 40)
We need to analyse the sample of 5 differences (D − P).
Thus the mean difference m = Σx/n = −12/5 = −2.4.
Thus S = √((Σx² − nm²)/(n − 1)) = 4.2778499.
Thus the test statistic
t = m/(S/√n) = −2.4/(4.2778499/√5) = −1.254.
The number of degrees of freedom is n − 1 = 4. Thus the critical value is 2.776. The test statistic
−1.254 is closer to zero than the critical value 2.776, so we cannot reject the null hypothesis. Thus
there appears to be no difference between the drug and the placebo in their effect on male
patients.
Exercise 4.2 (p. 40)
(a) From Exercise 3.1, t_c = 2.026 and
t_c √(S_c²/n_B + S_c²/n_H) = 0.056.
Thus, since m_B − m_H = 0.012, the confidence interval is
[0.012 − 0.056, 0.012 + 0.056].
This is [−0.044, 0.068].
Figure 4.7
(b) From Exercise 3.2, t_c = 2.048 and
t_c √(S_c²/n₁ + S_c²/n₂) = 4.2.
Thus, since m₁ − m₂ = 4.6, the confidence interval is
[4.6 − 4.2, 4.6 + 4.2].
This is [0.4, 8.8].
Figure 4.8
Index
alternative hypothesis 1.3, 3.2, 4.3, 4.4
analysis of variance 4.7
one-sample z-test 4.3
ordinal measurement 3.1, 4.1
back-to-back stemplot 3.4
Baconian experiment 1.2
Boneau 4.2
parameter 4.1
parametric test 4.1
Pasteur 1.2
Popper 1.3
population mean 3.2, 4.3
population standard deviation
population variance 3.2, 3.4
power of a test 4.1
common population variance 3.3
confidence interval 4.6
control 2.1
control group 4.7
critical value 3.3, 4.3
degrees of freedom 3.3, 4.3
distribution of t 3.3
distributional condition 3.1, 3.4, 4.1, 4.3
error 2.6, 4.1
estimate 3.2, 3.3, 4.3, 4.6
experiment 1.1, 2.1
exploratory experiment 1.2
goodness-of-fit test 4.2
Gossett 3.2, 4.4
Guinness 3.2
hyoscyamine hydrobromide 4.4
hypothesis testing 1.3
hypothesis-testing experiment 1.2
hypothetico-deductive experiment 1.2
interval scale measurement
3.1, 4.1, 4.3
Mann-Whitney test 3.1, 4.1
matched pairs 4.4, 4.5
matched-pairs sign test 4.5
matched-pairs t-test 4.4
matched-pairs z-test 4.5
mean 3.2, 4.1
measurement 2.5, 3.1
measurement experiment 1.2
median 4.1, 4.5
mustard seedlings 2.1, 3.4
random variability 4.7
root hairs 2.4, 2.5
S 4.3
Sc 3.3
sample 3.2, 3.3, 4.2, 4.3, 4.4
sample mean 3.2, 3.3, 4.3
sample size 3.2, 3.3, 4.3
sample standard deviation 3.2, 4.3
sample variance 3.2
sampling distribution 3.2
shape 4.1
sign test 4.3
skew 4.2
sources of variability 4.7
standard deviation 3.2, 3.3, 3.4, 4.3
standard error 3.2, 4.3
standard Normal distribution 3.2, 3.3, 4.3
stemplot 3.4
Student's t-test 3.2, 3.3, 4.4
symmetric 4.2
systematic error 2.6
t 3.3
t, 3.3
tail 3.3
test statistic 3.2, 3.3, 4.3
To Autumn 1.1
t-test 3.2, 3.3, 3.4, 4.1, 4.3, 4.4
two-sample t-test 3.3
two-sample z-test 3.2
type l error 4.1, 4.7
non-parametric test 4.1
type 2 error 4.1
Normal distribution 3.2, 3.3, 3.4, 4.2, 4.3, 4.4
null hypothesis 1.3, 3.2, 4.3, 4.4, 4.7
number of degrees of freedom 3.3, 4.3
variability 2.6, 4.7
variance 3.2, 3.3, 3.4
one-sample t-test 4.3
one-sample Wilcoxon test 4.5
Wilcoxon matched-pairs test 4.5
z-test 3.1, 3.2, 3.4, 4.3, 4.4, 4.5
Acknowledgements
Grateful acknowledgement is made to the following for the material in this unit: Figure
1.1 reproduced, by permission, from The Oxford Book of English Verse 1250-1918,
Oxford University Press; photographs p. 6 Wellcome Institute, p. 9 courtesy of Sir Karl
Popper, p. 18 from Student's Collected Papers, courtesy Biometrika; data Figures 3.2,
Exercise 3.1 Ministry of Agriculture, Fisheries and Food.
MDST 242 Statistics in Society
Block A
Unit A 0
Unit A1
Unit A 2
Unit A 3
Unit A 4
Unit A 5
Exploring the data
Introduction
Prices
Earnings
Relationships
Surveys
Review
Block B
Unit B1
Unit B2
Unit B 3
Unit B4
Unit B5
Statistical inference
How good are the schools?
Does education pay?
Education: does sex matter?
What about large samples?
Review
Block C
Unit C1
Unit C 2
Unit C3
Unit C4
Unit C5
Data in context
Testing new drugs
Scientific experiments
Is my child normal?
Smoking, statistics and society
Review