Producing data p163

Anecdotal evidence is based on individual cases that often come to our attention because they are striking in some way. These cases may not represent any larger group of cases.

p164 Available data are data that were produced in the past for some other purpose but that may help answer a present question.

p164 However, the clearest answers to present questions often require data produced to answer those specific questions.

Sampling vs census.

Observation vs Experiment p167
An observational study observes individuals and measures variables of interest but does not attempt to influence the response. An experiment imposes a treatment on individuals in order to observe their response.
An observational study, even one based on a statistical sample, is a poor way to study the effect of a treatment. To see the effect of a treatment, we must actually impose the treatment. When our goal is to understand cause and effect, experiments are the only source of fully convincing data.

Design of experiments p170
o The individuals on which the experiment is done are the experimental units. When the units are human beings, they are called subjects.
o A specific experimental condition applied to the units is called a treatment.
o The explanatory variables in an experiment are called factors.
o The values of a factor are called levels.
o Many experiments study the joint effect of several factors. In such an experiment, each treatment is formed by combining a specific value of each of the factors.
o Placebo effect and control group

Example
We want to study the effects of aspirin and beta carotene on heart attacks and cancer.
Factors: aspirin (levels: yes, no), beta carotene (levels: yes, no).
Response variables: occurrence of heart attacks, cancer.
Treatments are the factor-level combinations (2 x 2 = 4 treatments).

The example above is a factorial (two-factor) experiment. In principle, experiments can give good evidence of causation.
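The way treatments are formed from factor levels can be sketched in a few lines of Python. This is only an illustration of the aspirin / beta carotene example: the dictionary `factors` and the helper `enumerate_treatments` are names made up here, not part of the course material.

```python
from itertools import product

# Factors and their levels, taken from the aspirin / beta carotene example.
# The dictionary layout is an assumption made for this sketch.
factors = {
    "aspirin": ["yes", "no"],
    "beta_carotene": ["yes", "no"],
}


def enumerate_treatments(factors):
    """Return every treatment: one level chosen from each factor."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


treatments = enumerate_treatments(factors)

# Print each treatment with a running number.
for number, treatment in enumerate(treatments, start=1):
    print(number, treatment)
```

With two factors at two levels each, the list holds the 2 x 2 = 4 treatments named in the example. Adding a third factor to `factors` would multiply the number of treatments by that factor's number of levels.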
Bias (p173): The design of a study is biased if it systematically favors certain outcomes. An uncontrolled study of a new medical therapy, for example, is biased in favor of finding the treatment effective because of the placebo effect.

Randomization p173
The design of an experiment first describes the response variable(s), the factors (explanatory variables), and the layout of the treatments, with comparison as the leading principle.
The second aspect of design is the rule used to assign experimental units to the treatments. Comparison of the effects of treatments is valid only when all treatments are applied to similar groups of experimental units. Systematic differences among the groups of experimental units in a comparative experiment cause bias.

Example
The use of chance to divide experimental units into groups is called randomization. The design in the above figure combines comparison and randomization to arrive at the simplest randomized comparative design.

Principles of experimental design p175
o Control the effects of lurking variables on the response, most simply by comparing two or more treatments.
o Randomize: use impersonal chance to assign experimental units to treatments.
o Replicate each treatment on many units to reduce chance variation in the results.

Statistical significance p176
An observed effect so large that it would rarely occur by chance is called statistically significant.

How to randomize? Hat method, random number tables, software.

Example
Line 101 of Table B is: 19223 95034 05756 28713 96409 12531. Use it to make a random selection of 5 individuals from a group of 30 people.
When using Table B to choose from 10 subjects, why is it more efficient to use 0 to represent the 10th subject than to select digits 2 at a time?

StatCrunch Commands
Enter the population (the group) from which you want to select the sample in a column.
Use the commands Data > Sample Column and, in the resulting dialog box, click on the column that contains the population (the group) from which you want to select the sample. Click “Sample Columns”.

Completely randomized design (CRD) p178
When all experimental units are allocated at random among all treatments, the experimental design is completely randomized.
Examples

Cautions about experimentation p179
The study of the effects of aspirin and beta carotene on heart attacks and cancer was double-blind: neither the subjects nor the medical personnel who worked with them knew which treatment any subject had received.
The double-blind method avoids unconscious bias by, e.g., a doctor who doesn’t think that “just a placebo” can benefit a patient.

Lack of realism p180

Matched pairs designs p181

Block design p181
A block is a group of experimental units or subjects that are known before the experiment to be similar in some way that is expected to affect the response to the treatments. In a randomized block design (RBD), the random assignment of units to treatments is carried out separately within each block.

Example 3.17 p182
- The progress of a type of cancer differs in women and men.
- We want to compare 3 therapies.
- Gender is a blocking variable.
- Two randomizations are done, one assigning the female subjects to treatments and the other assigning the male subjects.

Sampling design p188
Question: we want to know what percent of the voting-age population consider themselves conservatives.
- Time, cost, and inconvenience forbid contacting every individual.
- We gather information about only part of the group in order to draw conclusions about the whole population.

Population and sample p189
- The entire group of individuals that we want information about is called the population. A sample is a part of the population that we actually examine in order to gather information.

Sample design
The design of a sample refers to the method used to choose the sample from the population.
- Poor sample design can produce misleading conclusions.
- The ABC network program Nightline asked (in a call-in poll) whether the UN should continue to have its headquarters in the United States.
- More than 186,000 callers responded (telephone companies charge for these calls) and 67% said “No”.
- People who spend time and money to respond to call-in polls are not representative of the entire adult population. In fact, they tend to be the same people who call radio talk shows.
- People who feel strongly, especially those with strong negative opinions, are more likely to call.
- It is not surprising that a properly designed sample showed that 72% of adults want the UN to stay.

Voluntary response sample p190
A voluntary response sample consists of people who choose themselves by responding to a general appeal.
- Voluntary response samples are biased because people with strong opinions, especially negative opinions, are most likely to respond.
- Random selection of a sample eliminates bias by giving all individuals an equal chance to be chosen.

Simple random sample p191
A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.
How to select an SRS? Software, hat method, random number tables.
Example

Probability sample p193
A probability sample is a sample chosen by chance. We must know what samples are possible and what chance each sample has. An SRS is one example.

Stratified random sampling p193
- Divide the population into groups of similar individuals, called strata.
- Then choose a separate SRS in each stratum and combine these SRSs to form the full sample.
Example

Multistage sampling design p194
Example
- Data on employment and unemployment are gathered by the Government’s Current Population Survey.
- It conducts interviews in about 60,000 households each month.
- It’s not practical to maintain a list of all US households from which to select an SRS.
- The cost of sending interviewers to the widely scattered households in an SRS would be too high.
- So a multistage design is used.

The Current Population Survey sampling design:
Stage 1. Divide the US into 2007 geographical areas called primary sampling units (PSUs). Select a sample of 754 PSUs.
Stage 2. Divide each PSU selected into smaller areas called “blocks”. Stratify the blocks using ethnic and other information and take a stratified sample of the blocks in each PSU.
Stage 3. Sort the housing units in each block into clusters of 4 nearby units. Interview the households in a random sample of these clusters.

Cautions about sample surveys p194
Undercoverage
- Sample surveys require an accurate and complete list of the population (sampling frame). Because such lists are rarely available, most samples suffer from some degree of undercoverage.
- Undercoverage occurs when some groups in the population are left out of the process of choosing the sample.
- Example: a sample survey of households will miss homeless people, prison inmates, and students in dormitories.
- An opinion poll conducted by telephone will miss the 6% of American households without residential phones.
Nonresponse
- Nonresponse occurs when an individual chosen for the sample can’t be contacted or doesn’t cooperate.

Response bias p195
The behavior of the respondent or the interviewer can cause response bias in sample results.
- Respondents may lie, especially if asked about illegal or unpopular behavior. The sample then underestimates the occurrence of such behavior in the population.
- Answers to questions that ask the respondent to recall past events are often inaccurate because of faulty memory.

Wording of questions p197
Confusing or leading questions can introduce a strong bias in a sample survey, and even minor changes in wording can change a survey’s outcome.

Statistical inference p202
Parameters and statistics
A parameter is a number that describes the population.
A parameter is a fixed number, but in practice we do not know its value.
A statistic is a number that describes a sample. The value of a statistic is known when we have taken a sample, but it can change from sample to sample. We often use a statistic to estimate an unknown parameter.

Sampling distribution p205
The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.

Example 3.29 p214
- We simulate drawing SRSs of size 100 from the population of potential customers. Suppose that in fact 60% of the population have an interest in the product. Then the true value of the parameter we want to estimate is p = 0.6.

Bias p206: A statistic used to estimate a parameter is unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated.
The variability of a statistic is described by the spread of its sampling distribution. The spread is determined by the sampling design and the sample size n.

Managing bias and variability p208
To reduce bias, use random sampling such as an SRS.
To reduce the variability of a statistic from an SRS, use larger samples.
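The ideas of bias and variability can be explored with a small simulation in the spirit of the p = 0.6 example. The sketch below is not the textbook's code. It assumes the population is large enough that each sampled individual can be treated as an independent draw with probability p of being interested. The function names, the number of repetitions (1000), and the second sample size (2500) are arbitrary choices made for illustration.

```python
import random
import statistics


def sample_proportion(p, n, rng):
    """Draw one sample of size n and return the sample proportion."""
    # Assumption: a very large population, so draws are treated as independent.
    hits = sum(1 for _ in range(n) if rng.random() < p)
    return hits / n


def simulate_sampling_distribution(p, n, repetitions, seed=0):
    """Return the sample proportions from many repeated samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [sample_proportion(p, n, rng) for _ in range(repetitions)]


P_TRUE = 0.6  # the parameter, known here only because we are simulating

for n in (100, 2500):
    p_hats = simulate_sampling_distribution(P_TRUE, n, repetitions=1000)
    center = statistics.mean(p_hats)   # close to P_TRUE if the statistic is unbiased
    spread = statistics.stdev(p_hats)  # variability of the statistic
    print(f"n={n}: mean of p-hat = {center:.3f}, std dev of p-hat = {spread:.4f}")
```

In a run of this sketch, the mean of the simulated sample proportions should sit near 0.6 for both sample sizes, while the spread should be clearly smaller for the larger sample. That matches the two statements above: random sampling addresses bias, and a larger n reduces variability.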