Chapter 1: Data Collection
1.1 Introduction to the Practice of Statistics
1.2 Observational Studies, Experiments, and Simple Random Sampling
1.3 Other Effective Sampling Methods
1.4 Sources of Errors in Sampling
1.5 The Design of Experiments
September 3, 2008

Definition of Statistics
Given a question, statistics is the art and science of designing studies, collecting the data, summarizing the data, and then analyzing the data to draw conclusions. In particular, statistics is:
• collecting data
• organizing this data
• summarizing the organized data
• analyzing the summarized data
• drawing conclusions from this analysis
Section 1.1

Data
Data is information that is collected about a generic population (people, animals, machines, etc.). In the social sciences it is usually about people: their characteristics (height, weight, age, etc.) or attitudes (beliefs, political opinions, religion, etc.).

Types of Statistics
• Descriptive Statistics: This type of statistics uses graphs, tables, charts, and the calculation of various statistical measures (mean, standard deviation, etc.) to organize and summarize information about a population. This is the material in Math 127A.
• Inferential Statistics: This type of statistics consists of techniques (hypothesis testing, confidence intervals, etc.) for reaching conclusions about a population based upon information obtained from a subset of the population. This is the material in Math 127B.

Average Yearly Temperature in Nashville
Question: Is the climate of Nashville warming?
The average temperature of Nashville is available on the National Weather Service website for 1872-2007. The average daily temperature is calculated by summing the highest and lowest hourly temperatures and then dividing by 2. The monthly average temperature is obtained by computing the average of the daily average temperatures, and the yearly average temperature is obtained by computing the average of the monthly average temperatures.
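The averaging scheme described for the Nashville temperatures can be sketched in a few lines of Python. This is only an illustration: the function names and the hourly readings are hypothetical and are not taken from the National Weather Service data.

```python
from statistics import mean

def daily_average(hourly_temps):
    """Average daily temperature: (highest + lowest hourly reading) / 2."""
    return (max(hourly_temps) + min(hourly_temps)) / 2

def monthly_average(daily_averages):
    """Monthly average: mean of the daily average temperatures."""
    return mean(daily_averages)

def yearly_average(monthly_averages):
    """Yearly average: mean of the monthly average temperatures."""
    return mean(monthly_averages)

# Hypothetical hourly readings (degrees Fahrenheit) for two days.
day1 = [50, 54, 61, 66, 58]
day2 = [48, 52, 60, 64, 55]

daily = [daily_average(day1), daily_average(day2)]
print(daily)                   # [58.0, 56.0]
print(monthly_average(daily))  # 57.0
```

Note that the daily figure uses only the highest and lowest readings, so it is a midrange rather than a mean of all hourly values.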
Mathematica Notebook

The Statistical Method (QDDI)
• Question: What is the problem of interest? Identify your research objective.
• Design: How will the data be collected? From whom? About what?
• Description: Give the characteristics of the data. This is where mathematics can play a major role. Summarize the data. Give a graphical description of the data. (Descriptive Statistics)
• Inference: What does the data tell us? If you started with a hypothesis, does the data confirm this hypothesis? (Inferential Statistics)

Example
Harvard Medical School studied about 22,000 male physicians to determine if taking aspirin could prevent heart attacks. The physicians were split into two roughly equal groups: about 11,000 would receive an aspirin per day and the other 11,000 or so would receive a placebo. The assignment of physicians was done randomly. During the course of the study, 0.9% of the male physicians who were taking aspirin had a heart attack, while 1.7% of those taking the placebo experienced a heart attack. The researchers then used the statistical method to predict that if all male physicians could have participated in the study, the percentage having a heart attack would have been lower for those taking aspirin.

QDDI
• Question: Does taking aspirin each day reduce the incidence of heart attacks in male physicians?
• Design: Take a sample, with half taking aspirin and half taking a placebo. This is called an experiment.
• Description: Heart attack rate: aspirin (0.9%) versus placebo (1.7%).
• Inference: All male physicians would benefit from taking daily aspirin.

Terminology of Statistics
• Population: A population is the complete collection of all elements to be studied.
• Sample: Any subset or group of a population is called a sample.
• Variable: A variable is a characteristic of the individuals in the population that will be analyzed.
• Parameter: A parameter is a numerical summary of a variable for the population.
• Statistic: A statistic is a numerical summary of a variable obtained from a sample of the population.

Types of Data
• Quantitative data is composed of measurements (numbers) about the population.
• Categorical (or qualitative) data is data that can be separated into categories and can be identified by some non-numeric characteristic.
• Continuous data is quantitative data that can take any value in an interval.
• Discrete data is quantitative data that is not continuous; it takes only separate, countable values.

Example
• Population: All of the students in Math 127A who are in WH 103 today.
• Sample: The students in Row 10 of the classroom.
• Variables:
  – Color of eyes
  – Month of birth
  – Home state
  – Age
  – Religion

Example (continued)
• Data (Qualitative/Quantitative):
  – Blue eyes
  – October
  – Georgia
  – 18
  – Lutheran
• Parameters:
  – The average age.
  – The standard deviation of heights.
• Statistics:
  – The average age of students in Row 5.
  – The fraction of students with blue eyes in Row 9.

Data for Statistical Studies
• Census: A census is a list of all individuals in a population along with certain characteristics of each individual (e.g., age, race, home ownership, etc.).
• Observational Study: An observational study attempts to measure a characteristic of the population by examining a sample, but does not manipulate the sample. An observational study often uses a sample survey to collect data.
• Experimental Study: An experiment selects a sample of the population and manipulates one or more variables. The variable that is manipulated is called an independent variable and the variable that is affected is called a dependent variable.
Section 1.2

Census Website
http://www.census.gov

Observational Study
Observational Study: An observational study measures the characteristics of a population by studying a sample of individuals. It attempts to find connections between these characteristics without manipulating the sample. The study is passive, or ex post facto.
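The distinction between a parameter and a statistic from the terminology above can be illustrated numerically. The sketch below is a made-up example: the class size of 200 students seated in 10 rows of 20 and the randomly generated ages are assumptions for illustration, not data from the course.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 200 students seated in 10 rows of 20.
population = [random.randint(17, 23) for _ in range(200)]
rows = [population[i * 20:(i + 1) * 20] for i in range(10)]

# Parameter: a numerical summary of the variable over the whole population.
parameter = mean(population)

# Statistic: the same summary computed from a sample (here, Row 10).
statistic = mean(rows[9])

print("Population mean age (parameter):", parameter)
print("Row 10 mean age (statistic):", statistic)
```

The statistic will usually differ somewhat from the parameter, and it changes if a different row is used as the sample, while the parameter stays fixed for a given population.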
Design of Observational Studies

Example of Sample Survey
Sample Survey: A random sample of 10,000 people is drawn, and each individual is interviewed to determine information about the following variables of the population:
• age
• race
• gender
• number of children
• income bracket ($0-$25K, $25K-$50K, …)
• wealth bracket
• homeowner
Question: Is there a relationship between homeownership and number of children?

Algorithm for Setting Up a Sample Survey
• Step 1: Identify the population from which the sample is to be drawn.
• Step 2: Compile a list of subjects in the population from which the sample will be taken. This is called the sampling frame.
• Step 3: Specify a method for selecting subjects from the sampling frame. This is called the sampling design.
• Step 4: Collect the data.

Designed Experiments
Experimental Study: An experiment is a study in which one or more variables (called explanatory variables) are manipulated to determine their effects on another variable (called the response variable). That is, the explanatory variable is controlled to see how the response variable changes with changes in the explanatory variable. The conditions placed on the explanatory variable are called treatments. In this type of study, the explanatory variable is sometimes called a factor of the experiment.

Design of Experiments

Remark
Observational studies are useful for detecting connections between two variables in a population. Experimental studies are useful for determining the nature of the connection.

Types of Sampling
• Random (good)
• Non-random (bad)
Examples: Suppose that our population is 200 students who are seated in a classroom of 10 rows with 20 seats per row. If we chose as our sample the students who sit in the even-numbered rows, then this would be a non-random sample. Suppose that we place 10 balls, each marked with a separate number (1-10), in a bag.
We would generate a random sample of 20 students by choosing one of the balls out of the bag and using the number on the ball as the row for our sample.
Section 1.3

Simple Random Sample
Simple Random Sampling: Each individual in the population has the same chance of being selected for the sample as any other individual. A list of the individuals in the population from which a sample is to be drawn is called a frame.

Two Sets of Random Numbers
Generate a set of 100 random numbers (1-9):
S {8, 1, 7, 1, 2, 7, 6, 4, 4, 5, 9, 6, 5, 4, 9, 9, 2, 4, 6, 6, 6, 7, 4, 2, 1, S = {1, 6, 6, 9, 3, 1, 6, 3, 5, 5, 4, 4, 4, 9, 2, 1, 1, 7, 6, 3, 2, 8, 1, 5, 4, 8, 8, 7, 5, 9, 2, 6, 6, 7, 2, 8, 1, 4, 1, 4, 9, 2, 7, 2, 8, 7, 4, 4, 1, 9, 8, 6, 4, 9, 8, 1, 3, 7, 5, 7, 9, 6, 1, 8, 1, 6, 8, 8, 6, 2, 5, 1, 6, 9, 6, 5, 8, 8, 2, 9, 9, 6, 8, 6, 2, 9, 8, 1, 1, 8, 2, 9, 1, 9, 3, 9, 4, 5, 2, 2, 5, 3, 5, 3, 5, 5, 5, 2, 8, 1, 2, 4, 2, 2, 7, 4, 2, 8, 8, 2, 4, 3, 9, 3, 7, 3, 2, 5, 1, 1, 6, 7, 4, 6, 9, 1, 8, 4, 1, 8, 5, 9, 6, 3, 7, 5, 4, 1, 9, 9, 5, 3} 7, 2, 4, 1, 1, 4, 7, 4, 7, 7, 9, 9, 2, 4, 4, 9, 3, 6, 6, 6, 4, 1, 6}
Frequency Chart of Numbers

Types of Samples
Simple Random Sample: A sample that is obtained by randomly choosing individuals in the population.
Stratified Sample: A stratified sample is a sample that is obtained by separating the population into non-overlapping groups (called strata) and then randomly selecting individuals from each stratum.
Systematic Sample: A systematic sample is a sample that is obtained by selecting individuals in the population in a systematic way, e.g., every 5th individual.
Cluster Sample: A cluster sample is a sample that is obtained by selecting all individuals within a randomly selected subset or group of the population.
Convenience Sample: A convenience sample is a type of sample that is drawn because it is easy or convenient to collect. Convenience samples are likely to underrepresent portions of the population.
They may not be random and may contain bias due to time or location.

Three Main Sampling Methods
Random, Cluster, Stratified

Advantages of Different Random Sampling Methods
• Simple Random Sampling: Gives a good picture of the whole population.
• Cluster Random Sampling: It is often easier and cheaper to implement because subjects are close together and well defined once the clusters are chosen.
• Stratified Random Sampling: Guarantees that each stratum (segment) is sampled.

Sources of Errors in Sampling
Fact: Erroneous conclusions can be drawn from observational or experimental studies due to faulty statistical design and sampling.
• Non-sampling Errors: These errors occur when the sampling process (design) is faulty. This usually occurs when there is a problem with the sampling frame or sampling design. In other words, preference is given to selecting some individuals over other individuals in the population.
  – response errors
  – non-response errors
  – processing errors
  – analysis errors
  – coverage errors
• Sampling or Estimation Errors: This error occurs when the sample gives an incomplete picture of the population. This type of error is due to the fact that we are using a sample instead of the whole population.
Section 1.4

Non-sampling Errors
• Response Errors: Poor questionnaire design, interviewer bias, respondent errors, poor survey process. For example, the organization of the survey could be confusing, individuals may give deceptive responses to questions, the data collector may not speak the language of the individual being interviewed, etc.
• Non-response Errors: Complete or partial non-response. For example, individuals may agree to be interviewed, but then choose not to answer some or all of the questions.
• Processing Errors: Computational errors are made in coding, capturing, editing, and presenting the final data.
• Analysis Errors: Incorrect statistical tests are applied to the data, resulting in erroneous conclusions.
• Coverage Errors: Individuals are duplicated in or omitted from the sample.

Non-sampling Bias
Example: Suppose we are interested in the approval rating of Mayor Dean, and we will conduct a random telephone survey on whether citizens of Nashville approve or disapprove of his job performance since he took office. Is there bias in this sample survey?
Answer: Maybe, since it will miss citizens who do not have a telephone, and this group of people may have different opinions about the mayor than those who do have a telephone.

Design of Experiments
Review from Section 1.3: An experiment is a study for the collection of data that is used to determine the effects of one or more variables (called explanatory variables) on another variable (called the response variable). The individuals from which the data is collected are called subjects or experimental units. The conditions placed on the explanatory variable are called treatments. In this type of study, the explanatory variable is sometimes called a factor.
An experiment is called double-blind if neither the subjects nor the experimenter know which treatment is being administered to each subject. We say that the experiment is completely randomized if each experimental unit is randomly assigned to a treatment. A randomized experiment comparing medical treatments is called a clinical trial.
Section 1.5

Types of Experiments
• Completely Randomized Design: Each experimental unit is randomly assigned a treatment.
• Randomized Matched-pairs Design: Experimental units are paired, with each experimental unit in the pair assigned a different treatment. The matched pair can be the same individual, so that the individual receives both treatments (e.g., before and after).
• Randomized Block Design: Experimental units are grouped together in groups. Units in each group (block) are randomly assigned treatments.

Example
Object of Study: Does aspirin reduce the heart attack rate?
Population: Male physicians in the U.S.
Sample: 22,071 male physicians between the ages of 40 and 84.
Study: The sample was split into two groups. One group took an aspirin per day and the other group took a placebo. The doctors were randomly assigned to these two groups. The doctors were monitored over a 5-year period.
Explanatory Variable: aspirin: yes or no (categorical)
Response Variable: heart attack: yes or no (categorical)
Type of Experiment: Completely randomized design.

Example (continued)
Heart attack counts by treatment:
            Yes      No        Total
Aspirin     104      10,933    11,037
Placebo     189      10,845    11,034
Total       293      21,778    22,071
This is an experiment, and the aspirin and placebo are the treatments. We manipulated the explanatory variable to see the effect on the response variable.

Example (continued)
Fraction of Heart Attacks for both Treatments
            Yes      No        Total
Aspirin     0.0094   0.9906    1.0
Placebo     0.0171   0.9829    1.0

Example (continued)
Conclusion from Study: The heart attack rate per 1000 male physicians is 9.4 for those taking aspirin and 17.1 for those taking the placebo. Hence, we would conclude that taking aspirin reduces the heart attack rate.

Matched-pairs Designs
A matched-pair design experiment is a study where there are only two treatments and the experimental units are matched in pairs. One experimental unit receives one treatment and the other experimental unit receives the second treatment. The pair may be the same individual (before treatment and after treatment) or it may be two individuals who have similar characteristics (e.g., gender, age, etc.). The assignment of the treatments within each pair should be random.

Example of Matched-Pairs
Purpose: Study the effect of taking caffeine one half hour before swimming.
Sample: 50 randomly chosen swimmers.
Explanatory Variable: A caffeine pill or a placebo.
Response Variable: Time to swim one mile.
Study Design: Experiment
Matched-pair Design: The 50 swimmers are selected.
Each swimmer is randomly given either the caffeine pill or the placebo and swims one mile, and the time is recorded. After one week, the same 50 swimmers return and are given the treatment that they did not receive the previous week. They swim the mile and the time is recorded. Each swimmer's times under the two treatments are compared.

Blocks and Block Designs
• A collection of experimental units that have the same (or similar) values of a key variable is called a block. In the previous example, each subject (person) is a block.
• Experimental units are divided into groups (blocks) and each treatment is randomly assigned to one or more of the units in each block. In other words, a block design identifies blocks before the start of the experiment and assigns subjects to treatments within those blocks.
• To reduce bias, the order of treatments within each block is randomized; we call this a randomized block design.
• A matched-pair design is a special type of block design. Here each pair of experimental units forms a block.
• In a block design study, an experimental unit (subject) may receive only one treatment.

Example of Block Design
Purpose: Study the effect of taking caffeine one half hour before swimming.
Sample: 50 swimmers: 16 males who swim competitively, 14 males who do not swim competitively, 8 females who swim competitively, and 12 females who do not swim competitively.
Explanatory Variable: A caffeine pill or a placebo.
Response Variable: Time to swim one mile.
Study Design: Experiment
Randomized Block Design: We create four blocks (16, 14, 8, and 12 subjects). Within each block, individuals are randomly assigned either the caffeine pill or the placebo. Each subject's swim time is recorded. The times of the swimmers within each block, as well as across the blocks, are compared (caffeine pill versus placebo).

What type of experiment?
A drug company wanted to test a new arthritis medication. The researchers found 200 adults aged 25-35 and randomly assigned them to two groups.
The first group received the new drug, while the second received a placebo. After one month of treatment, the percentage of each group whose arthritis symptoms decreased was recorded and compared with their original condition. What type of experimental design is this?

What type of experiment?
A medical journal published the results of an experiment on insomnia. The experiment investigated the effects of a controversial new therapy for insomnia. Researchers measured the insomnia levels of 86 adult women who suffer moderate conditions of the disorder. After the therapy, the researchers again measured the women's insomnia levels. The differences between the pre- and post-therapy insomnia levels were reported. What type of experimental design is this?

What type of experiment?
A farmer wishes to test the effects of a new fertilizer on her tomato yield. She has four equal-sized plots of land: one with sandy soil, one with rocky soil, one with clay-rich soil, and one with average soil. She divides each of the four plots into three equal-sized portions and randomly labels them A, B, and C. The four A portions of land are treated with her old fertilizer. The four B portions are treated with the new fertilizer, and the four C portions are treated with no fertilizer. At harvest time, the tomato yield is recorded for each section of land. What type of experimental design is this?

What type of experiment?
A random sample of 1,000 overweight male adults is recruited. Each male is weighed and his weight is recorded. Each individual is given a diet and is told to follow it for one month. After one month, each individual is weighed again and the weight is recorded. The “before” and “after” weights are compared. What type of experimental design is this?

What type of experiment?
A random sample of 30 Vanderbilt students is selected. We are interested in the reaction times when using or not using a cell phone during driving.
Each student’s reaction time was measured both while using and while not using a cell phone on a driving course in a Vanderbilt parking lot. What type of experimental design is this?
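
The random assignment behind a completely randomized design and a randomized block design can be sketched in Python. This is a minimal illustration under stated assumptions: the function names, the subject labels, and the seed are hypothetical; only the block sizes (16, 14, 8, 12) come from the swimming example.

```python
import random
from collections import Counter

def completely_randomized(units, treatments, rng):
    """Shuffle the units, then deal the treatments out in turn."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)]
            for i, unit in enumerate(shuffled)}

def randomized_block(blocks, treatments, rng):
    """Randomize separately within each block, then combine the results."""
    assignment = {}
    for units in blocks.values():
        assignment.update(completely_randomized(units, treatments, rng))
    return assignment

rng = random.Random(2008)  # arbitrary seed, for reproducibility only
treatments = ["caffeine", "placebo"]

# Hypothetical subject labels; block sizes follow the swimming example.
blocks = {
    "male, competitive": [f"MC{i}" for i in range(1, 17)],
    "male, non-competitive": [f"MN{i}" for i in range(1, 15)],
    "female, competitive": [f"FC{i}" for i in range(1, 9)],
    "female, non-competitive": [f"FN{i}" for i in range(1, 13)],
}

assignment = randomized_block(blocks, treatments, rng)
print(Counter(assignment.values()))
```

Because every block here has an even number of subjects, each treatment is received by exactly half of each block, which is the balance that a block design is meant to guarantee and that a single shuffle of all 50 swimmers would not.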