Introduction and Data Gathering (Chapters 1 – 2)

At the end of this lecture, the student should:
• Be able to provide a definition of Statistics.
• Discuss the role of statistics in research.
• Be able to state reasons for using statistics.
• Identify the difference between observational and experimental studies.
• Be able to organize data into a two-dimensional matrix or array.

I hear and I forget,
I see and I understand,
I do and I remember.
Chinese Proverb

STA6166-1-1

A Motivating Example: The HIP Trial

Breast cancer: a common malignancy among women in rich countries.
Mammography (screening): known today to lead to fewer deaths.
HIP Trial (1960s): the first study to conclusively show the merits of screening.
• 62,000 women aged 40-64 (members of the Health Insurance Plan, NY).
• Randomized into treatment and control groups; 31,000 in each.
• Treatment: an invitation to 4 rounds of annual screening.
• Control: received the usual health care prevalent at that time.

If we compare “screened” (1.1) vs. “refused” (1.5), there’s hardly a difference…? (More later!)

What Is Statistics?

1. Descriptive Statistics. Summary measures, such as totals, averages or percentages of measurements, counts, or ranks. Graphics used to present, organize, and summarize data, e.g. pie charts, histograms, boxplots, scatterplots, etc.
2. Inferential Statistics. The analysis and interpretation of data. Concerned with the extraction of information from data and its use in reaching conclusions (inferences) about the population from which the data are obtained. E.g. confidence intervals, hypothesis tests.

We will concentrate on (2), although the distinction will not always be clear.

Basic Definitions

• Experimental unit. The basic object on which measurements are taken. (May be composed of measurement units.)
• Factors. Variables in an experiment that are set by the investigator. (Controllable.)
• Response. Variable that is observed in an experiment. (Not controllable.)
• Treatments. Conditions constructed from the factors in order to observe the impact on the response.
• Control Treatment. Benchmark with respect to which the remaining treatments are compared.
• Population. The set of all measurements of interest.
• Sample. The subset of measurements from the population that is actually observed.
• Statistic. A number calculated from the sample, e.g. the sample average, the sample variance.
• Parameter. A number calculated from the entire population, e.g. the population average, the population variance.

Population vs. Sample

Using the sample average to make statements about the population average is an example of inferential statistics.

[Figure: Sample and Population, linked by Inference.]

Descriptive statistical methods: describe the sample.
Inferential statistical methods: make statements about the population based on the sample.

First Principle of Statistical Inference

You make inference about the population from which the sample was obtained. (Seems obvious, but is often forgotten.)

In each of the examples below, identify the population being sampled and the inference being made:
1. Study cow grazing behavior. One cow (Daisy) in pasture (A). Randomly select time intervals for observation during the month of May.
2. Study capital punishment and homicide rates. Randomly select 100 US cities. Objective is to make causal statements about a process.
3. In a pilot study, 20 runs of a manufacturing process are carried out in the lab. Objective: find out how the process will work in large-scale production.
4. Study yield of 3 varieties of winter wheat. Randomly sample 30 farms in Kansas: 10 farms grow variety A, 10 variety B, and 10 variety C. Measure the yield per acre over one growing season.

Scientific Method

• The pursuit of systematic interrelation of facts by logical arguments from accepted postulates, observation, and experimentation, and a combination of these three in varying proportions.
Roles of Statistics
• Aid in creating the “best” research design with which to generate new data.
• Extract the information from the noise or variability at the data analysis step.

Logical Arguments

• Deductive argument: Conclusion follows with logical necessity or certainty from the premises. Nothing new is revealed because we are arguing from the general to the specific.
• Specialization: Moving from a large set of objects, postulates, or events to consideration of a smaller set of objects or events.
• Inductive argument: Discovering general laws by the observation and combination of particular instances. Passing from the specific to the general.
• Generalization: Passing from the consideration of one object, postulate, or occurrence to the consideration of a set of objects, postulates, or occurrences.

In statistics we attempt to formalize and use these concepts in a quantitative way.

Scientific Progress

We gain knowledge by iterating between models and data.

[Figure: iteration between Hypothesis, Model, Conjecture and Data, Measurements, continuing with New Hypothesis, New Model and New Data, toward Progress and Understanding.]

Basic Study Steps

• State the problem. What are the questions?
• Devise a plan of solution. What will I do?
• Implement the plan. This is how I do it.
• Analysis of data. What happened?
• Interpretation of results. What does this mean?
• Reexamination. Is my logic correct? What next?

Study design and study implementation may require iteration.

Graphical Depiction of Scientific Study

[Figure: flowchart with the elements Knowledge Base, Problem, Constraints, Objectives & Hypotheses, DESIGN (Experiment, Sample, How to measure?), DATA, STATISTICAL ANALYSIS (Graphics & Visualization, Modeling, Estimates and Confidence Intervals, Formal Statistical Tests), Interpretation, Conclusions.]

Research Design Categories

• Census (Complete Enumeration): Every individual in the population of interest is observed. In a census, the sample equals the population.
• Observational Studies (Mensurative Experiments): Populations to be compared are defined, and individuals are randomly selected from these populations for measurement. This involves mere data collection; there is no interference with the processes generating the data.
• Experimental Studies (Manipulative Experiments): Individuals in one or more populations are carefully chosen or created to test specific manipulations under highly controlled conditions. Explanatory variables are manipulated; their effect on the response variable(s) is then observed.

Observational Study Design

• Observational studies are of 3 varieties:
  – Sample survey: studies a population at a particular point in time.
  – Prospective study: observes a population in the present using a sample survey, and proceeds to follow subjects into the future.
  – Retrospective study: observes a population in the present using a sample survey, and collects data about the subjects on events in the past.
• The possible presence of confounding variables poses a severe limitation in observational studies.
• Confounder. A (non-measured) variable, other than the explanatory variable, that affects the response variable. Confounders may affect both response and explanatory variables, and are outside the control of the researcher.

Observational Study Design

Example: Study lung cancer rates among smokers and non-smokers.
• What are the populations of interest?
• How will individuals be selected for measurement?
• What will be measured?
• Which analyses will be performed?
• How many individuals are needed?
• How large an effect will be considered important?
• Are available resources adequate for this study?

Many of these questions are answered by subject matter experts; some can be answered by a statistical analysis.

Observational Study (Mensurative Experiment)

[Figure: Population 1 and Population 2 each yield a sample (Sample 1 with individuals 1, ..., n; Sample 2 with individuals 1, ..., m). Each sampled individual contributes one row of measured characteristics (x x x x x ...), labeled by individual number and population number. Annotations: "How selected?" and "What is measured?"]

How are individuals selected?

• Individually identified (the “sample unit”).
• Randomly chosen (no biases introduced in selection). Each possible set of individuals has the same probability of selection (Simple Random Sampling).

Special situations allow for increased efficiency of selection:
• Stratification (account for an extraneous factor)
• Clusters (select natural groups of sample units)
• Multi-stage (select large units, then parts of units)
• Systematic (set pattern)

Simple Random Sampling: Example

A researcher wishes to determine the prevalence of a disease in a greenhouse of tomato seedlings. Each seedling tested for the disease is destroyed in the process, hence only a minimal number should be tested. Expectations are that only about 0.01% of the roughly 50,000 seedlings in the greenhouse have the disease.

How to select a simple random sample?
1. Number each pot. Use a random number table (or spreadsheet random number generator) to produce a list of numbers, in random order, from 1 to the total number of pots. Measure plants in pots whose numbers are selected (difficult).
2. Align pots in rows and columns. Use a random number table to select a list of row and column number pairs. Measure the plant in the pot located at each selected (row, column) pair (easier).

Table 13 in Ott and Longnecker.

Table 13 in Ott & Longnecker

Random number tables are constructed in such a way that, no matter where you start and no matter in which direction you move, the digits occur randomly with equal probability. These numbers can also be generated with statistical software packages.

Ex: Greenhouse seedlings
1. Use the random number table to select a list of row and column number pairs. The greenhouse has a total of 100 rows and 500 columns.
2. The first two blocks of numbers in Table 13 are: 10480 15011.
3. Reading the first block as a 2-digit number followed by a 3-digit number, we get 10 and 480. So we select the pot at the intersection of row 10 and column 480.
4. The next pot would be at the intersection of row 15 and column 11.

Simple Random Sample

Textbook definition. A simple random sample of n units is defined such that each possible sample of size n is equally likely to be drawn.
Practical definition. This sampling principle ensures that each unit in the population has the same probability (likelihood) of being selected in the sample.

Stratified Sampling

Allows us to take into account a factor we already know affects the response of interest, i.e. to “remove a source of known variability”.

[Figure: pine forest divided into three strata: 16 years, healthy; 22 years, healthy; 20 years, diseased.]

Pine forest: Estimate expected yield from plot. Individuals are selected at random within each stratum. Variability in the diseased subpopulation is expected to be much greater than in the healthy areas. Mean yield is greater at 22 years than at 16 years.

Cluster Sampling

Estimate the average sponge size on natural reefs.

[Figure: seven reefs, labeled with the number of sponges on each reef: 9, 25, 12, 21, 5, 14, 7.]

Selecting sponges at random would be very resource inefficient. It is cheaper to select reefs (sponge clusters) at random with probability proportional to size. All sponges on selected reefs are measured (a cheap thing to do that increases the sample size easily).

MultiStage Sampling

Typically, large areas or large complex populations can be more effectively sampled in stages. At the first stage, natural or synthetic clusters are selected. At subsequent stages the selected clusters are subdivided into units, and samples of these are selected.

[Figure: three panels. a. Random Selection. b. Systematic Selection (random starting point, randomly located grid). c. Multi-Stage Selection (first-stage unit, second-stage unit, measurement units).]

Example: National crop yield survey.

Greenhouse Example

Stratification: Maybe we have observed that plants near the door seem less healthy than those further into the greenhouse.
Divide the room into plants near the door and plants “inside”. Take random samples from each stratum.

Cluster: Suppose plants are arranged on tables. We could select tables at random, then examine all plants on each table selected. Note that if one plant on a table is diseased, all plants on that table have an increased probability of also being diseased.

Multi-Stage: Again suppose plants are on tables. Select some tables at random. Next select a few plants from each selected table for testing. The first-stage unit is the table. The second-stage unit is the plant. A third-stage unit could be the leaf on the plant, etc.

Systematic: Imagine plants arranged on a large table. Randomly pick a row and column to start. Then, following a systematic route, pick, say, every 10th plant.

What is measured?

Variable: Apt or liable to vary or change from individual to individual; capable of being varied or changed (factor); alterable; inconsistent; having much variation or diversity; a quantity that may assume any given value from a set of values (the variable’s range).

Examples:
• Plant biomass – varies from plant to plant.
• Blood arsenic level – varies from person to person.
• Gender – we are not all male or all female.

Types of Variables: Categorical

Categorical, classification, or qualitative variable

Discrete; essentially describes some characteristic of a sample unit. E.g. color, gender, grade, health status, treatment group. Further subdivided into:
• nominal – (think name) arithmetic doesn’t make sense, e.g. gender {M,F} even if coded {0,1};
• ordinal – (think order) nominal data with order, e.g. grades {A,B,C,D,F}, strength of agreement {1=strongly agree, 2=agree, 3=neutral, 4=disagree, 5=strongly disagree}.

In ordinal data the order is meaningful, but the difference between responses isn’t. Also, arithmetic is sometimes done, but its meaning is debatable.
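
The nominal/ordinal distinction can be illustrated with a short Python sketch that uses only the standard library. The data values, the names, and the alternative coding below are made up for illustration and are not from the lecture. The sketch reports the mode for a nominal variable, the median category for an ordinal one, and shows that the mean of the numeric codes changes with the (arbitrary) choice of codes.

```python
from collections import Counter
from statistics import mean, median_low

# Nominal: categories have no order, so count them and report the mode.
gender = ["F", "M", "F", "F", "M"]  # hypothetical sample
gender_mode = Counter(gender).most_common(1)[0][0]

# Ordinal: categories are ordered, listed here from first to last.
SCALE = ["strongly agree", "agree", "neutral", "disagree", "strongly disagree"]
responses = ["agree", "strongly agree", "neutral", "agree", "disagree"]  # hypothetical


def median_category(values, scale):
    """Median category, using only the order of the scale (not its spacing)."""
    positions = [scale.index(v) for v in values]
    return scale[median_low(positions)]


def mean_code(values, coding):
    """Mean of the numeric codes; the result depends on the chosen coding."""
    return mean(coding[v] for v in values)


usual_coding = {label: i for i, label in enumerate(SCALE, start=1)}  # codes 1..5
other_coding = dict(zip(SCALE, [1, 2, 5, 9, 10]))  # same order, different spacing

print("mode of gender:", gender_mode)
print("median response:", median_category(responses, SCALE))
print("mean code (1..5 coding):", mean_code(responses, usual_coding))
print("mean code (other coding):", mean_code(responses, other_coding))
```

Both codings preserve the order of the categories, so the median category is the same under either one, while the mean code is not. That is one way to see why arithmetic on ordinal codes is debatable.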

Types of Variables: Quantitative

Quantitative or amount variable

Can be either discrete or continuous; measures the amount or level of a characteristic of a sample unit. For example: age, weight, height, temperature, biomass, volume. Further subdivided into:
• interval – differences between values have meaning but there is no definite or meaningful zero point, e.g. GPA, SAT scores, temperature (in °C or °F);
• ratio – like interval but with a meaningful zero point, e.g. weight, money, yield.

In this course we will deal primarily with quantitative variables (ratio).

Study Design Questions

• How is the response (effect) to be measured?
• What characteristics of the response are to be analyzed?
• What factors influence the characteristics to be analyzed?
• Which of these factors will be studied in this investigation?
• How many times should the basic experiment be performed?
• What should be the form of the analysis?
• How large an effect (effect size) will be considered important?
• What resources are available for this study? Are they adequate?

It is important to be able to define the key terms in these questions: response, characteristics, factors, basic experiment, form of the analysis, effect size, resources.

Terminology

• The response typically refers to the measured variable(s) of primary interest (e.g. weight, health status, growth, etc.).
• Characteristics – Is it change in the average response, the spread of responses, the maximum response, etc., that will be examined? These characteristics typically refer to some “statistical” aspect of effects measured among individuals in the populations being studied.
• A factor refers to the characteristic(s) that primarily differ among the populations being studied (compared). Some factors we cannot manipulate (e.g. descriptors such as gender, geographic location, genetic makeup). Other factors identify characteristics we have caused to be different between the two populations (as in an experiment where we manipulate the populations by giving them different “treatments”).
• Basic Experiment – The selection of an individual for measurement. In an observational study, the basic experiment is the selection and measurement of an individual from the population. In an experimental study, the basic experiment is the selection of an individual from the “pool”, the application of a treatment, and the measurement of responses.

Terminology (Cont)

• By the form of the analysis, we refer to the statistical procedure(s) that match the characteristics of the study design, the characteristics of the responses measured, and the estimates and hypothesis tests needed to answer the questions of interest. So, when someone asks “What form will your analysis take?” you might answer with something like “I will be using regression analysis (the statistical method) to explore associations between fat intake and cholesterol level (the hypotheses of interest) between two populations identified geographically and by gender (study design factors).”
• The size of the effect of interest refers to how big a difference there must be before we would conclude that there is a “real” difference. Typically we are interested in specifying this at the design phase of a study, since the size of the effect of interest drives the sample size question. Thus if you say a difference of less than 2 points in cholesterol level between gender groups would not be important, but anything greater than 2 is large enough to be noteworthy, you could use this to set the study sample size. If the difference were raised to 10 points, a much smaller sample size would be needed.
• Resources – Money, personnel, time, access, material.

Experimental Study

• Manipulation Experiment: A research design in which the researcher deliberately introduces certain changes in the levels of factors that are hypothesized as affecting the process of interest, and then makes observations to determine the effect of these changes.
• Experimental Design: A study plan which assures that measurements will be relevant to the problem under study.
• Treatments: Changes to those factors which are suspected of affecting the process under study.

Ex: Factorial Experiment

FACTORS: Nitrogen Level and Phosphorus Level.
LEVELS: 0, 10, 20 kg/ha (nitrogen); 0, 10 kg/ha (phosphorus).
TREATMENTS (nitrogen / phosphorus, in kg/ha):

                     Nitrogen Level
Phosphorus Level     0 kg/ha    10 kg/ha   20 kg/ha
0 kg/ha              0 / 0      10 / 0     20 / 0
10 kg/ha             0 / 10     10 / 10    20 / 10

BLOCKED LAYOUT (complete block: all treatments in each block). Each EXPERIMENTAL UNIT is a PLOT.

SITE 1 (block 1):    0 / 10     10 / 0     20 / 10    10 / 10    20 / 0     0 / 0
SITE 2 (block 2):    10 / 10    20 / 10    10 / 0     0 / 0      0 / 10     20 / 0

Standard Form for a Data Set

              |---------- CATEGORIES ----------|   |------ AMOUNTS ------|
Observation   strata   gender   color   Other           weight   Other
Number                                  categorical              quantitative
                                        variable                 variable
1             1        F        RED     x x ...         10.2     x x ...
2             1        F        WHITE   x x ...         12.9     x x ...
3             1        M        BLUE    x x ...         20.1     x x ...
...           ...      ...      ...     ...             ...      ...
n             1        F        BLUE    x x ...         16.0     x x ...

Example Data Set in Spreadsheet Format

A period (.) is the indicator of missing data.

OBS  ITEMP  IRH    IWB    FWB  REP  BIRD  BN   IBT  ATBT  WEIGHT     SATBT    SITEMP      SIWB
  1  24.47   64   20.2  20.25    1     1   1  40.6  39.7    2.21  -1.24351  -1.28723  -1.27434
  2  24.47   64   20.2  20.25    1     2   2  40.6  40.2   2.265  -0.69343  -1.28723  -1.27434
  3  24.47   64   20.2  20.25    1     3   3  40.9  39.4   2.185  -1.57355  -1.28723  -1.27434
  4  24.45   50  18.55   18.6    2     1   4  40.3  40.1   2.275  -0.80345  -1.29196  -1.67386
  5  24.45   50  18.55   18.6    2     2   5  40.4  39.4   2.264  -1.57355  -1.29196  -1.67386
  6  24.45   50  18.55   18.6    2     3   6  40.1  39.2   2.205  -1.79358  -1.29196  -1.67386
  7  24.68   50  18.45  19.52    3     1   7  41.1  40.5   2.343  -0.36338  -1.23754  -1.69807
  8  24.68   50  18.45  19.52    3     2   8  41.2  40.8   2.193  -0.03334  -1.23754  -1.69807
  9  24.68   50  18.45  19.52    3     3   9  40.9  40.9   2.238   0.07668  -1.23754  -1.69807
 10  24.79   51  18.57   18.2    4     1  10  39.8  39.4    2.32  -1.57355  -1.21151  -1.66902
 11  24.79   51  18.57   18.2    4     2  11  39.6  39.4   2.298  -1.57355  -1.21151  -1.66902
 12  24.79   51  18.57   18.2    4     3  12  39.8  39.8    2.31  -1.13349  -1.21151  -1.66902
 13  25.03   74   21.6   21.8    1     1  13  39.8  38.9   2.212  -2.12363  -1.15472  -0.93536
 14  25.03   74   21.6   21.8    1     2  14  39.8  38.7    2.21  -2.34366  -1.15472  -0.93536
 15  25.03   74   21.6   21.8    1     3  15  39.4  39.4   2.198  -1.57355  -1.15472  -0.93536
 16  24.44   74  21.22   21.5    2     1  16  40.1  39.6   2.235  -1.35352  -1.29433  -1.02737
 17  24.44   74  21.22   21.5    2     2  17  40.1  39.8   2.257  -1.13349  -1.29433  -1.02737
 18  24.44   74  21.22   21.5    2     3  18    40  39.6   2.284  -1.35352  -1.29433  -1.02737
 19  24.43   73   21.2  21.76    3     1  19  39.4  39.9    2.33  -1.02348  -1.29669  -1.03221
 20  24.43   73   21.2  21.76    3     2  20  39.8  40.2   2.314  -0.69343  -1.29669  -1.03221
 21  24.43   73   21.2  21.76    3     3  21  39.5  39.2   2.295  -1.79358  -1.29669  -1.03221
 22  25.24   78  21.91  22.06    4     1  22     .     .   2.149         .  -1.10503   -0.8603
 23  25.24   78  21.91  22.06    4     2  23     .     .    2.12         .  -1.10503   -0.8603
 24  25.24   78  21.91  22.06    4     3  24     .     .   2.127         .  -1.10503   -0.8603
 25  25.35   89  23.78  24.01    1     1  25     .     .   2.213         .    -1.079  -0.40751
 26  25.35   89  23.78  24.01    1     2  26     .     .   2.216         .    -1.079  -0.40751
 27  25.35   89  23.78  24.01    1     3  27     .     .    2.36         .    -1.079  -0.40751

Inventor's Paradox

The more ambitious the plan, the more chances of success, and the more opportunity for failure.

How does one decide on what to do?
Are there open questions?
Are there available resources?
Does someone really want the answer?
Can a study be done?
Will the study be able to answer the question?

Statistics may help answer the last question!

The HIP Trial Revisited

• It seems natural to compare “screened” (cancer rate=1.1) vs. “refused” (cancer rate=1.5) in the treatment group; hardly a difference!
• But realize that this is an observational comparison (in an experimental study), and hence is prone to confounding.
• Social status is a confounder. Richer and better educated women were more likely to accept the screening, and breast cancer hits the richer harder than the poorer. (Pregnancy, esp. early pregnancy, is now known to protect against breast cancer.)
• So the analysis by treatment received is biased.
  But the analysis by intention-to-treat is appropriate.

• “Intention to screen” cancer rate (1.3).
• “Control” cancer rate (2.0).
• A sizeable difference.
• Five-year cancer rate ratio (treat/control) is 39/63 ≈ 62%.
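
The intention-to-treat arithmetic above can be checked with a short Python sketch. It assumes that 39 and 63 are the five-year event counts in the treatment and control groups, that each group has 31,000 women (as stated in the motivating example), and that the quoted rates are per 1,000 women. The slides do not state the unit, so the per-1,000 scale is an assumption, and all names below are illustrative.

```python
# Intention-to-treat comparison for the HIP trial: a sketch under stated assumptions.
GROUP_SIZE = 31_000      # women randomized to each group (from the slides)
EVENTS_TREATMENT = 39    # assumed: five-year event count, "intention to screen" group
EVENTS_CONTROL = 63      # assumed: five-year event count, control group


def rate_per_1000(events: int, group_size: int) -> float:
    """Events per 1,000 women; the per-1,000 unit is an assumption."""
    return 1000 * events / group_size


rate_treatment = rate_per_1000(EVENTS_TREATMENT, GROUP_SIZE)
rate_control = rate_per_1000(EVENTS_CONTROL, GROUP_SIZE)

# The group sizes are equal, so they cancel and the ratio of rates
# equals the ratio of counts, 39/63.
rate_ratio = rate_treatment / rate_control

print(f"treatment rate: {rate_treatment:.1f} per 1,000")
print(f"control rate:   {rate_control:.1f} per 1,000")
print(f"rate ratio (treat/control): {rate_ratio:.0%}")
```

Under these assumptions the sketch reproduces the figures quoted above: rates of 1.3 and 2.0, and a ratio of 62%. The comparison uses everyone as randomized, whether or not they accepted screening, which is what protects it from the confounding that affects the “screened” vs. “refused” comparison.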