STAT 600: 1 - Introduction to Statistics Spring 2014

WHAT IS STATISTICS?

Statistics is the science of learning from data. This involves collecting, describing, and drawing conclusions from data. The application of statistics can be divided into two broad areas:

Descriptive Statistics: Graphical methods or numerical summaries which describe data.

Inferential Statistics: Process in which a smaller group (sample) is used to draw conclusions about a larger group (population).

TYPES OF DATA

Recall that statistics is the science of data. All data (and hence the variables that we measure) can be classified as one of two types, and each of these has two subgroups.

Qualitative (or categorical): Measurements that are classified into one of a group of categories.
- Nominal: Order is not important.
- Ordinal: Measurements fall in some natural order.

Quantitative (or numerical): Measurements that are recorded on a naturally occurring numerical scale.
- Discrete: There are gaps between possible data values.
- Continuous: There are no gaps between possible data values; that is, the measurements occur on a continuous scale.

Example 1.1: The National Center for Health Statistics administers a survey to National Health and Nutrition Examination Survey participants on an annual basis. The survey participants are randomly selected from the U.S. population (http://www.cdc.gov/nchs/nhanes/participant.htm). The following survey items refer to the participant’s dermatological health. Classify each item (variable) as nominal, ordinal, discrete, or continuous.

Survey Item/Variable | Variable Type
How many moles do you have that are at least ¼ inch in diameter? | ___________
What is your natural hair color? | ___________
When you go outside on a very sunny day for more than an hour, how often do you wear sunscreen (always, most of the time, sometimes, rarely, or never)? | ___________
How many times in the past year have you had a sunburn? | ___________
If applicable, diameter of moles or lesions suspicious of melanoma or other malignancies. | ___________

Example 1.2: Assessing Mercury Levels Found in Fish in Maine Lakes

Mercury is a toxic metal sometimes found in fish consumed by humans. The state of Maine conducted a field study of 115 lakes to characterize mercury levels in fish, measuring mercury and 10 variables on lake characteristics. From these data we could potentially investigate the following research questions:
1. Are mercury levels high enough to be of concern in Maine lakes?
2. Do dams and other man-made flowage controls increase/decrease mercury levels?
3. Do different types of lakes have different mercury levels?
4. Which lake characteristics best predict mercury levels?

The variables measured by the researchers as part of this field study are listed below. Classify each variable according to type.

Merc (ppm): Mercury level found in fish fillets in parts per million ___________
N: number of fish in the composite ___________
Elevation: elevation of the lake (feet) ___________
Surf Area: surface area of the lake (acres) ___________
Z: maximum depth (feet) ___________
Lake type: 1 = oligotrophic, 2 = eutrophic, 3 = mesotrophic ___________
ST: lake stratification indicator (1 = yes, 0 = no) ___________
This refers to whether or not there is temperature stratification within a lake. In summer, the lake surface warms up and a decreasing temperature gradient may exist with the bottom remaining cold. A lake is considered stratified if a temperature decrease of 1 degree per meter or greater exists with depth.
DA: drainage area (square miles) ___________
Area of land which collects and drains the rainwater which falls on it, such as the area around the lake.
RF: runoff factor = (total runoff during year)/(total precipitation during year) ________
Runoff factor (RF) is the amount of rainwater or melted snow which flows in rivers and streams.
In general, higher runoff factors may lead to more surface water from the lake watershed reaching the lake. If contaminants are from a local source, this may influence the concentration found in fish.
FR: flushing rate = (total inflow volume during year)/(total volume of lake) ________
Flushing rate (FR) gives the number of times all water is theoretically exchanged during a year.
DAM: Dept. of Inland Fisheries and Wildlife impoundment class ________
0 = no functional dam present; all natural flowage
1 = some man-made flowage in the drainage area

SOME BASIC DEFINITIONS

Most of what we’ll be doing in this course centers on trying to understand a set of information. This set of information is from a . . .

Population: The complete collection of ALL elements to be studied.

The population is often so big that obtaining all information about its elements is either difficult or impossible. So, we work with a more manageable set of data obtained from a . . .

Sample: A subcollection of elements drawn from a population.

Example 1.3: Consider the National Health and Nutrition Examination Survey mentioned above. Identify the following:
Population of interest:
Sample:

Census: All elements are drawn from the population (hence there is no difference between the population and the sample). Note that inferential methods are not needed when a census of the entire population is taken.

Observation: The collection of measurements from a particular unit in a population.

Variable: Any measurable characteristic of an observation.

When creating a data set to be imported into a statistical software package such as JMP, you should place each variable in its own column. Then, each row will consist of a separate observation.

Example 1.4: Show how you would construct a data set for the National Health and Nutrition Examination Survey mentioned above.

Parameter: A numerical descriptive measure of a population. This value is almost always unknown.
π for population proportion
μ for population mean

Statistic: A numerical descriptive measure of a sample. This value is calculated from the observed data. We will use the following notation.
π̂ for sample proportion
ȳ for sample mean

Example 1.5: Suppose you are interested in estimating the proportion of the U.S. population that never wears sunscreen when outside for more than an hour on a sunny day. Identify the following:
Parameter of interest:
Statistic of interest:

[Figure: General Approach to Statistical Process]

TYPES OF STUDIES – (from PowerPoint in class and on website)

Two Main Types of Studies
Observational – researcher collects information on attributes or measurements of interest, but does not influence results.
Experimental – researcher deliberately influences events and investigates the effects of the intervention, e.g. clinical trials and laboratory experiments.

EXPERIMENTAL STUDIES – basic terms and concepts

1. Completely Randomized Design (CRD)
The treatments are allocated entirely by chance to the experimental units.

Example 1.6: Tomato Plants
Which of two varieties of tomatoes (A & B) yields a greater quantity of market-quality fruit?
Factors that may affect yield: soil fertility; exposure to wind/sun; soil pH levels; soil water content, etc.
Divide the field into plots and randomly allocate the tomato varieties (treatments) to the plots (units).
Situation 1: 8 plots – 4 get variety A
Situation 2: 8 plots – 4 get variety A (field diagram labeled UPHILL)

2. Blocking
Group (block) experimental units by some known factor and then randomize within each block in an attempt to balance out the unknown factors.

Example 1.7: Tomato Plants Again
It is recognized that there are two areas in the field: well drained and poorly drained.
Partition the field into two blocks and then randomly allocate the tomato varieties to plots within each block.
[Figure: field partitioned into a well-drained block and a poorly drained block]
How should we allocate varieties to the 12 plots?

Example 1.8: Comparing Three Pain Relievers for Headache Sufferers
How could we design an experiment? How could blocking be used to increase the precision of our experiment?

Example 1.9: Horse Leg Wraps
• 17 “boots” tested; each boot is tested n = 5 times. Why?
• Because of time constraints, not all boots were tested on the same day.
• 8 tested 1st day, 5 tested 2nd day, 4 tested 3rd day.
• Leg was placed in freezer and thawed before the 2nd and 3rd days of testing.
[Figure: Horse Leg Diagram]

Questions: What problems do you foresee with this experimental design? What actually happened?

Below is a plot of the force readings when no wrap was used on the leg during the three days of testing.
[Figure: force readings with no wrap, by day of testing]
What is the implication of the results shown above?

[Figure: Final Results of Horse Leg Wrap Study]
Q: What should have been done?

3. Using People as Experimental Units (Medical Studies/Clinical Trials)

Example 1.10: Cholesterol Drug Study
Suppose we wish to determine whether a drug will help lower the cholesterol level of patients who take it. How should we design the study?

Important Concepts for Experiments with Human Subjects
• control group:
  – Receive no treatment or an existing treatment
• blinding:
  – Subjects don’t know which treatment they receive
• double blind:
  – Subjects and administrators/diagnosticians are blinded
• placebo:
  – Inert dummy treatment
• placebo effect:
  – A common response in humans when they believe they have been treated.
  – Approximately 35% of people respond positively to dummy treatments (the placebo effect)

OBSERVATIONAL STUDIES

There are two major types of observational studies: prospective studies and retrospective studies.

1. Prospective Studies
Choose samples now, measure variables, and follow up in the future, e.g. choose a group of smokers and non-smokers now and observe their health in the future.

2. Retrospective Studies
Look back at the past, e.g. a case-control study.
Separate samples for cases and controls (non-cases). Why?
Look back into the past and compare histories. For example, we could choose two groups, lung cancer patients and non-lung cancer patients, and compare their smoking histories.

3. Cross-sectional Studies
Choose samples now, measure variables of interest, some of which may be retrospective in nature.

Important Note:
1. Observational studies should use some form of random sampling to obtain representative samples.
2. Observational studies cannot reliably establish causation.

Example 1.11: “Home Births Give Babies a Good Chance”, NZ Herald, 1990
An Australian report stated that babies are twice as likely to die during or soon after a hospital delivery as babies delivered in a home birth. The report was based upon simple random samples of home births and hospital births.
Comments:

Example 1.12: “Lead Exposure Linked to Bad Teeth in Children”, USA Today
The study involved 24,901 children ages 2 and older. It showed that the greater the child’s exposure to lead, the more decayed or missing teeth the child had.
Comments:

Additional Examples:
Example 1.13: Determine whether age at 1st pregnancy is a risk factor for cervical cancer.
Example 1.14: Determine what factors might influence the “success” of a duck nest.
Example 1.15: Test the toxicity of a new pesticide/herbicide on aquatic organisms.
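
Illustration of random allocation (Examples 1.6 and 1.7): The completely randomized design and the blocked design described under EXPERIMENTAL STUDIES both rely on allocating treatments by chance. The following is a minimal Python sketch of one way this could be done, not the procedure used in class. The function names, the plot numbering, the seed, and the split of the 12 plots into 6 well-drained and 6 poorly drained plots are all assumptions made for illustration.

```python
import random


def completely_randomized(plots, treatments, seed=None):
    """Allocate treatments to plots entirely by chance, in equal numbers."""
    if len(plots) % len(treatments) != 0:
        raise ValueError("number of plots must be a multiple of the number of treatments")
    rng = random.Random(seed)
    reps = len(plots) // len(treatments)
    # Build a list with each treatment repeated equally often, then shuffle it.
    labels = [t for t in treatments for _ in range(reps)]
    rng.shuffle(labels)
    return dict(zip(plots, labels))


def randomized_within_blocks(blocks, treatments, seed=None):
    """Randomize separately inside each block (known factor held fixed)."""
    rng = random.Random(seed)
    allocation = {}
    for block_name, plots in blocks.items():
        if len(plots) % len(treatments) != 0:
            raise ValueError("each block size must be a multiple of the number of treatments")
        reps = len(plots) // len(treatments)
        labels = [t for t in treatments for _ in range(reps)]
        rng.shuffle(labels)
        allocation[block_name] = dict(zip(plots, labels))
    return allocation


# Example 1.6: 8 plots, 4 of which get variety A (plot numbers are hypothetical).
crd_allocation = completely_randomized(list(range(1, 9)), ["A", "B"], seed=2014)

# Example 1.7: 12 plots in two blocks. The 6/6 split is an assumption;
# the notes do not say how many plots fall in each drainage area.
blocks = {
    "well-drained": [1, 2, 3, 4, 5, 6],
    "poorly drained": [7, 8, 9, 10, 11, 12],
}
blocked_allocation = randomized_within_blocks(blocks, ["A", "B"], seed=2014)

print(crd_allocation)
print(blocked_allocation)
```

In the blocked version each drainage area receives both varieties equally often, so a difference in drainage cannot be confused with a difference between varieties. In the completely randomized version that balance is left to chance.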
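
Illustration of data layout and sample statistics (SOME BASIC DEFINITIONS): The rule that each variable gets its own column and each row is a separate observation, together with the notation π̂ and ȳ, can be sketched in a few lines of Python. The records below are invented purely for illustration and are not NHANES data; the variable names are hypothetical as well.

```python
from statistics import mean

# Each dictionary is one observation (a row); each key is one variable (a column).
# All values are made up for illustration only.
observations = [
    {"id": 1, "hair_color": "brown", "sunscreen_use": "never", "sunburns_past_year": 2},
    {"id": 2, "hair_color": "blond", "sunscreen_use": "sometimes", "sunburns_past_year": 0},
    {"id": 3, "hair_color": "black", "sunscreen_use": "always", "sunburns_past_year": 1},
    {"id": 4, "hair_color": "red", "sunscreen_use": "never", "sunburns_past_year": 3},
    {"id": 5, "hair_color": "brown", "sunscreen_use": "rarely", "sunburns_past_year": 0},
]

# Sample proportion (pi-hat): fraction of sampled people who never wear sunscreen.
never_count = sum(1 for row in observations if row["sunscreen_use"] == "never")
pi_hat = never_count / len(observations)

# Sample mean (y-bar): average number of sunburns in the past year.
y_bar = mean(row["sunburns_past_year"] for row in observations)

print("sample proportion:", pi_hat)
print("sample mean:", y_bar)
```

Both numbers are statistics because they are computed from the sample. The corresponding parameters, π and μ, describe the whole population and would remain unknown unless a census were taken.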