MATH 10: Elementary Statistics and Probability
Chapter 1: Sampling and Data
Tony Pourmohamad
Department of Mathematics
De Anza College
Spring 2015

Introduction Population and Sample Data Types

What is Statistics?

Statistics: The collection of methods for planning experiments, obtaining data, and then organizing, summarizing, analyzing, interpreting, presenting, and drawing conclusions based on data.
• Statistics is the study of data!
• Statistics helps us answer questions that arise in several fields:
  - Ecology: How many animals of a given species live in a particular area?
  - Health: Is a new drug effective against a disease? Is drug A more effective than drug B?
  - Environmental Sciences: Weather forecasts and prediction of extreme environmental patterns.

Steps in Learning From Data

We will learn about:
• Designing the data collection process (experimental design)
• Preparing and analyzing the data (descriptive statistics, models, hypothesis testing, prediction)
• Reporting conclusions
We will study:
• Probability: Theoretical aspects of randomness → What is the probability that I select an ace of spades out of a deck of cards?
• Inference: Learning from data, taking randomness into account using probability → Given the heights of the people in this room, what is the average height of ALL college statistics students?

Example: Nitrogen Fertilizers

• Study of the effects of nitrogen fertilizer in wheat production: 15 fields and 5 nitrogen fertilizers. Three fields were randomly assigned to each of the nitrogen fertilizers. The same variety of wheat was planted in all fields and they were cultivated in the same manner. The number of pounds of wheat per acre was recorded.
• Goal: determine the optimal level of nitrogen to apply to any wheat field.
• After determining the amount of nitrogen that yielded the largest production of wheat, the farmer concluded that similar results would hold for fields with the same characteristics as the ones in the study.
• Is this conclusion justified?

Population and Sample

• In statistics, we generally want to study a population.
Population: You can think of a population as a collection of persons, things, or objects under study.
  - How do we study the population?
• We study a population by selecting a sample.
Sample: A sample is a portion (or subset) of members from the population, normally collected using a random method.

Parameters and Statistics

Parameter: A parameter is a number that is a descriptive measurement of the population.
• Typically parameters are unknown. Why is that?
Statistic: A statistic is a number that is learned/observed/computed from a sample.

Example: Parameters and Statistics

• Examples of a parameter:
  - Proportion: In a recent quarter, 39% of all De Anza College students were over age 25.
  - Average: In a recent quarter, the average age of all De Anza College students was 27.1 years.
  - Median: In a recent quarter, half of all De Anza College students were 22 years old or younger.
• Examples of a statistic:
  - Proportion: 41% of a random sample of 200 students were over age 25.
  - Average: The average age of a random sample of 200 De Anza College students was 28.3 years.
  - Median: In a sample of 200 De Anza College students, 50% of the students were 22 years old or younger.

Questions to Ask Yourself

• Does a sample statistic always have the same numerical value as the population parameter? Why or why not?
• Are sample statistics equal for all samples taken from the same population? Why or why not?
• Does a sample statistic always have a different numerical value than the population parameter?

Variables and Data

Variable: A variable is the description in words of the characteristic of interest.
• A variable will be a sentence, not a number.
Data: The data are the information collected about the variable for individuals in the population or sample.
• The variable is a sentence explaining the question you are asking in order to obtain information.
• The data are the information obtained as answers to the question.

Example #1

Suppose we are studying the commutes of De Anza College students to school from home.
• What is the population?
• What is one possible example of a sample?
• Some examples of variables and data:
  - Variable: "the distance that a student commutes to De Anza College"
  - Data: 2.5 miles, 8.4 miles, 0.25 miles, 52 miles, ...
  or
  - Variable: "how a De Anza student commutes to school"
  - Data: car, car, bus, bike, walk, car, bus, bus, car, bike, car

Data Types -- Quantitative Data

• Quantitative Data: Consist of numbers representing counts or measurements.
• There are two types of quantitative data:
  - Discrete data: The number of possible values is either a finite number or a countable number. E.g., number of animals of a given species.
  - Continuous data: Result from infinitely many possible values that are not restricted to certain specified values (such as integers). E.g., height.

Data Types -- Qualitative Data

• Qualitative Data: Also called categorical or attribute data. These can be separated into different categories that are distinguished by some non-numeric characteristic.
• There are two types of qualitative data:
  - Nominal data: Data that consist of names, labels or categories only. These data cannot be arranged in an ordering scheme.
    E.g., blood type.
  - Ordinal data: Data that can be arranged in some order, but differences between data values either cannot be determined or are meaningless. E.g., shirt size.

Examples

Consider causes of death in the US in 1992. Below is a list of the causes and the number of lives that each one claimed. We ordered the causes and assigned consecutive integers.

Rank  Cause                      Total
1     Heart Diseases             717,706
2     Malignant neoplasms        520,578
3     Cerebrovascular Diseases   143,769
4     Pulmonary Diseases          91,938
5     Accidents                   86,777
6     Pneumonia and influenza     75,719
7     Diabetes                    50,067
8     HIV                         33,566
9     Suicide                     30,484
10    Homicide                    25,488

Examples

• Discrete data examples:
  - The number of new cases of breast cancer reported yearly from 1995 to 2002.
  - The number of cows in a field
  - The number of students in this room
• Continuous data examples:
  - Temperature
  - Age
  - Weight

Examples

• Nominal data examples:
  - Colors: blue, red, green, yellow, etc.
  - Cars: Toyota, Honda, Subaru, Ferrari, etc.
  - Feelings: Sad, happy, mad, etc.
• Ordinal data examples:
  - Letter Grades: A, B, C, D, F
  - Weight: Small, medium, large
  - Speed: Slow, average, fast

Data Gathering

There are many ways to gather data, such as:
• Census: Collection of data from every member of the population.
• Sampling: Collecting data from a sub-collection of members of the population. Normally collected using a random method.
• Observational Study: Collect data with NO CONTROL over possible affecting factors.
• Designed Experiment: Data are collected by means of an experiment where the most important factors are subject to control.
• Examples?

Designed Experiments vs. Observational Studies

In a New York Times article about hormone therapy for women, a reporter wrote that "researchers say observational studies painted a falsely rosy picture of hormone replacement because women who opt for the treatments are healthier and have better habits to begin with than women who do not."

Designed Experiments

• Treatments: Different values or components of the explanatory variable applied in an experiment.
• Response: The dependent variable in an experiment; the value measured for change at the end of an experiment.
• Control Group: A group that receives an inactive treatment but is otherwise managed exactly the same as the other groups.
• Placebo: An inactive treatment that has no effect on the explanatory variable.
• Blinding: Not telling participants which treatment they are receiving.
• Double Blind: The act of blinding both the subjects of an experiment and the researchers who work with the subjects.

Big Problem of Observational Studies

Confounding: Confounding occurs when the effects of variables are mixed such that the individual effects are indeterminable. When the effects of multiple factors on a response cannot be separated, it becomes difficult or impossible to draw valid conclusions about the effect of each factor.
• Example: If only the people in a particular age group are given a particular drug, the drug may look effective/ineffective.
• Designed experiments can be constructed to avoid confounding variables.

Other Problems in Data Gathering

• Problems with Samples:
  - A sample should be representative of the population.
  - A sample that is not representative of the population is called "biased".
  - Non-response or refusal of subjects to participate, or self-selected samples.
  - Sample Size Issues.
• Collecting data or asking questions in a way that influences the response.
• Causality: A relationship between two variables does not necessarily imply that one causes the other. They may both be affected by some other variable.
• Self-Funded or Self-Interest Studies
• Misleading Use of Data: improperly displayed graphs, incomplete data, lack of context.

Example #2

• Study I: Employees of a company are randomly divided into two groups. Group A gets classroom training from an instructor who is available to help and answer questions; Group B gets training via online software with an online discussion board available to get help and answers to questions.
• Study II: Researchers are studying whether retirement age affects the rate of memory problems in senior citizens. A survey of retired senior citizens showed that those who had retired earlier tended to have a higher incidence of memory problems after retirement than those who had retired at an older age.
1. For each of the above, what type of study is it?
2. What problem can you see in Study II?

Example #2 Continued

• Study III: 300 randomly selected individuals are asked if they had been on a diet in the last 8 weeks and how much their weight has changed over the last 8 weeks. Weight changes for dieters and non-dieters are compared.
• Study IV: 100 individuals are put on a low fat diet, 100 on a low carb diet, and 100 eat their normal diet. Their weight change over an 8 week period is recorded.
1. Which weight loss study (III or IV) do you think would give the best information about the effect of diet on weight loss? Why?

Example #3

A large city is proposing a parcel tax to support education. Each property owner would be assessed a tax of $100 per property per year. The parcel tax will be voted on by voters in the next election.
It will pass if 2/3 of the voters vote in favor of the tax.
• I. A group of parents and teachers supporting the parcel tax randomly select and call residents in the city. They identify themselves as members of the Parent Teachers Association for the school system and ask the person who answers the call whether they support the parcel tax.
• II. A TV news station in the city conducts a "Facebook" survey. Viewers are asked whether they favor or oppose the tax and are given instructions to visit the TV station's Facebook page to give their opinion. The poll is publicized and responses are solicited by announcements on the TV station's evening news.

Example #3 Continued

• III. A professional polling organization conducts a survey by calling randomly selected residents in the city. If the resident is a registered voter, he or she is asked his or her opinion about the proposed parcel tax. They are asked whether they favor the tax, oppose the tax, or have no opinion. These three choices are presented to the individual in random order, so that not all respondents hear the choices in the same order.
1. Which survey do you think would produce the most accurate prediction of the election results?
2. For each of the other two surveys, what problems do you think there might be with the information obtained?

Randomization and Sampling

• Simple random sample of size n: All samples of n members from the population have the same chance of being chosen.
• Systematic Sampling: We randomly select a starting point and then take every k-th element of the population.
• Convenience Sampling: Collect results that are convenient to get.
• Stratified Sampling: The population is subdivided into at least two different groups whose members share the same characteristics (e.g., age bracket), and then a sample is drawn from each subgroup.
• Cluster Sampling: Divide the population into sections (or clusters), then randomly select some of those clusters, and then choose all the members from those clusters.

Example #4

Determine the type of sampling method used:
1. To form a recreational soccer team, a soccer coach randomly selects 6 players from a group of boys ages 8 to 10, 7 players from a group of boys ages 11 to 12, and 3 players from a group of boys ages 13 to 14.
2. For a survey of human resource (HR) personnel at high tech companies, a pollster interviews all HR personnel in each of 5 randomly selected high tech companies.
3. In a survey of engineering salaries, a researcher selects engineers to interview by randomly selecting 50 women engineers and randomly selecting 50 men engineers.
4. A medical researcher for a hospital interviews every third cancer patient from a list of cancer patients at that local hospital.

Example #4 Continued

Determine the type of sampling method used:
1. A high school counselor uses a computer to generate 50 random numbers and then selects students whose names correspond to the numbers.
2. A student interviews classmates in his algebra class to determine how many pairs of jeans a student owns, on average.
3. In a study to learn what types of after school child care are used in their district, a school district administrator randomly selects 6 classes at each school and surveys all parents with children in the selected classes.

Remember...

Data may be useless if not collected in an appropriate way.
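
Sketch: Sampling Methods in Code

The five methods listed under Randomization and Sampling can be sketched in a few lines of Python using only the standard `random` module. This is a minimal illustration, not part of the course material: the population of 100 student ID numbers, the grouping into strata and classes, the seed, and every function name are assumptions chosen for the example.

```python
import random


def simple_random_sample(population, n, rng):
    """Every set of n members has the same chance of being chosen."""
    return rng.sample(population, n)


def systematic_sample(population, k, rng):
    """Pick a random starting point among the first k members,
    then take every k-th member after it."""
    start = rng.randrange(k)
    return population[start::k]


def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Split the population into strata (groups sharing a characteristic),
    then draw a random sample from each stratum."""
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep every member of each."""
    chosen = rng.sample(clusters, n_clusters)
    return [member for cluster in chosen for member in cluster]


def convenience_sample(population, n):
    """Take whatever is easiest to reach (here, the first n members).
    No randomness is involved, so the result may be biased."""
    return population[:n]


# Hypothetical population: student ID numbers 1 through 100.
students = list(range(1, 101))
rng = random.Random(2015)  # fixed seed (an arbitrary choice) for repeatability

srs = simple_random_sample(students, 10, rng)
systematic = systematic_sample(students, 10, rng)

# Assumed strata: four blocks of 25 consecutive IDs (e.g., four age brackets).
stratified = stratified_sample(students, lambda s: (s - 1) // 25, 3, rng)

# Assumed clusters: ten "classes" of ten students each.
classes = [students[i:i + 10] for i in range(0, 100, 10)]
cluster = cluster_sample(classes, 2, rng)

convenience = convenience_sample(students, 10)
```

Note the contrast between the last two random designs: stratified sampling draws a few members from every group, while cluster sampling draws every member from a few groups. That distinction is the key to classifying the scenarios in Example #4.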