Statistics 111 - Lecture 2
Collecting Data: Surveys and Sampling / Graphs of a Single Variable
May 27, 2008 Stat 111 – Lecture 2 Sampling and Graphing

Administrative Notes
• Lecture notes on website
• Office hours today from 3-4:30pm
• Homework 1 available on website
  • Due at beginning of class on Monday, June 1
• JMP “how to” guide for the homework on website

Outline for First Half of Lecture
• Introduction to Sampling
• Voluntary Response Samples
• Simple Random Samples
• Sources of Sampling Bias
• More complicated sampling schemes
• Preview of Inference
• Bias versus Variability
• Read: Section 3.3

Survey Definitions
• Population: entire group of objects or people about which information is sought
• Census: survey of an entire population
• Sample: survey that examines only a portion of the population
• Parameter: a numerical characteristic of the population
• Statistic: a numerical characteristic of the sample

Why Sample?
• Expense: cheaper than a census
  • Nielsen ratings: based on 5000 out of an estimated 105.5 million US households with TVs
• Time: quicker than a census
  • Exit polls: give news agencies valuable (?) information on election day in order to project the election before all votes (census) are counted
• Sampled units must sometimes be destroyed (or changed) to measure characteristics
  • Reliability studies: testing lifetime of light bulbs, strength of windshields, etc.
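
A minimal Python sketch of the parameter versus statistic distinction defined above. The population here is synthetic (the viewing-hours values, the seed, and the sample size of 200 are all assumptions made for illustration, not data from the lecture); only the Nielsen figures are taken from the slide.

```python
import random
import statistics

rng = random.Random(111)  # fixed seed so the sketch is reproducible

# Hypothetical population: weekly TV-viewing hours for 10,000 households.
# These values are synthetic and are NOT data from the lecture.
population = [rng.gauss(30, 8) for _ in range(10_000)]

# Parameter: a numerical characteristic of the whole population.
# Computing it requires a census (every unit is measured).
parameter = statistics.mean(population)

# Statistic: the same characteristic computed from a sample only.
sample = rng.sample(population, 200)
statistic = statistics.mean(sample)

# Fraction of the population that was actually examined.
sampling_fraction = len(sample) / len(population)

# Sampling fraction for the Nielsen example on the slide:
# 5000 households out of an estimated 105.5 million.
nielsen_fraction = 5000 / 105_500_000

print(f"parameter (census mean): {parameter:.2f}")
print(f"statistic (sample mean): {statistic:.2f}")
print(f"sampling fraction:       {sampling_fraction:.2%}")
print(f"Nielsen fraction:        {nielsen_fraction:.6%}")
```

The statistic will usually be close to, but not equal to, the parameter; how close depends on the sample size and on how the sample was chosen.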

Sampling Bias
• Systematic errors that result in a sample that is not representative of the overall population of interest
• Just like in experiments, we must be cautious of potential sources of bias in our sampling results

Voluntary Response Samples
• People choose to be included in the sample themselves by responding to a general appeal
  • E.g., Amazon consumer ratings
• Results are often biased because people with strong opinions (usually negative) are more likely to respond and be included in the sample

Hite Report: Women and Love (1987)
• Hite mailed 100,000 questionnaires to groups of women professionals, counseling centers, church societies, and senior citizens centers. Only 4.5% were returned
• 84% of women are “not satisfied emotionally with their relationships” (p. 804)
• 70% of all women “married five or more years are having sex outside of their marriages” (p. 856)
• 95% of women “report forms of emotional and psychological harassment from men with whom they are in love relationships” (p. 810)
• 84% of women report forms of condescension from the men in their love relationships (p. 809)

Simple Random Sampling (SRS)
• Just as an experiment can be improved by randomization, so can sampling
• Each individual in the population has an equal chance of being included in the sample
• Does not allow self-response or evaluators to influence the makeup of the survey (similar to double-blinding in experiments)

Example: Presidential Elections
• In 1912 Literary Digest began using surveys to predict US presidential elections
  • “The poll represents 30 years constant evolution and perfection...”
• In the 1936 Roosevelt vs Landon election, they polled 10 million voters. Among those who responded:
  • 1,293,669 said they would vote for Landon
  • 972,897 said they would vote for Roosevelt
• Reality: Landslide victory (61% to 37%) for FDR
• What went wrong?

Biases in Random Samples
• Randomization doesn’t correct for certain problems with sampling
• Bias 1: Undercoverage: some groups in the population are left out of the process of choosing the sample
• Bias 2: Nonresponse: sampled individuals cannot be contacted or do not cooperate
• E.g., 1936 presidential polls
  • Low response rate: less than 25% of those polled responded
  • Undercoverage of poorer demographics: the sample of voters relied heavily on lists of automobile and telephone owners, who were generally more affluent voters
• Well, at least we learned from those mistakes, right?
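
The effect of undercoverage can be sketched with a small simulation. Everything numeric below is an assumption chosen for illustration (the population size, the share of voters who appear on the sampling list, and the support rates in each group); none of it is historical data from the 1936 poll. The sketch compares a simple random sample of all voters with a sample drawn only from voters who appear on a list.

```python
import random

rng = random.Random(1936)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 voters. All proportions below are
# illustrative assumptions, not historical figures.
N = 100_000
population = []
for _ in range(N):
    on_list = rng.random() < 0.30          # voter appears on the sampling list
    p_support = 0.40 if on_list else 0.70  # assumed support for candidate A
    population.append((on_list, rng.random() < p_support))


def share_for_a(voters):
    """Proportion of the given voters who support candidate A."""
    return sum(votes_a for _, votes_a in voters) / len(voters)


# Parameter: true share in the whole population (what a census would give).
true_share = share_for_a(population)

# Simple random sample: every voter has the same chance of selection.
srs = rng.sample(population, 2_000)

# Undercovered sample: drawn at random, but only from voters on the list.
frame = [voter for voter in population if voter[0]]
list_sample = rng.sample(frame, 2_000)

print(f"true share for A:        {true_share:.3f}")
print(f"SRS estimate:            {share_for_a(srs):.3f}")
print(f"list-only estimate:      {share_for_a(list_sample):.3f}")
```

Under these assumptions the list-only sample is still random within its frame, yet its estimate is systematically too low, because the voters left out of the frame differ from those included. Increasing the sample size would not remove that gap.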

Recent Presidential Elections
• Using exit polls, several networks reported early that Gore would win Florida in the 2000 election
• Using exit polls, several pundits predicted Kerry would win Ohio in the 2004 election
• In general, we have gotten better, but we can still make mistakes (especially when the difference itself is so small)

More Potential Problems with Surveys
• Response Bias: respondents may not answer survey questions truthfully
  • Illegal or unpopular behavior such as drug usage
  • Controversial topics such as teen sexual activity
• Race or gender of the interviewer can influence answers to race- or gender-related questions
• Respondents often have trouble remembering past events, e.g., yearly nutrition and health surveys

More Potential Problems with Surveys
• Wording of questions can be confusing or intentionally lead the respondent
  • Do you favor a ban on disposable diapers?
  • It is estimated that disposable diapers account for less than 2% of the trash in today’s landfills. In contrast, beverage containers, third-class mail and yard wastes account for 21% of the trash in landfills. Given this, would it be fair to ban disposable diapers?
• Complicated multi-part forms that require lots of skipped questions lead to a drop-off in response

More Complicated Random Surveys
• A weakness of simple random sampling is that it cannot use extra information about the population (similar to blocking in experiments)
  • What if you know a particular group is missing from your sample?
• Stratified random sampling: individuals are divided into groups called strata
  • Simple random sampling is done within each stratum
• National surveys can be even more complicated by using multistage sampling (cheaper)

Dinner and Drugs Study
• Study by CASA that linked frequent family dining to reduced risk of substance abuse
  • “There is no more important thing that a parent can do”
• Some problems with the study relate to what we know about surveys and observational studies
• Problem 1: undercoverage of minority groups
  • Survey not representative of the teen population

Dinner and Drugs Study II
• Problem 2: high level of non-response in the survey
  • Many households declined to answer, didn’t complete the survey, or denied permission to use it
• Problem 3: observational study with lots of potential confounding variables
  • Drug use itself wasn’t measured, but rather a risk score for drug use
  • Study isn’t adjusted for age, which is also associated with drug use
• No proof of causation!

After Break
• Exploring Data: Graphical summaries of a single variable
• Moore, McCabe and Craig: Section 1.1

Break!
• 5 minutes
• More awesome statistics to come

Outline for Second Half of Lecture
• Characteristics of Distributions
  • Center, spread, shape, outliers
• Plotting Distributions of Data
  • Boxplots
  • Histograms (no stem and leaf plots)
  • Density Curves
• Read: Section 1.1

January 29, 2008 Stat 111 - Lecture 4 - Graphing

Definitions
• Variable: any characteristic that takes different values for different individuals
• Categorical variables place an individual into one of several groups
  • Examples: gender, race
• Quantitative variables take on numerical values that are usually considered as continuous
  • Examples: height, age, wages

Distributions
• A distribution describes what values a variable takes and how frequently these values occur.
• The distribution of a variable can be described graphically and numerically in terms of:
  • Center: where are most of the values located?
  • Spread: how variable are the values?
  • Shape: is the distribution symmetric or skewed? Are there multiple peaks or just one?
  • Outliers: are there certain values that seem surprisingly large or small?
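
As a rough illustration of "what values a variable takes and how frequently these values occur", the sketch below tabulates a categorical variable and gives simple center and spread summaries for a quantitative one. The survey values are made up for the example, and using the median and the range as summaries is one possible choice, not the lecture's prescribed method.

```python
from collections import Counter
import statistics

# Hypothetical class survey; the values are made up for illustration.
majors = ["econ", "bio", "econ", "math", "bio", "econ", "math", "econ"]
heights_in = [62, 64, 65, 66, 66, 67, 68, 70, 71, 79]

# Categorical variable: the distribution is each category with its frequency.
counts = Counter(majors)
proportions = {major: n / len(majors) for major, n in counts.items()}

# Quantitative variable: simple summaries of center and spread.
center = statistics.median(heights_in)        # middle of the sorted values
spread = max(heights_in) - min(heights_in)    # range of the values

print("counts:     ", dict(counts))
print("proportions:", proportions)
print("median height:", center)
print("range of heights:", spread)
```

Shape and outliers are easier to judge from a plot than from these numbers; here, for instance, the value 79 sits well away from the rest and would be a candidate outlier.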

Barplots and Pie Charts
• For categorical variables, we can graph the distribution using bar plots and pie charts

Barplots and Pie Charts
• Pie charts are generally not as useful as bar plots
• Need to have all categories to make a pie chart
  • harder to compare subsets of categories
• Scale of pie charts can sometimes be misleading
  • harder to see small differences

Boxplots
• Box plots are an effective tool for conveying information about continuous variables
• Box contains the central 50% of the data, with a line indicating the median
  • Median is the value with 50% of the data on either side
• Whiskers contain most of the rest of the data, except for suspected outliers
  • Outliers are suspiciously large or small values

Boxplot: Shoe Size of Stat 111 Class
• Almost all values are between 5 and 13
• 50% of values are between 7.5 and 10
• Center (Median) is around 8.5
• Couple of suspected outliers: 14 and 14.5

Summary of Boxplots
• Useful for displaying center and spread of a distribution, as well as potential outliers
• However, a boxplot doesn’t really give us much of an idea of the shape of the distribution
  • Histograms are much better graphical summaries of shape
• We’ll see boxplots again in Chapter 2, for comparing distributions across groups

Histograms
• Histograms emphasize the frequency of different values in the distribution
• X-axis: Values are divided into bins
• Y-axis: Height of each bar is the frequency with which values from that bin appear in the dataset

Another Example: Height in Stat 111
• Vertical axis is sometimes the density (or relative frequency): equal to the frequency of the bin divided by the total number of observations

Histograms versus Boxplots
• Both graphs give a good idea of the spread
• Boxplots may be a little clearer in terms of the center and outliers in a distribution
[Figure labels: center; outliers; spread of likely values]

Histograms versus Boxplots
• Histograms are much more effective at displaying the shape of a distribution
  • Skewness: departure from left-right symmetry
  • Multi-modality: presence of multiple high-frequency values
[Figure labels: clearly not symmetric; not symmetric?; possible second peak?]

Symmetry - Histograms vs. Boxplots

Density Curves
• Often easier to examine a distribution with a smooth curve instead of a histogram
• Example: vocabulary scores from 947 seventh graders in Gary, Indiana

Example with Test Score Data
• Number of scores less than 6 in the population is 287 out of 947, so the relative frequency is 0.303
• Using a density curve (normal distribution), the approximate relative frequency is 0.293

Approximations
• Real data will never exactly fit a density curve, i.e., be exactly symmetric or normally distributed
• We will talk later in the course about how to fit these density curves, and we will use them to make probability calculations

Next Class - Lecture 3
• Using JMP
• Exploring Data: Numerical summaries of a single variable
• Moore and McCabe: Section 1.2
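
For review, the histogram bookkeeping described in this half of the lecture (bins, frequency, and relative frequency as frequency divided by the total number of observations) can be sketched in a few lines of Python. The heights, the bin width of 3, and the starting edge of 60 are assumptions for illustration; the class data and the bins used on the slides are not given. The last lines simply recompute the relative frequency quoted for the vocabulary scores.

```python
from collections import Counter

# Hypothetical heights in inches; synthetic values, not the class data.
heights = [61, 62, 63, 63, 64, 65, 65, 66, 66, 66, 67, 68, 68, 70, 72, 74]

bin_width = 3  # assumed bin width; the slides do not specify one
start = 60     # assumed left edge of the first bin

# Frequency: number of observations in each bin [left, left + bin_width),
# keyed by the left edge of the bin.
freq = Counter(start + ((h - start) // bin_width) * bin_width for h in heights)

# Relative frequency: bin frequency divided by the total number of observations.
rel_freq = {left: n / len(heights) for left, n in sorted(freq.items())}

for left, share in rel_freq.items():
    print(f"[{left}, {left + bin_width}): count={freq[left]}, relative freq={share:.3f}")

# Relative frequency quoted for the vocabulary scores: 287 of 947 below 6.
rel_below_6 = 287 / 947
print(f"scores below 6: {rel_below_6:.3f}")
```

The relative frequencies always sum to 1, which is what lets a histogram drawn on this scale be compared with a density curve.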