Statistics 13
Elementary Statistics
Summer Session I 2012

Lecture Notes 1: An Introduction to Statistics¹

Definition 1.1 Statistics is the science of data. This involves collecting, classifying, summarizing, organizing, analyzing, and interpreting numerical information.

And more views from others...
• ‘There are three kinds of lies: lies, damned lies, and statistics.’ by Mark Twain.
• ‘The average human has one breast and one testicle.’ by Des McHale.
• ‘Statistics may be defined as “a body of methods for making wise decisions in the face of uncertainty.”’ by W.A. Wallis.
• ‘I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.’ by Hal Varian, chief economist at Google.²

For curious students: ‘How to Lie with Statistics’ by Darrell Huff.³

Definition 1.2 Descriptive statistics utilizes numerical and graphical methods to look for patterns in a data set, to summarize the information revealed in a data set, and to present that information in a convenient form.

Definition 1.3 Inferential statistics utilizes sample data to make estimates, decisions, predictions, or other generalizations about a larger set of data. Inferences make statements about populations (the data set about which information is desired).

¹ Last update: June 25, 2012
² New York Times: http://www.nytimes.com/2009/08/06/technology/06stats.html
³ It won’t be tested, but it is probably fun to read! Don’t read it when you cannot get to sleep; it may make things worse...

Definition 1.4 An experimental unit is an object about which we collect data.
• A measurement results when a variable is actually measured on an experimental unit.
• A set of measurements is called data.

Definition 1.5 A population is a set of units that we are interested in studying.

The definition of population depends on the particular study.
• If you are interested in studying the height of US citizens, the population is all the heights of US citizens.
• If you are interested in studying the height of Brazilian citizens, the population is all the heights of Brazilian citizens.

Definition 1.6 A variable is a characteristic or property of an individual population unit.

How many variables have you measured in your study?
• Univariate data: One variable is measured on a single experimental unit.
• Bivariate data: Two variables are measured on a single experimental unit.
• Multivariate data: More than two variables are measured on a single experimental unit.

What is the difference between populations and variables?
• A population is a set of existing units such as people, objects, transactions, or events.
• A variable is a characteristic or property of an individual population unit, such as the height of a person, the time of a reflex, the amount of a transaction, etc.

Definition 1.7 A sample is a subset of the units of a population.

The four major methods of collecting data:
1. A published source: These data have already been collected by someone else and are available in a published source.
2. A designed experiment: These data are collected by a researcher who exerts strict control over the experimental units in a study. These data are measured directly from the experimental units.
3. A survey: These data are collected by a researcher asking a group of people one or more questions. Again, these data are collected directly from the experimental units or people.
4. An observational study: These data are collected directly from experimental units by simply observing the experimental units in their natural environment and recording the values of the desired characteristics.

Definition 1.8 A statistical inference is an estimate, prediction, or some other generalization about a population based on information contained in a sample.

What is the difference between descriptive and inferential statistics?
• Descriptive statistics utilizes numerical and graphical methods to look for patterns, to summarize, and to present the information in a set of data.
• Inferential statistics utilizes sample data to make estimates, decisions, predictions, or other generalizations about a larger set of data, such as a population.

Definition 1.9 A measure of reliability is a statement (usually quantitative) about the degree of uncertainty associated with a statistical inference.

Example 1 Gallup poll of teenagers. A Gallup Youth Poll was conducted to determine the topics that teenagers most want to discuss with their parents. The findings show that 46% would like more discussion about the family’s financial situation, 37% would like to talk about school, and 30% would like to talk about religion. The survey was based on a national sampling of 505 teenagers, selected at random from all U.S. teenagers.
• Sample: the set of 505 teenagers selected at random from all U.S. teenagers.
• Population: the set of all teenagers in the U.S.
• Variable of interest: the topics that teenagers most want to discuss with their parents.

1. How is the inference expressed?
The inference is expressed as the percentage of the population that wants to discuss each particular topic with their parents.
2. Newspaper accounts of most polls usually give a margin of error (e.g., plus or minus 3%) for the survey result. What is the purpose of the margin of error and what is its interpretation?
The “margin of error” is the measure of reliability. It measures the uncertainty of the inference.

Definition 1.10 Quantitative data are measurements that are recorded on a naturally occurring numerical scale.
Quantitative discrete:
• Number of wins by a baseball team in a season (0, 1, 2, 3, etc.)
• Number of incidents of fire on campus in a year (0, 1, 2, etc.)
• Number of customers of a store in a day

Quantitative continuous:
• Time it takes for me to commute between my home and school (between 5 min and 15 min)
• Time to complete a survey

Definition 1.11 Qualitative data are measurements that cannot be measured on a natural numerical scale; they can only be classified into one of a group of categories.⁴

Example 2 National Bridge Inventory. All highway bridges in the United States are inspected periodically for structural deficiency by the Federal Highway Administration (FHWA). Data from the FHWA inspections are compiled into the National Bridge Inventory (NBI). Several of the nearly 100 variables maintained by the NBI are listed next. Classify each variable as quantitative or qualitative.

Qualitative:
• Toll bridge (yes or no)
• Condition of deck (good, fair, or poor)
• Type of route (interstate, U.S., state, county, or city)

Quantitative:
• Length of maximum span (feet)
• Number of vehicle lanes
• Average daily traffic
• Bypass or detour length (miles)

Definition 1.12 A representative sample exhibits characteristics typical of those possessed by the target population.

Definition 1.13 A random sample of n experimental units is a sample selected from the population in such a way that every different sample of size n has an equal chance of selection.

Example 3 Example 1 Cont’d. Refer to Example 1.
3. Is the sample representative of the population?
Yes! Since the sample was a random sample, it should be representative of the population.

Definition 1.14 Statistical thinking involves applying rational thought and the science of statistics to critically assess data and inferences.

⁴ Often, we assign arbitrary numerical values to qualitative data for ease of computer entry and analysis.
But these assigned numerical values are simply codes: they cannot be meaningfully added, subtracted, multiplied, or divided.

Example 4 Success/failure of software reuse. The PROMISE Software Engineering Repository, hosted by the University of Ottawa, is a collection of publicly available data sets to serve researchers in building predictive software models. A PROMISE data set on software reuse, saved in the SWREUSE file, provides information on the success or failure of reusing previously developed software for each project in a sample of 24 new software development projects. (Data source: IEEE Transactions on Software Engineering, Vol. 28, 2002.) Of the 24 projects, 9 were judged failures and 15 were successfully implemented.
• Experimental units: the 24 new software development projects.
• Population: the set of all new software development projects.
• Variable of interest: the outcome of reusing previously developed software for the new software development projects. Since the outcome could only be a success or a failure, the variable is qualitative.

1. Critically evaluate the statement “Since 15/24=0.625, it follows that 62.5% of all new software development projects will be successfully implemented.” Is it correct?
No! In the sample, 15 of the 24 projects, or 62.5%, were judged as successfully implemented. This is the success rate of the sample. It would be a good estimate of the population percentage of successfully implemented projects, but it is only an estimate. If we took another sample of size 24, the percentage of successful projects would not necessarily be 62.5%.

Definition 1.15 Selection bias results when a subset of the experimental units in the population is excluded so that these units have no chance of being selected in the sample.

Example 5 Alf Landon vs. Franklin D. Roosevelt. During the 1936 presidential campaign, a classic instance of faulty sampling occurred.
The Literary Digest, a respected magazine, polled about 10 million Americans (the largest number ever) and got responses from about 2.4 million. For ease, the individuals surveyed were selected from registered automobile owners, registered telephone subscribers, and the magazine’s own readers. The poll showed that Landon would likely be the overwhelming winner and FDR would get only 43% of the votes. However, FDR won the election with 62% of the votes (one of the most lopsided results in history in terms of electoral votes). The magazine was completely discredited because of the poll, and was soon discontinued.

Why? The erroneous prediction occurred because the voters used in the sample were not representative of the general voting population. In 1936, telephones and automobiles were unaffordable to the average voter, and “John Q. Public” didn’t read Literary Digest either. Thus, the majority of voters didn’t really care what Literary Digest had to say, and didn’t even bother submitting the survey.

Definition 1.16 Nonresponse bias results when the researchers conducting a survey or study are unable to obtain data on all experimental units selected for the sample.

Definition 1.17 Measurement error refers to inaccuracies in the values of the data recorded.⁵ In surveys, this kind of error may be due to ambiguous or leading questions and the interviewer’s effect on the respondent.

Example 6 A biologist studying the reproduction of a particular strain of bacterium might encounter random errors due to slight variation of temperature or light in the room.

For your interest: Random error is one type of measurement error. It is random in nature and very difficult to predict. Another type of measurement error is systematic error. Random error can be estimated by comparing multiple measurements, and reduced by averaging multiple measurements. In contrast, systematic error cannot be discovered this way because it always pushes the results in the same direction.
If the cause of a systematic error can be identified, then it can usually be eliminated.

⁵ Measurement error is not the same as a ‘mistake’!
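
The distinction between random and systematic error can be illustrated with a small simulation. The sketch below is not part of the original notes: the true value, the noise level, and the bias are all hypothetical numbers chosen for illustration, and the helper `simulate_measurements` is an invented name. It assumes each reading equals the true value plus a constant bias plus Gaussian noise.

```python
import random
import statistics


def simulate_measurements(true_value, n, random_sd, systematic_bias, seed=0):
    """Return n simulated readings: true value + constant bias + Gaussian noise.

    All parameters are illustrative assumptions, not values from the notes.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        true_value + systematic_bias + rng.gauss(0.0, random_sd)
        for _ in range(n)
    ]


TRUE_VALUE = 37.0  # hypothetical true temperature, degrees Celsius
BIAS = 0.3         # hypothetical miscalibration of the instrument

# Instrument A: random error only. Instrument B: same noise plus a constant bias.
readings_random_only = simulate_measurements(TRUE_VALUE, 1000, 0.5, 0.0)
readings_with_bias = simulate_measurements(TRUE_VALUE, 1000, 0.5, BIAS)

# Comparing repeated readings estimates the size of the random error...
spread = statistics.stdev(readings_random_only)

# ...and averaging many readings reduces its effect on the result.
mean_random_only = statistics.mean(readings_random_only)

# Averaging does nothing about the bias: it stays in the mean.
mean_with_bias = statistics.mean(readings_with_bias)

print(f"estimated random error (sd): {spread:.3f}")
print(f"mean, random error only:     {mean_random_only:.3f}")
print(f"mean, with systematic bias:  {mean_with_bias:.3f}")
```

In this sketch the average of the unbiased readings lands close to the true value, while the average of the biased readings stays shifted by roughly the bias, no matter how many readings are taken. The readings alone do not reveal that shift; it only shows up when they are compared against a known reference.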