The Nature of Statistics: The art of learning about and understanding our world through data.

Essentials: The Nature of Statistics (a.k.a. the bare minimum I should take along from this topic)
• Definitions and relationships as presented on the Anatomy of the Basics: Statistical Terms and Relationships sheet
• Identification of variables and their characteristics
• Careful review of data and their presentation
• Providing a context for the data
• Why use percentages rather than numeric counts when making comparisons

69.2    80    35,000
• What do you know about these numbers?
• What do they mean to you?
• What is missing?

Okay, so What is Statistics? (or is that What ARE Statistics?)
• Statistics is the study of how to collect, organize, analyze, interpret and report numerical information in order to make decisions.
• Statistics are the numeric data we use to better understand our world. They may take the form of frequencies, means, percentages, variances, etc.

What is a Study?
• 3 Types:
  – Observational: observe and measure; can identify association, not causation.
  – Experimentation: impose a treatment and observe characteristics; can help establish causation.
  – Simulation: use computers to simulate situations that are not practical to do in real time.

Basic Terminology
• DATA: Numbers with a context, i.e. numbers with meaning.
  – Examples: not 48.2, but 48.2 kg; not 5.23, but 5.23 inches.
• VARIABLE: A characteristic or property of an individual population unit that varies from one person or thing to another.
  – Examples: age, square footage, and assessed value represent three variables associated with homes in Oneonta.
  – Variables have values. Example: The variable hair color has the values brown, blonde, red, etc.
• UNIT (Element): Any individual member of the defined population.
  – Examples: Each bottle of soda in a production run is a unit; each penny in a roll of pennies is a unit; each person enrolled in a class is a unit.

Data: One variable (here unidentified, i.e. no context), multiple values
[Tables: “Raw” data (N=160) and “Organized” raw data (N=160); one value is labeled “Unit”; 73 “different” numbers]

Time Period Otsego Lake was Frozen (days): Raw Data and Grouped Data
[Figure: histogram “Otsego Lake: Days Frozen, 1849-50 to 2009-10”; x-axis: Days (bins 0-24, 25-49, 50-74, 75-99, 100-124, 125-149); y-axis: Frequency (0 to 70)]

Data: Two variables (year and days), multiple values
[Figure: Time Period Otsego Lake was Frozen: Mean Days/Decade]
So is the Greenhouse Effect at work here? To be studied through further statistical analysis, such as the use of ANOVA…

Anatomy of the Basics: Statistical Terms and Relationships
Statistics is the study of how to collect, organize, analyze, interpret and report numerical information.
• Descriptive Statistics: methods for organizing and summarizing information. E.g. Number of students in this class by major, baseball standings, housing sales by month.
• Inferential Statistics: methods for drawing conclusions and measuring the reliability of those conclusions using sample results. E.g. Political views of all 4-year college students.

Population vs. Sample
• Population: all individuals, items, or objects whose characteristics are being studied.
• Parameter: numerical characteristic of a population.
• Census: data collected from ALL members of the population.
• Sample: a portion of the population selected for study.
• Statistic: numerical characteristic of a sample.

Variable: a characteristic or property of an individual unit. Variables have values.
• Qualitative: a variable that cannot be measured numerically. E.g. Gender, eye color.
• Quantitative: a variable that can be measured numerically. E.g. Income, height, number of siblings one has.
  – Discrete: a variable whose values are countable. It can only assume certain values, with no intermediate values. E.g. Number of auto accidents in Oneonta in 1998.
  – Continuous: a variable that can assume any numerical value over an interval or intervals. E.g. Time.

Scaling of Variables (Measurement Levels)
• No Arithmetic Operations: individual observations can only be categorized.
  – Nominal: grouping individual observations into qualitative categories or classes. E.g. Grouping individuals by whether they are left-handed or right-handed.
  – Ordinal: individual observations are assigned a number or “ranking.” There is a sense of “more than,” but you cannot say “how much” more than. E.g. Military ranks.
• Arithmetic Operations: individual observations have meaningful numeric values.
  – Interval: variables have no true zero point. Differences are meaningful, but you cannot say how many times more. E.g. Temperature (°F or °C), IQ scores.
  – Ratio: variables have a true zero point. Can say how many times more. E.g. Weight, height.

Basic Terminology: Population
• POPULATION:
  – Complete collection of all elements or units (usually people, objects, transactions, or events) that we are interested in studying.
  – In terms of data, a population is the collection of all outcomes, responses, measurements, or counts that are of interest.
• CENSUS: A complete enumeration (or accounting) of the population, i.e. collecting data from every element (or unit) in the population.
• PARAMETER: A numeric value associated with a population (e.g. the average height of ALL students in this class, given that the class has been defined as a population).

Basic Terminology: Sample
• SAMPLE: A subset of a population from which information is collected.
  – Example: 25 cans of corn (sample) randomly obtained from a full day's production (population).
• STATISTIC: A numeric value associated with a sample.
  – Example: the average height of 10 individuals randomly selected from the class (defined population).
• INFERENCE: An estimate, prediction, or some other generalization about a population based on information contained in a sample.
  – Example: Based upon a randomly selected sample of 25 flights at JFK International Airport (the sample; individual flights are units) taken from all flights on Dec. 24, 2009 (defined population), we can state with a degree of confidence that the mean delay for the population of the day’s flights was 35 minutes (sample statistic in context being inferred to the population).

In Summary
To include ALL units, you are looking at:
• POPULATION
• CENSUS
• PARAMETERS
To work with a subset of all units, you are looking at:
• SAMPLE
• STATISTICS
• INFERENCES to a population
[Diagram: Population (Parameter) and Sample (Statistic)]

Example: Identifying Data Sets
In a recent survey, 1708 adults in the United States were asked if they think global warming is a problem that requires immediate government action. Nine hundred thirty-nine of the adults said yes. Describe the data set. Identify:
• The population:
• The sample:
• A variable being studied:
• Values of the variable:
Source: Adapted from Pew Research Center; Larson/Farber 4th ed.

Examples: Populations & Samples
• Smoking: Identify the population and the sample.
  – In a survey, 250 college students at Union College were asked if they smoked cigarettes regularly. Thirty-five of the students said yes.
• Student Income: Decide whether the numerical value describes a population parameter or a sample statistic.
  – A survey of 450 Cornell University students reported that their average weekly income from part-time employment was $325.
• For both of the above studies:
  – What are the units of the population/sample?
  – Identify a variable being studied.
  – Identify values of the variable.

Descriptive Statistics
• DESCRIPTIVE STATISTICS: Organize and summarize information using numerical and graphical methods.
  – Examples:
    • Summarizing the age of cars driven by students in a frequency table.
    • Graphing the ages of students.
    • Identifying the mean speed of cars driving in a 30 mph zone.
• A descriptive statement describes some aspect of the data. (Select a statistical measure and put it into sentence format.)
  – Examples:
    • Thirty-eight percent of the orange trees suffered damage due to the cold temperatures.
    • The average weight for the 23 cars studied was 2,738 lb.
    • The mean number of days Otsego Lake was frozen per winter was 88.69 days.

Descriptive Statistics at Work: SUNY Oneonta Car Registrations
Numeric tables, pictures (graphs & charts), and text are three methods used to present data. During 2006 there were 1,346 cars registered at SUNY Oneonta. Car registrations contain many variables, such as car type, car color, year of car, and license plate number. Noted below are ways descriptive statistics are used to convey information about the selected variables: a frequency table of Registrant Type (i.e. who registered the car); a graphic presentation of Vehicle Age; and text (a written descriptive statement) presenting the mean Vehicle Age of the registered cars.

Frequency Table: Registrant Type

| Registrant Type | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| Commuter | 512 | 38.0 | 38.0 | 38.0 |
| Faculty | 223 | 16.6 | 16.6 | 54.6 |
| Management | 13 | 1.0 | 1.0 | 55.6 |
| Other | 58 | 4.3 | 4.3 | 59.9 |
| Resident | 287 | 21.3 | 21.3 | 81.2 |
| Staff | 253 | 18.8 | 18.8 | 100.0 |
| Total | 1346 | 100.0 | 100.0 | |

Graphic presentation (here a Histogram): Vehicle Age

Mean & Median: The mean age of cars driven by students was 7.45 years (vs. 6.19 years for employees). The median age of registered vehicles for students was 7.0 years (5.0 years for employees).

Inferential Statistics
• INFERENTIAL STATISTICS uses sample data to make estimates, decisions, predictions, or other generalizations about the population.
  – The aim of inferential statistics is to make an inference about a population, based on a sample (as opposed to a population census), AND to provide a measure of precision for the method used to make the inference.
• An inferential statement uses data from a sample and applies it to a population.
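
To make the “measure of precision” idea concrete, here is a minimal sketch that turns the global warming survey counts quoted earlier (939 “yes” answers out of 1708 adults) into a sample proportion with a margin of error. It assumes a simple random sample and uses the common normal-approximation formula with z = 1.96 for roughly 95% confidence; the slides do not say how the pollsters computed their margins, and the function name `proportion_interval` is only illustrative.

```python
import math


def proportion_interval(successes, n, z=1.96):
    """Return (p_hat, margin, low, high) for a sample proportion.

    Assumes a simple random sample and the normal approximation;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    p_hat = successes / n                            # sample statistic
    std_error = math.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
    margin = z * std_error                           # margin of error
    return p_hat, margin, p_hat - margin, p_hat + margin


# Survey example from the slides: 939 of 1708 adults said yes.
p_hat, margin, low, high = proportion_interval(939, 1708)
print(f"sample proportion: {p_hat:.1%}")
print(f"margin of error:   +/- {margin:.1%}")
print(f"interval:          {low:.1%} to {high:.1%}")
```

The descriptive part is the sample proportion itself; the inferential part is the claim that the population proportion plausibly lies within the interval.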

Examples of Inferential Statistics:
• A Gallup Poll found that 57% of dating teens had been out with somebody of another race or ethnic group (+/- 4.5%; 95% CI).
  – Interpretation: We are 95% confident that between 52.5% and 61.5% (57% +/- 4.5%) of dating teens have been out with someone of a different race/ethnicity.
• A Gallup Poll found that 40% of Americans would quit their job if they won the lottery (+/- 4%; 95% CI).
  – Interpretation: We are 95% confident that the true population proportion of Americans who would quit their job if they were to win a lottery lies between 36% and 44%.

Example: Descriptive and Inferential Statistics
Decide which part of the study represents the descriptive branch of statistics. What conclusions might be drawn from the study using inferential statistics?
A large sample of men, aged 48, was studied for 18 years. For unmarried men, approximately 70% were alive at age 65. For married men, 90% were alive at age 65.
Source: The Journal of Family Issues; Larson/Farber 4th ed.

Characteristics of Data
Before conducting any data analysis, the characteristics of the variable under study must be identified. Doing so leads to the appropriate choice of tables, graphs and statistical analysis.

Two Types of Data
• Qualitative Data can be separated into different categories (values) that are distinguished by some nonnumeric characteristic. Qualitative data are also referred to as categorical or attribute data.
  – Examples include gender, eye color, and car brands.
  – Note that the values of this type of variable are differentiated by words rather than numeric values. Example: Eye Color values include blue, brown, hazel, etc.
• Quantitative Data are “number-based” and represent counts or measurements. This type of data may be subdivided into two categories:
• Discrete Data result when the number of possible values is either a finite or a countably infinite number.
  – Examples: Siblings, Cars, and Coins in a jar (think of whole-number counts here, even if you cannot count them all).
• Continuous Data result from infinitely many possible values corresponding to some continuous scale that covers a range of values without gaps, interruptions, or jumps. Continuous data can assume any value, including fractional parts.
  – Examples: Height, Weight, Time
N.B.: Qualitative data cannot be classified as discrete or continuous.

Example: Classifying Data by Type
The base prices of several vehicles are shown in the table. Which data are qualitative data and which are quantitative data? (Source: Ford Motor Company)
Source: Larson/Farber 4th ed.

4 Levels of Measurement
The level of measurement determines which statistical calculations are meaningful. The four levels of measurement, from lowest to highest, are: nominal, ordinal, interval, and ratio.

Levels of Measurement (cont.)
• Nominal: characterized by data that consist of names, labels, or categories only. The data cannot be arranged in an ordering scheme. Qualitative data.
  – Examples: Gender, Yes/No, Political Party affiliation, names of students.
• Ordinal: characterized by data that can be arranged in some order, but the differences between data values either cannot be determined or are meaningless. These variables may be either qualitative (categorical) data or quantitative (numerical) data.
  – Examples: Military Rank, Position in a race, Attitude scales.

Levels of Measurement (cont.)
• Interval: like the ordinal level, with the additional property that the difference between any two data values is meaningful. However, there is no natural zero starting point. Quantitative data.
  – Examples: Temperature (°F or °C); longitude; Calendar Years.
• Ratio: the interval level modified to include the natural zero starting point. At this level, differences and ratios are both meaningful. Quantitative data.
  – Examples: Height, Weight, Time, Age.
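
Why ratios are not meaningful at the interval level can be checked with a little arithmetic. The sketch below is illustrative only: the readings (80 and 40 degrees Fahrenheit, 10 and 5 kilograms) are made-up values, not data from the slides, and the kilogram-to-pound factor is approximate. It shows that a temperature “ratio” changes when the unit changes, because the zero point is arbitrary, while a weight ratio does not.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert a Fahrenheit reading to Celsius."""
    return (deg_f - 32) * 5 / 9


KG_TO_LB = 2.20462  # approximate conversion factor (assumption for illustration)


def ratio(a, b):
    """How many times larger a is than b."""
    return a / b


# Interval level: temperature has no natural zero (hypothetical readings).
temps_f = (80.0, 40.0)
temps_c = tuple(fahrenheit_to_celsius(t) for t in temps_f)
print("temperature ratio in Fahrenheit:", ratio(*temps_f))
print("temperature ratio in Celsius:   ", round(ratio(*temps_c), 2))

# Ratio level: weight has a true zero (hypothetical weights).
weights_kg = (10.0, 5.0)
weights_lb = tuple(w * KG_TO_LB for w in weights_kg)
print("weight ratio in kilograms:", ratio(*weights_kg))
print("weight ratio in pounds:   ", round(ratio(*weights_lb), 2))
```

The two temperature ratios disagree, so “twice as hot” has no fixed meaning, whereas “twice as heavy” holds in either unit.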

Summary of Levels of Measurement

| Level of measurement | Put data in categories | Arrange data in order | Subtract data values | Determine if one data value is a multiple of another |
|---|---|---|---|---|
| Nominal | Yes | No | No | No |
| Ordinal | Yes | Yes | No | No |
| Interval | Yes | Yes | Yes | No |
| Ratio | Yes | Yes | Yes | Yes |

Example: Classifying Data by Level
Two data sets are shown. Which data set consists of data at the nominal level? Which data set consists of data at the ordinal level? (Source: Nielsen Media Research)
Source: Larson/Farber 4th ed.

Example: Classifying Data by Level
Two data sets are shown. Which data set consists of data at the interval level? Which data set consists of data at the ratio level? (Source: Major League Baseball)
Source: Larson/Farber 4th ed.

Misuse of Statistics
Ah yes… the old “torture the data long enough and they will confess to anything” routine...
• Precise Numbers: Tonight’s paid attendance was 56,423.
• Guesstimates: It was estimated that one million spectators lined the road to L’Alpe d’Huez for the 16th stage of the 2004 Tour de France race.
• Distorted Percentages: New and improved with 50% more ...
  – 50% might not be a meaningful amount.
• Partial Pictures: Ford truck ads.
• Loaded Questions: Line item veto.
• Misleading Graphs: Visual distortions of data.
• Pictographs: The crescive cow.
• Pollster Pressure: Public bathrooms.
• Small/Bad Samples: 67% suspended.
• Self-Selected Surveys: CNN phone-in surveys.

Visual Presentations of Data – Beware
Pictograph: “This year my business profits doubled!”
Source: http://findarticles.com

Data Considerations
• Anecdotal Evidence: basing our conclusions on a few individual cases. E.g. we remember the airplane crash that kills several hundred people and fail to notice that data for all flights show that flying is much safer than driving.
• Lurking Variables: almost all relationships between two variables are influenced by other variables lurking in the background.

Airline Flights: Alaska Airlines vs. America West
Which would you choose to fly?

| Airline | On Time | Delayed |
|---|---|---|
| Alaska Airlines | 3274 (86.7%) | 501 (13.3%) |
| America West | 6438 (89.1%) | 787 (10.9%) |

Alaska Airlines vs. America West: A Closer Look

| Departure Location | Alaska Air On Time | Alaska Air Delayed | America West On Time | America West Delayed |
|---|---|---|---|---|
| Los Angeles | 497 | 62 | 694 | 117 |
| Phoenix | 221 | 12 | 4840 | 415 |
| San Diego | 212 | 20 | 383 | 65 |
| San Francisco | 503 | 102 | 320 | 129 |
| Seattle | 1841 | 305 | 201 | 61 |
| TOTAL | 3274 | 501 | 6438 | 787 |

We now know that America West has the better overall “On Time” record, but Alaska Airlines has the better “On Time” record at every airport. How can that be?

| Departure Location | Alaska Air On Time | Alaska Air Delayed | America West On Time | America West Delayed |
|---|---|---|---|---|
| Los Angeles | 497 (88.9%) | 62 (11.1%) | 694 (85.6%) | 117 (14.4%) |
| Phoenix | 221 (94.8%) | 12 (5.2%) | 4840 (92.1%) | 415 (7.9%) |
| San Diego | 212 (91.4%) | 20 (8.6%) | 383 (85.5%) | 65 (14.5%) |
| San Francisco | 503 (83.1%) | 102 (16.9%) | 320 (71.3%) | 129 (28.7%) |
| Seattle | 1841 (85.8%) | 305 (14.2%) | 201 (76.7%) | 61 (23.3%) |
| TOTAL | 3274 (86.7%) | 501 (13.3%) | 6438 (89.1%) | 787 (10.9%) |
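
One way to explore the airline puzzle is to recompute the rates from the counts and look at where each airline flies. The sketch below copies the on-time and delayed counts from the slides; the dictionary layout and the helper names `on_time_rate` and `overall_rate` are illustrative choices, not part of the original material.

```python
# (on_time, delayed) counts by departure location, copied from the slides.
flights = {
    "Alaska Airlines": {
        "Los Angeles": (497, 62),
        "Phoenix": (221, 12),
        "San Diego": (212, 20),
        "San Francisco": (503, 102),
        "Seattle": (1841, 305),
    },
    "America West": {
        "Los Angeles": (694, 117),
        "Phoenix": (4840, 415),
        "San Diego": (383, 65),
        "San Francisco": (320, 129),
        "Seattle": (201, 61),
    },
}


def on_time_rate(on_time, delayed):
    """Fraction of flights that were on time."""
    return on_time / (on_time + delayed)


def overall_rate(by_airport):
    """On-time fraction after pooling all departure locations."""
    total_on_time = sum(on for on, _ in by_airport.values())
    total_delayed = sum(late for _, late in by_airport.values())
    return on_time_rate(total_on_time, total_delayed)


for airline, by_airport in flights.items():
    total_flights = sum(on + late for on, late in by_airport.values())
    print(f"{airline}: overall on time {overall_rate(by_airport):.1%}")
    for airport, (on, late) in by_airport.items():
        share = (on + late) / total_flights  # how much this airport weighs in the total
        print(f"  {airport:<14} on time {on_time_rate(on, late):.1%}, "
              f"share of flights {share:.1%}")
```

The shares suggest the answer: roughly 73% of America West's flights leave from Phoenix, where both airlines do well, while roughly 57% of Alaska's leave from Seattle, where delays are more common. Departure location acts as the lurking variable, so the pooled totals reverse the airport-by-airport comparison.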