Statistics-MAT 150
Chapter 1: Introduction to Statistics
Prof. Felix Apfaltrer
fapfaltrer@bmcc.cuny.edu
Office: N518
Phone: x7421

Chapter 1
• Overview
• Nature of data
• Skills needed in statistics

Overview
Statistics:
• Descriptive: analyze the nature of data from surveys, experiments, and observations.
• Inferential: draw conclusions from the analyses with respect to the population.
Survey: a tool to collect data from a smaller group that is part of a larger group, in order to learn something about the larger group.
Key goal of statistics:
• Learn about a large group (population) from data from a smaller subgroup (sample).

Overview
Definitions:
• Data: observations collected (measurements, gender, answers, …)
• Statistics: collection of methods to analyze data
• Population: complete collection of elements (scores, measurements, subjects, …)
• Sample: subcollection of members selected from the population
• Census: collection of data from every member of the population

Overview 2
Example:
• Poll: 1087 adults are asked whether they drink alcoholic beverages or not.
– Sample: the 1087 adults polled
– Population: all US adults (150 million)
• Census: every 10 years, the Census Bureau tries to collect information from every member of the US population.
– Impossible!
– Very expensive!
• Use sample data to draw conclusions about the whole population: inferential statistics!

Types of data
Parameter:
• A numerical measurement describing some characteristic of the population.
• Lincoln was elected with 39.82% of the 1,865,908 votes counted.
– 39.82% is a parameter: it is computed from all the votes, not from a sample.
Statistic:
• A numerical measurement describing some characteristic of the sample.
• Based on a sample of 877 elected executives, 45% would not hire an applicant with a typographical error in the application.
– 45% is a statistic.

Types of data 2
Quantitative data: numbers representing counts or measurements.
• Weights of supermodels.
Qualitative data: nonnumerical.
• Gender of an athlete.
Discrete vs. continuous data:
• Number of people in a household (discrete) vs. temperatures in May (continuous).
Nominal level of measurement: names, labels, categories; no ordering.
• Yes/No/Undecided responses, colors.
Ordinal level of measurement: the data can be ordered, but differences between values are meaningless or nonexistent.
• Course grades A, B, C, D, F. “Livability rank of a city”.
Interval level of measurement: ordered with meaningful differences, but there is no zero, or the zero is arbitrary.
• Temperature, year.
Ratio level of measurement: like the interval level, but with a meaningful zero.
• Weights, prices (non-negative).

Basic skills
Samples:
• Representative:
– “39/40 polled people vote for A.” Sampled in A’s headquarters!
• Not too small:
– CDF published “among HS students suspended, 67% suspended more than 3 times.” Sample size: 3!
Graphs: In which one does red do better?
[Figure: two bar charts titled “Median Weekly Income (16-24)” comparing Men and Women; one vertical axis runs from $0 to $400 in $50 steps, the other from $300 to $390 in $10 steps.]
Percentage of:
• 6% of 1200 = 6 / 100 * 1200 = 72
Fraction >>> percentage:
• 3/4 = 0.75 >>> 0.75 * 100% = 75%
Percentage >>> decimal:
• 27.3% = 27.3/100 = 0.273
Decimal >>> percentage:
• 0.852 >>> 0.852 * 100% = 85.2%
Calculator:

Basic skills 2

Design
Observational study: observe and measure characteristics without trying to modify the subjects.
• Gallup poll.
• Cross-sectional: data are observed and measured at one point in time.
• Retrospective: data are collected from the past (records).
• Prospective: data are collected along the way from groups (smokers/nonsmokers).
Experiment: apply a treatment, then observe and measure its effects.
• Clinical trial for Lipitor.
• Control: blinding (placebo, double-blinding), blocks
• Replication: ability to repeat the experiment
• Randomization: data need to be collected in an appropriate (random) way; otherwise they are completely useless!
– Random sample: members of the population are selected so that each individual member has the same chance of being selected.
– Simple random sample of size n: every possible sample of size n has the same chance of being chosen.
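The distinction between a parameter and a statistic, and the definition of a simple random sample, can be illustrated with a short Python sketch. The population here is entirely hypothetical (1000 made-up yes/no answers, 300 of them "yes"), and the function names `percent_yes` and `simple_random_sample` are illustrative only. The standard-library call `random.Random.sample` draws without replacement so that every subset of size n is equally likely.

```python
import random

def percent_yes(answers):
    """Fraction of 'yes' answers converted to a percentage (count / total * 100)."""
    return 100 * sum(1 for a in answers if a == "yes") / len(answers)

def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement; every sample of size n is equally likely."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return rng.sample(list(population), n)

# Hypothetical population: 1000 answers, 300 of them "yes" (assumed data).
population = ["yes"] * 300 + ["no"] * 700

# Parameter: computed from every member of the population.
parameter = percent_yes(population)

# Statistic: computed from a simple random sample of size 50.
sample = simple_random_sample(population, 50, seed=1)
statistic = percent_yes(sample)

print(f"parameter = {parameter}%")
print(f"statistic = {statistic}%")
```

The statistic will usually differ somewhat from the parameter, and changing the seed or the sample size shows how much it fluctuates from one sample to another.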

Design 2
Sampling:
• Systematic: select a starting point and then choose every kth member.
• Convenience: use easy-to-get data.
• Stratified: subdivide the population into at least 2 subgroups (strata) that share a common characteristic (e.g. gender or age) and draw a sample from each.
• Cluster: divide the population into areas (clusters) and draw samples from the clusters.
Sampling error: the difference between a sample result and the true population result; it results from chance sample fluctuations.
Nonsampling error: occurs when data are incorrectly collected, measured, recorded, or analyzed.
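The sampling methods listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the population is a hypothetical list of member IDs 1 to 60, the strata are an arbitrary odd/even split, and the cluster variant shown (keep every member of a few randomly chosen clusters) is one common form of cluster sampling, not necessarily the only one intended by the slide. All function names are made up for this sketch.

```python
import random

def systematic_sample(population, k, rng):
    """Pick a random starting point among the first k members, then take every kth member."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Group members by stratum_of(member), then draw a random sample from each group."""
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    return {name: rng.sample(members, n_per_stratum) for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every member of the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(0)           # seeded only so the sketch is reproducible
population = list(range(1, 61))  # hypothetical member IDs 1..60

# Systematic: every 10th member after a random start.
systematic = systematic_sample(population, k=10, rng=rng)

# Stratified: 3 members from each of two assumed strata (odd and even IDs).
stratified = stratified_sample(
    population, lambda m: "even" if m % 2 == 0 else "odd", 3, rng
)

# Cluster: six assumed areas of 10 members each; 2 areas are selected.
clusters = {f"area{i}": population[i * 10:(i + 1) * 10] for i in range(6)}
clustered = cluster_sample(clusters, n_clusters=2, rng=rng)

print(systematic)
print(stratified)
print(clustered)
```

Comparing a result computed from any of these samples with the same result computed from the full population gives the sampling error for that particular draw.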