Math 125 Statistics 1 1. 300 is what % of 2,000? 2. 100,000 families -> 0.1 of 1% of these have income greater than 75K. How many families is that? 3. There are 100 millions eligible voters in the US. The Gallup poll interviews 5,000 of them. This is equivalent to 1 out of every … ? 4. In the US (hypothetically), 1 in 500 is in the Army and 3 in 10,000 are officers. What % of Army personnel are officers? 5. 1 out of 1500 people is a Marine and 1/10 of Marines are officers. What % of population are officers in the Marine? 6. Without calculator: Sqrt(100,000) is closest to: 30, 100, 300, 1000? 7. 8. Is Sqrt(0.5) < 0.5 ? Is Sqrt(2) < 2? 9. Solve for x & y x + 3y = 1 2x + y = -3 10. 11. A quart of Vodka is 40% alcohol. A drink is made of OJ and Vodka (V quarts of Vodka for J quarts of OJ). What is the % alcohol in the drink? (formula function of V and J) Tom's mother is 3 times as old as he is. Next year, their ages will add up to 50. How old is Tom? 12. Probability: Throw one die. What is the chance of getting a 1? Throw two dice. What is the chance of getting two 1's? 13. Consider two situations: A) a Coin is tossed 100 times. If it comes up heads 60 times or more you win. B) It is tossed 1,000 times. If it comes up heads 600 times or more, you win. Which is better A or B? Statistics: Science of assembling, organizing, and analyzing data ▪ First gather data (available or to be sampled) ▪ Then analyze the data (graphs, numerical analysis) Probability theory: ▪ Distributions ▪ Making predictions: Inference ▪ Assessing prediction accuracy for decision making: Hypothesis testing Data Analysis The goal of statistics is to gain information from the data. First: Data are collected. Data come from several sources: 1. Available data: Census data, Federal agencies, Governmental Statistical Offices. • Example: Census Bureau, Triola textbook Datasets, DePaul QRC • Many more databases are available on the Internet! 2. New Data: • Sampling from population of interest: Observational studies • Conducting statistical experiments: medical trials, controlled experiments. When well designed, provide most reliable source of information!! Next step after data collection? Long listings of data are of little value. Statistical methods come to help us. Data analysis: methods to display and summarize the data. In this course we will deal with data on one variable at a time. The distribution of the observations is analyzed by 1) Displaying the data in a graph that shows overall patterns and unusual observations (histogram, box plot, density curve…) 2) Computing number statistics that summarize specific aspects of the data (center and spread). In election times, politicians want to know the percent of people that will support them. During the elections, news and political parties want to know the percent of votes even before the polls are closed. Market research firms collects data to study the efficacy of an advertisement, or the preference of the public towards certain brands or products (ex. Nielsen). A new Web company wants to know the average age, interests or educational level of their “visitors”. WinZip computing, Inc. wants to know the average number of people who download the freeware version of WinZip and then buy the software. Common to all these cases: Investigators want to gather information about a large group of individuals. This is called the population. Problem: often not feasible because of 1) time constraints, 2) costs, 3) inconvenience Solution: Select a part of the population and collect data from this part. Such a subset is called a sample. Data collected in the sample are analyzed for drawing conclusions about the entire population. In each study, generally there are some numerical facts about the population that we want to know about. Examples: 1) at exit polls, two possible quantities are • the percentage of voters for a certain party; • the percentage of registered voters that have cast their vote. 2) In Internet surveys: • average number of visitors to a web site; • average length of time of each visit; • average number of sites visited before purchasing a product online; • other examples? These quantities are called parameters. Parameters are estimated by statistics, the numbers calculated from the sample. NEW YORK, NY – October 11, 2001 A recent study in September 2001 from Nielsen / NetRatings reveals that nearly 56% of the office workers consumed streaming media (live video broadcasts and instant coverage of the news) in September. “More than 21 million office workers streamed the web media in September” “…growing market for Real Networks, Windows Media and Quick Time”. Did they interview 21 million office workers? How did they get this result? Parameter: a numerical measurement describing some characteristic of a population. Statistic: a numerical measurement describing some characteristic of a sample from the population. 11 Reliable Data Since you estimate the parameters from the sample, the sample must represent the population! Too often results from surveys are not correct, do you still trust the exit polls? This is because the sample is not representative of the population!! Selecting a “good” sample Draw a representative sample by selecting the individuals at random, like placing the names in a hat and drawing out a handful (the sample)! A simple random sample of size n is a sample of n individuals from the chosen population. Each individual in the sample has the same chance to be selected. So any sample is chosen using an unbiased impartial method! If the population is very large, a multistage cluster sampling is often used. The population is divided into classes or clusters and a simple random sample is drawn from randomly selected clusters. Example: household surveys, Nielsen TV preference data, USA territory is divided into states, towns, wards. Section 1-2 Types of Data 14 Definitions Quantitative data: numbers representing counts or measurements. Examples: Weights of athletes, income, gallons of gas pumped, etc. Qualitative (or categorical) (or attribute) data: can be separated into different categories that are distinguished by some nonnumeric characteristic Example: Sex (male/female), Religious affiliation, “Likes pizza” or not, etc. 15 Working with Quantitative Data Quantitative data can further be described by distinguishing between: • discrete and • continuous. 16 Definition Discrete data number of possible values is a naturally ‘countable’ number (i.e. : 0, 1, 2, 3, . . .) Example: The number of eggs that a hen lays. 17 Definition Continuous data infinitely many possible values with continuous scale coevring a range of values without gaps, interruptions, or jumps. Example: The amount of milk that a cow produces; (example: 2.343115 gallons per day) 18 Section 1-3 Critical Thinking 19 Success in the introductory statistics course typically requires more common sense than mathematical expertise! This section is designed to illustrate how common sense is used when we think critically about data and statistics. 20 Misuse # 1- Bad Sample Such as: Voluntary response sample (or self-selected sample) one in which the respondents themselves decide whether to be included. In this case, valid conclusions can be made only about the specific group of people who agree to participate. Misuse # 2- Sample too Small Example: Basing a school suspension rate on a sample of only three students. Misuse # 3- Inadequate Graphs You should analyze the numerical information not just the shape! 21 Misuse # 4- Exaggerated Pictographs Wrong to multiply ALL the dimensions! Misuse # 5- Abusing Percentages Misleading or unclear percentages are sometimes used. For example, if you take 100% of a quantity, you take it all. 110% of an effort does not make sense. 22 Other Misuses of Statistics Loaded Questions Order of Questions Refusals Correlation & Causality Self Interest Study Precise Numbers Partial Pictures Deliberate Distortions 23 Section 1-4 Design of Experiments 24 Key Concept If sample data are not collected in an appropriate way, the data may be so completely useless that no amount of statistical analysis can fix them. Definitions: 1- Observational studies: observations or measurements are made, without interference with the subjects (interference can modify the results). 2- Controlled Experiments: a treatment is applied to the subjects (experimental units). The effects are then measured and analyzed. 25 Data can be collected in different ways: Cross sectional study data are observed, measured, and collected at one point in time Retrospective (or case control) study data are collected from the past by going back in time Prospective (or longitudinal or cohort) study data are collected in the future from groups (called cohorts) sharing common factors 26 A possible problem in collecting data: Confounding factor: occurs in an experiment when the experimenter is not able to distinguish between the effects of different factors. In an observational study, we do not assign subjects to the treatment and the control groups. We can only observe subjects with a given condition (treatment group) and subjects without it (control group). The problem is that often the treatment/condition is badly biased by confounding factors, that influenced the outcome of the study. Hidden confounding factors = unobserved variables that affect the responses being studied. Example: Studies on smoking are observational, they compare smokers to non-smokers. Even though they show evident association between smoking and several diseases (lung cancer, coronary disease), they do not completely reject the possibility that a confounding unobserved variable might be the actual cause of the observed association. For instance a genetic factor that makes people more likely to smoke and contract the disease. 27 In an observational study, a confounding factor can sometimes be controlled. Block design: Divide the population into groups of similar individuals called blocks. Randomly assign units to treatments for each block. Example: Effect of a new treatment against hypertension. The blocks are: age, sex and smoking habits General population <50 years old Men >=50 years old Women smokers smokers Not smokers Men smokers Not smokers Women smokers Not smokers Not smokers More ways to Control the Effects of Variables: In addition to Forming Blocks: grouping subjects with similar characteristics, also: Blinding: subject does not know he or she is receiving a treatment or placebo Double Blinding: subject and experimenter do not know who is receiving a treatment or placebo. (ex FDA etc…) Completely Randomized Experimental Design: subjects are put into blocks through a process of random selection. 29 More definitions (data collection terms): Replication: repetition of an experiment when there are enough subjects to recognize the differences from different treatments. Sample Size use a sample size that is large enough to see the true nature of any effects and obtain that sample using an appropriate method, such as one based on randomness. Random Sample: members of the population are selected in such a way that each individual member has an equal chance of being selected. Simple Random Sample (of size n): subjects selected in such a way that every possible sample of the same size n has the same chance of being chosen. 30 Methods of Sampling: Random Sampling selection so that each individual member has an equal chance of being selected. Stratified Sampling subdivide the population into at least two different subgroups that share the same characteristics, then draw a sample from each subgroup (or stratum) Systematic Sampling Select some starting point and then select every k th element in the population. Convenience Sampling use results that are easy to get. Cluster Sampling divide the population into sections (or clusters); randomly select some of those clusters; choose all members from selected clusters 31 Methods of Sampling - Summary Random Systematic Convenience Stratified Cluster 32 How good are the data? Sampling error: = difference between a sample result and the true population result; Cause: chance fluctuations in the sample. Nonsampling error: or systematic error. Causes: data incorrectly collected (ex: a biased sample) data incorrectly recorded (ex: defective instrument) data incorrectly analyzed (ex: wrong calculation, error in formula) The Statistic will differ from the Parameter according to: statistic value)= parameter + bias + chance error ▪ ▪ Chance error = sampling error Bias = systematic error (depends on the way the sample was collected) 33