Chapter 3 Statistics

Deterministic vs Stochastic

Deterministic systems are ones in which the same input produces the same output each time. For example, the same inputs for the length and width of a rectangle produce the same area. Investments in fixed rate accounts produce the same yield.

Stochastic systems are ones in which the same input can produce different outputs each time. For example, medicines have different effects on different people. Disciplinary strategies have different effects on different children.

Example: The economic system is very complex. To make a good stock and flow model for the economy requires knowing a lot of details. One such detail is the effect of changes in the minimum wage. If the Federal government raises the minimum wage, will employment go down because employers can’t afford to pay the extra cost, will employment go up because more people will have more money to spend, leading to more jobs, or will something else happen?

Circle your choice: Down Up Other
Percent Confidence Your Choice is Correct _____________

How do we find evidence to prove who is correct? Can we trust our intuition? Is there a formula we can use? Is the answer going to be the same in all circumstances? Is there one perfect answer?

Evidence-Based decision making: Because of the complexity of stochastic systems, it would be beneficial if leaders made use of evidence, instead of ideology, when making decisions.

Below is a graph of the changes in employment following increases in minimum wage. What evidence does this graph provide?

Because of the variation in stochastic systems, it is better to think about the distribution of possible outcomes when taking an action than to expect one particular outcome every time. Therefore, in this example, the arguments on both sides of the issue should be that sometimes employment increases, sometimes it goes down, and sometimes it stays approximately the same. That is the distribution of possible outcomes.
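To make the contrast between deterministic and stochastic systems concrete, here is a minimal Python sketch. The rectangle area comes from the example above. The die roll is only an illustrative stand-in for a stochastic process (an assumption made for this sketch, not a model of the minimum wage), and the function names are hypothetical.

```python
import random
from collections import Counter


def rectangle_area(length, width):
    """Deterministic: the same inputs always give the same output."""
    return length * width


def roll_die(rng):
    """Stochastic: repeating the same action can give different outputs."""
    return rng.randint(1, 6)


def outcome_distribution(n_rolls, seed=0):
    """Return the share of rolls that landed on each face, 1 through 6."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    counts = Counter(roll_die(rng) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}


if __name__ == "__main__":
    # Same inputs, same area, every time.
    print(rectangle_area(3, 4), rectangle_area(3, 4))

    # Same action repeated many times gives a distribution of outcomes.
    for face, share in outcome_distribution(6000).items():
        print(face, round(share, 3))
```

The deterministic function is summarized by a single number, while the stochastic process is summarized by the share of times each outcome occurred, which is the distribution of possible outcomes.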
The challenge that is typically faced when trying to understand a stochastic system is that the system is complex, with many interactions, and that we can almost never get complete information. Therefore we must function with only partial evidence. The field of mathematics that is designed to help us use partial information to answer big questions is Statistics.

1. Ask specific questions
2. Design research
3. Randomly select from the population
4. Make graphs and produce summaries of the data
5. Explore the probabilities of different possible outcomes

Before looking at the details of statistical analyses, we need to understand the concept of probability because the evidence we do get is subject to chance.

Probability (Formula 3.1)
What values can it take?
Examples: Coins, dice

Probability Distributions – start with a possibility distribution, then determine or estimate the probability of each possibility.

Example: What is the distribution of outcomes at the end of the quarter for a randomly selected student?

Possibility Distribution    Probability
A. Pass (2.0+)              53/67 = 0.791
B. [1,2)                    6/67 = 0.090
C. Fail (0)                 6/67 = 0.090
D. Withdraw                 2/67 = 0.030

The probabilities are based on prior classes, since that is the best available evidence.

Make your own probabilities for yourself. Base them on how well you are currently doing, your math abilities, effort, attendance, normal grades, and motivation.

Complete Activity 3.1 Possibility Distributions and Probability Distribution on pages 50 and 51.

How does thinking about probability distributions affect your thinking?

Key terms:
Population – all times minimum wage is raised
Census – data from all times minimum wage is raised
Sample – some of the times minimum wage was raised

Example: If we wanted to know the effect of state funding for music programs in K-12, what is the population?
Population: All students who could learn music
Census: Getting data from every person in the population.
Sample: Students from some schools or states with programs.

If we want to know the effect of the Mediterranean diet on weight, what is the population?
Population: All people using the Mediterranean diet to lose weight
Census: Getting data from every person in the population.
Sample: Finding the results from some people using the diet

For each group of students, think of one question that relates to your project, then define the population, census, and sample. Record it in your notes for later use.
Population:
Census:
Sample:

You will now have the opportunity to take a sample of data. This data will be used to practice statistical techniques that you will learn during the chapter.
Experiential 3a or 3b Page 125/127

In order for statistics to provide useful understanding, it is necessary to ask the right questions.

Asking the right questions

Is state funding of school music programs a good use of money? This is a general question that is vague enough to be argued both ways. To get the details, we need more specific questions.
Are students in school music programs more likely to attend college than students not in music programs?
Are students in school music programs less likely to get into trouble than students not in music programs?
Do schools with music programs have a higher graduation rate than schools without music programs?
Is the GPA of music students different than that of athletes?

Is the Mediterranean Diet the best diet?
Do people lose more weight on it in the first month?
Do people keep weight off longer?
Are cholesterol levels better?
Are blood pressures better?
Is eating that diet sustainable, i.e., does it become part of a lifelong way of eating?

Different specific questions yield different interpretations of the general questions.

Ask specific questions for your group’s problem.

Data is the evidence. Answers to specific questions are data. They are the evidence that shows the variety of responses to the questions.
There are two types of data that can be collected: categorical and quantitative.

Are students in school music programs more likely to attend college than students not in music programs?
The data is: attend college or don’t attend college. This is categorical.

Is the GPA of music students different than that of athletes?
The data is: GPA. This is quantitative.

What data would be collected to answer the following questions? Is the data categorical or quantitative?
1. Should the United States begin a transition from gas powered cars to electric cars?
2. Has the average number of passenger miles using mass transit increased faster than the population?

What kind of data will you get for your group’s question?

Observational Studies and Experiments

Research is often done because we want to either understand the characteristics of a population or see the effect of an action on a population. In the former case, we conduct an observational study, while in the latter case we conduct an experiment.

Observational study – researcher collects data
Compare data from schools with music programs and schools without.
Compare weight changes from people who used the Mediterranean diet and those who used other diets.

Experiment – to test the impact of something, a comparison against a control group is needed.
Randomly allocate funding to different schools, then compare the students with music to those without.
Randomly assign people to different diets, then compare the effect.

Would an observational study or experiment be appropriate for your question?

We will now begin a multiday analysis of the electricity production system because of its impact on the climate system.
IV Generation Nuclear - Bill Gates
Electrical Power - Energy Justice Map
Theoretical I, Page 117

Suppose the goal for the United States is to replace all the coal power plants with something that does not produce carbon emissions. What do we need to know? Would we conduct an experiment or study?
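The random assignment step described for experiments above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `randomly_assign` and the participant labels are hypothetical, and a real study might use more than two groups or unequal group sizes.

```python
import random


def randomly_assign(units, seed=None):
    """Shuffle the units, then split them into two groups of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # First half becomes the treatment group, the rest the control group.
    return shuffled[:half], shuffled[half:]


if __name__ == "__main__":
    # Hypothetical participant labels, used only for illustration.
    participants = [f"person_{i}" for i in range(1, 11)]
    treatment, control = randomly_assign(participants, seed=1)
    print("treatment:", treatment)
    print("control:  ", control)
```

Because chance alone decides who lands in each group, differences between the groups at the start are not caused by the researcher’s choices.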
Sampling

The results from studies and experiments are worthless if sampling is inadequate.

Random sampling from the population.

Simple random sample

Row Number
 1   83984 22116 01657 83717 24799 00515 37723 23445 02705 26127
 2   78425 65082 07792 43850 22134 76033 87273 13972 58089 12538
 3   96268 62423 63347 09111 12079 58082 88984 76565 62765 35923
 4   58037 43470 88497 98909 79230 36845 30325 82655 48666 55431
 5   52354 04992 47754 31246 36779 27029 88187 19275 89632 21684
 6   65936 11549 15979 92704 42288 07121 54938 08990 00190 81402
 7   01849 40765 97487 56378 80291 40351 95246 58004 56115 53197
 8   94368 20871 13867 61232 87091 67621 27560 81197 63987 01118
 9   24504 75557 58840 99065 49850 55957 14117 62890 24961 54550
10   13283 33042 69362 92759 81354 76328 76438 29699 86996 65089

Sampling with replacement

Stratified sampling – e.g., for music programs, we could sample from various socioeconomic areas. For Mediterranean diets, we could sample from various backgrounds (past exercise and eating habits, gender, age).

How would you sample for your group’s question?

Randomly Select States on Page 129 for Theoretical I

Graphs and Statistics Overview

For data to be useful as evidence in decision making, it is necessary to organize and summarize it. This is accomplished using graphs and statistics. The graphs and statistics used for categorical data are different than those used for quantitative data. Statistics are numerical summaries of the sample data. Common statistics include proportion, mean, and median. Below is a brief overview of the graphs and statistics.

Categorical Data

For categorical data, the two most commonly used graphs are bar graphs and pie charts. The two most commonly used statistics are counts and proportions.

A bar graph is used when there are separate categories and a count or other measurement is recorded for each category. The bar graph below shows the number of calories burned with one hour of each of the activities.
A pie chart shows the proportion of each category out of a whole. According to the Pew Research Center, in 2013, 56% of all American adults had a smartphone.

Statistics for categorical data can be shown as either counts or proportions, although proportions are more common. For example, if a survey of students in a class showed that 23 out of 35 had a good, home cooked meal last night, this result could be reported as a count (23) or as a proportion, which is identified by the variable $\hat{p}$, where $\hat{p} = \frac{x}{n} = \frac{23}{35} \approx 0.657$.

Make a pie chart based on classroom data

Quantitative Data

When there is only one quantitative variable, the graphs that are normally used are histograms and box plots. If there are two quantitative variables, scatter plots are used.

A histogram is shown below of the 2013 attendance at randomly selected National Parks, which includes national monuments and other historical sites (https://irma.nps.gov/Stats/Reports/Park). The x axis shows the range of values for each bar. The y axis, which is labeled No of obs (number of observations), shows how many parks had attendance within each range. Thus in this sample, there were 14 parks which had between 0 and 500,000 visitors. There was one park with between 3 and 3.5 million visitors (Yellowstone).

Make a histogram for the QAW scores, Spring 2016
3.2   2.89  1.37  3.19  1.78  2.9   2.1   3.46  1.64  2.65
3.22  3.21  1.85  3.07  0.78  2.75  3.03  2.73  1.89  2.98

The statistics that are used for quantitative data include the mean, median, and standard deviation. The mean and median are two ways to represent the center of the data.

The mean shows the balance point of a histogram. It can be influenced by one or several extreme values. The mean of the sample is found with the formula $\bar{x} = \frac{\sum x}{n}$. This formula indicates that all the data values should be added and then the total is divided by the number of data values.

The median has an equal number of values above and below it and is not affected by extreme values.
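As a worked check of the mean and median, the sketch below applies both to the QAW scores listed above. It is a minimal Python sketch: the helper names are hypothetical, and the extra value of 40.0 is a made-up extreme value (not real data) added only to show how each statistic reacts.

```python
import statistics

# QAW scores, Spring 2016, copied from the list above.
qaw_scores = [
    3.2, 2.89, 1.37, 3.19, 1.78, 2.9, 2.1, 3.46, 1.64, 2.65,
    3.22, 3.21, 1.85, 3.07, 0.78, 2.75, 3.03, 2.73, 1.89, 2.98,
]


def sample_mean(values):
    """Add all the data values, then divide by the number of values."""
    return sum(values) / len(values)


def sample_median(values):
    """Middle of the sorted data (average of the two middle values if n is even)."""
    return statistics.median(values)


if __name__ == "__main__":
    print("mean:  ", round(sample_mean(qaw_scores), 4))
    print("median:", round(sample_median(qaw_scores), 2))

    # Hypothetical extreme value, added only to illustrate sensitivity.
    with_extreme = qaw_scores + [40.0]
    print("mean with extreme value:  ", round(sample_mean(with_extreme), 4))
    print("median with extreme value:", round(sample_median(with_extreme), 2))
```

For the 20 scores, the mean is 50.69 / 20 = 2.5345 and the median is the average of the 10th and 11th sorted values, (2.75 + 2.89) / 2 = 2.82. Adding the made-up extreme value pulls the mean up considerably, while the median only moves to the next sorted value.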
The standard deviation is one way to express the variation that exists in the data set. It can be interpreted as the approximate average distance each point is from the mean. The formula for the standard deviation of a sample is $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$. Larger standard deviation values indicate more variation.

Inference – parameters and statistics

                     Statistic (from a sample)   Parameter (from a population)
Proportion           $\hat{p}$                   $p$
Mean                 $\bar{x}$                   $\mu$ (mu)
Standard Deviation   $s$                         $\sigma$ (lower case sigma)

Explain the reason for inference – want to know the parameter to make the best decision but can only know the statistic. Statistics vary. Consider doing the marble sampling from stats.

There are two types of inference: hypothesis tests and confidence intervals. Hypothesis tests are done when a researcher has a theory about the parameter and wants to test if the theory is reasonable. Confidence intervals are done when there is no theory, but only an estimate is desired.

Probability

Because random sampling produces only one of the many statistics that make up a sampling distribution, it is only by chance that a particular sample was selected.

Probability is the proportion of times an outcome will occur over the long run. The emphasis on the long run is very important.

We view probability as a fraction: $P(x) = \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}}$. Assume all possible ways are equally likely, as is the case when random sampling is done with replacement.

Probability is always a numerical value between 0 and 1. This can be shown as $0 \le P(x) \le 1$. The probability is 0 if the event cannot occur. The probability is 1 if the event is a sure thing – it occurs every time.

Random processes – coin flips, marbles, random selection
Sample Space
Complements

Normal Curve
Fits neatly over binomial distribution and central limit theorem distribution.
Area corresponds to probability
68-95-99.7 rule

[Figure: sampling distributions of $\hat{p}$ and $\bar{x}$]

                                      Sample proportions                              Sample means
Mean                                  $\mu_{\hat{p}} = p$                             $\mu_{\bar{x}} = \mu$
Standard Deviation (standard error)   $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$    $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

From the empirical rule, it is known that about 95% of all statistics are within 2 standard deviations (standard errors) of the mean. This means that about 95% of all sample proportions are within 2 standard deviations of the proportion of the population. Likewise, about 95% of all sample means are within 2 standard deviations of the mean of the population.

If we define an event as any outcome within 2 standard errors of the mean, then the probability of that event is 0.95. The probability that an outcome is not within 2 standard errors is the complement. This is found by subtracting 0.95 from 1. Thus the probability of the complement is 0.05.

Confidence Intervals

Reason through the process of creating confidence intervals.

The estimated standard error for proportions is $s_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$. The estimated standard error for means is $s_{\bar{x}} = \frac{s}{\sqrt{n}}$. Approximately 2 of these standard errors are added to and subtracted from the point estimate to produce the confidence intervals. Simplified versions of the two confidence interval formulas are $\hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ and $\bar{x} \pm 2\frac{s}{\sqrt{n}}$. The term after the plus or minus sign is the margin of error.

Limits of Statistics

The branch of mathematics called statistics is a collection of wonderful mathematical tools that help us understand our world. The more rigorous the application of statistics, the better the results we can obtain. But there are limits to statistics as well, and these limits should be considered as you evaluate the results of someone’s research. Following are some of the questions you should ask about any research.

1. Was the sample size sufficiently large? There is considerable variation among small samples, thus results are more easily contradicted with further research.
2. Has the experiment or study been replicated and produced similar results?
3. What sorts of bias could have affected the results? A few of the many possible sources include:
   a. Poor survey questions, or questions in which none of the responses seem to be appropriate options.
   b. Participant awareness of being part of an experiment that influences outcomes.
   c. External validity – do the results of an experiment have legitimate broader implications?
   d. Were participants randomly selected? Are all the subjects college undergraduates?
4. Does the researcher, or reporter of the research, provide enough information to evaluate the conclusions? For instance, it can be frustrating to be told a result, but not given any information on standard deviations, sample size, or p-value.

Keep in mind that we live in a complex world full of interactive systems. Statistics help us understand a reductionist view of the world. As such, the insights we gain can then be incorporated into a systems view, enhancing our ability to model the world.