MAT 526 Day 1

Probability is the branch of mathematics that deals with randomness and uncertainty. It may be the most applicable of the mathematical disciplines.

“It is unlike other branches of math because they may not have Probability on Mars.”

“Statistics … the most important science in the whole world: for upon it depends the practical application of every other science and of every art; the one science essential to all political and social administration, all education, all organization based upon experience, for it only gives the result of our experience.”
Florence Nightingale (1820 – 1910)

Statistics: the science that deals with the collection, description, analysis and interpretation of data. (Mugno, 1997)

When most people hear "statistics" they think of descriptive statistics: numbers or graphics that summarize a data set. Examples include batting average, median income, disease prevalence, etc. Descriptive statistics are important but usually pretty simple.

Inferential statistics use sample data to make estimates, decisions and predictions about a larger data set or population. This is why statistics is so important.

Four major themes to keep in mind throughout the course:

Design: How the data are collected is extremely important and will greatly affect your analysis and interpretation. Designed experiments, surveys, polls, etc.
Description: Can be very important in how your results are perceived by the reader. Bad practices can cause very misleading results. Descriptive statistics, graphs, tables, etc.
Analysis: It is very important to use a proper methodology and to know which descriptive statistics are appropriate. Weighted, biased, etc.
Interpretation: Making inferences or decisions based on design, descriptive statistics and analysis. This is what makes statistics so important and powerful, especially in today’s data-driven society.

Ex.
Governor’s Race

In the governor’s race, Foley leads 89 – 7 percent among Republicans while Malloy leads 88 – 9 percent among Democrats. Independent voters shift from 50 – 41 percent for Malloy last week to 55 – 33 percent for Foley today. Another 6 percent of likely voters are undecided and 11 percent of those who name a candidate say they might change their mind.

Men back Foley 51 – 43 percent, while women back Malloy 48 – 43 percent.

Malloy gets a split 44 – 41 percent favorability from Connecticut likely voters, compared to 47 – 34 percent last week. Foley’s 48 – 34 percent favorability compares to 45 – 33 percent last week.

“The late deciders are breaking for Tom Foley. There has been a big shift among independents in the final week of the campaign toward the Republican,” Dr. Schwartz said.

“Dan Malloy’s unfavorables have risen to the point where he gets a mixed favorability rating for the first time.

“For Foley to win, he needs to win the independent vote by a substantial margin, which he is now doing for the first time. But this race is too close to call. With 6 percent still undecided there is still room for movement.

“Foley has the numerical lead and the momentum but Malloy still could pull this out.”

From October 25 – 31, Quinnipiac University surveyed 930 Connecticut likely voters with a margin of error of +/- 3.2 percentage points.

The Quinnipiac University Poll conducts public opinion surveys in New York, New Jersey, Connecticut, Pennsylvania, Florida, Ohio and the nation as a public service and for research.

What type of design was used to collect these data?
How are these data described?
How were the data analyzed?
What conclusions (inferences) are drawn from the data?

The design here is a poll. Specifics are given as to how the poll participants were selected. This can be very important and greatly affect the results and interpretations. We do know about three variables here: gender, party affiliation, and which candidate the respondent is likely to vote for.
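The reported margin of error of +/- 3.2 percentage points for 930 likely voters can be checked with a short calculation. The sketch below assumes a simple random sample, a 95% confidence level (z = 1.96) and the worst-case proportion p = 0.5. None of these assumptions is stated in the poll release, and the function name margin_of_error is only illustrative.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion, as a fraction.

    Assumptions (not taken from the poll release): simple random sampling,
    a normal approximation, and a 95% confidence level (z = 1.96).
    p = 0.5 gives the largest possible margin for a given sample size n.
    """
    return z * math.sqrt(p * (1 - p) / n)


# 930 likely voters were surveyed in the Quinnipiac poll.
moe = margin_of_error(930)
moe_points = round(100 * moe, 1)
print(f"Margin of error for n = 930: +/- {moe_points} percentage points")
```

Under these assumptions the result rounds to 3.2 percentage points, which agrees with the reported figure. The margin applies to percentages computed from all 930 respondents; percentages for subgroups (men, women, independents) rest on fewer respondents and so carry a larger margin.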
The data are described by percentages (relative frequencies), and the total sample size is given. The data analysis was very simple: they just found the percentages of responses. The conclusions generalize these results to the general population.

Note: Polls are not always reliable, but assume that this one is. The margin of error is given as ±3.2%. What does this mean? Is this good news for Foley or Malloy? Why? Notice anything else of interest? What is the probability that Foley wins?

Some key terms:

Subjects: entities that we wish to measure.
Population: the total set of subjects that we wish to study.
Sample: a subset of the population.
Variable: a characteristic of the subject.
Design: the plan to obtain the data.
Inference: a decision or generalization about the population based on the sample.
Probability: branch of mathematics that deals with randomness and chance.
Descriptive Statistics: methods for summarizing data.
Inferential Statistics: methods for making decisions or generalizations about a population.
Parameter: a numerical summary of the population.
Statistic: a numerical summary of the sample.

General methodology: a researcher wants to know about a parameter. Because of limited resources (time, money, etc.) the researcher takes a (representative) sample from the population, calculates the statistics that will enable the estimation of the parameter, then makes an inference about the population based on the statistics and probability. The researcher may also include graphs, charts or tables to help describe the findings.

Randomness helps ensure a representative sample.
Random number generation.
Each subject of the population has an equal chance of being included in the sample.

Ex. Polio Vaccine Trial

The trial was conducted by the National Foundation for Infantile Paralysis (NFIP). First a sample of grade 3 children was selected, all of whose parents consented to vaccination. The sample would be randomly divided into two groups.
One group would be given the polio vaccination; the other group would be given a placebo (three injections of inert saltwater that would appear identical to the three injections of the real vaccine). Additionally, none of the participants would know the group identity: not the child, not the parents, and not the examining doctors. The results are listed in the table below. Does this provide evidence at the 1% significance level that the polio vaccine lowers the risk of polio?

http://wps.aw.com/wps/media/objects/14/15269/projects/ch12_salk/index.html

Treatment      Vaccine    Placebo
Sample size    200745     201229
Polio          57         142

Subjects: grade 3 children.
Population: all children (around the same age?).
Sample: the 400,000+ children given a treatment.
Variables: vaccine or placebo, and polio or not polio.
Design: randomized trial (see below).
Inference: comparing the true (population) proportions of polio for the vaccinated and unvaccinated groups.
Probability: will be stated with the inference as a significance level, like 1% or 5%.
Parameter: the proportion of all vaccinated children who get polio and the proportion of all non-vaccinated children who get polio.
Statistic: 57/200745 and 142/201229 are the sample proportions.

This type of design is known as a randomized control experiment. The randomization tends to nullify all effects (confounding variables) except the treatment effect. An experiment of this type, in which both the subjects and the evaluators are ignorant of the treatment/control status, is known as a double-blind experiment. The randomized control, double-blind design is considered the gold standard of statistical designs.

Univariate: data set with observations on a single variable.
Bivariate: data set with observations on each of two variables.
Multivariate: data set with observations on more than one variable.

Two types of data: qualitative and quantitative.

Qualitative: non-numeric data, sometimes called categorical because the data can be divided up into classes or categories.
Can be further divided into nominal and ordinal data.
Ordinal: not numeric in the true sense, but the order of the classes is inherent and important. Ex. Grade in school: Freshman, Sophomore, Junior, Senior, which could be coded as 1, 2, 3, 4.
Nominal: the order is arbitrary. Ex. Favorite color: Blue, Red, Green, Other.

Quantitative: numeric data. Can be divided into discrete and continuous.
Discrete: finite or countably infinite number of possibilities. Ex. 0, 1, 2, …
Continuous: the range of possibilities forms an interval. Ex. (0,1)

Examples:

Population: students at SCSU.
Sample: take a random sample of 100 students.
Possible variables:
GPA: quantitative, continuous.
Height in inches: quantitative, discrete.
Hair color: qualitative, nominal.
Home area code: qualitative, nominal (it is numeric, but the numbers do not count or measure).
Letter grade you received in Calculus I: qualitative, ordinal (order matters).

This is what you should have learned in a previous Statistics class. In this class we will be learning about Regression Modeling (a single dependent variable and one or more independent variables) and some design issues.
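As a review of the inference step, the polio vaccine trial above asks whether the data give evidence at the 1% significance level that the vaccine lowers the risk of polio. The notes do not say which test to use. One common choice, assumed here, is a one-sided two-proportion z-test with a pooled standard error. The function name two_proportion_z_test and the variable names are illustrative only.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """One-sided z-test of H0: p1 = p2 against Ha: p1 < p2.

    Uses the pooled sample proportion for the standard error and the
    normal approximation, which is an assumption of this sketch.
    Returns both sample proportions, the z statistic and the
    lower-tail p-value.
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # Standard normal CDF at z, written with erfc for accuracy in the tail.
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))
    return p1_hat, p2_hat, z, p_value


# Counts from the trial table: polio cases and sample sizes per group.
vaccine_rate, placebo_rate, z, p_value = two_proportion_z_test(
    57, 200745, 142, 201229
)

alpha = 0.01  # the 1% significance level named in the question
reject_null = p_value < alpha

print(f"Vaccine group polio rate: {vaccine_rate:.6f}")
print(f"Placebo group polio rate: {placebo_rate:.6f}")
print(f"z = {z:.2f}, one-sided p-value = {p_value:.2e}")
print("Reject H0 at the 1% level:", reject_null)
```

With these counts the z statistic is roughly -6, so the p-value falls far below 0.01 and the sample supports the claim that the vaccine lowers the risk of polio. The sample proportions are the statistics; the unknown population proportions are the parameters the inference is about.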