Day 1 - Southern Connecticut State University

MAT 526 Day 1
Probability is the branch of mathematics that deals with randomness and uncertainty.
It may be the most applicable of the mathematical disciplines.
“It is unlike other branches of math because they may not have Probability on Mars.”
“Statistics … the most important science in the whole world: for upon it depends the practical
application of every other science and of every art; the one science essential to all political and
social administration, all education, all organization based upon experience, for it only gives the
result of our experience.”
Florence Nightingale (1820 – 1910)
Statistics: the science that deals with the collection, description, analysis and interpretation of
data. (Mugno, 1997)
When most people hear statistics they think of descriptive statistics: numbers or graphics that
summarize a data set. Examples include batting average, median income, disease prevalence,
etc. Descriptive statistics are important but usually pretty simple.
Inferential statistics use sample data to make estimates, decisions and
predictions about a larger data set or population. This is why statistics is so important.
Four major themes to keep in mind throughout the course.
Design: How the data are collected is extremely important and will greatly affect your analysis
and interpretation. Designed experiments, surveys, polls, etc.
Description: How the data are described can strongly affect how your results are perceived by the
reader. Bad practices can produce very misleading results. Descriptive statistics, graphs, tables, etc.
Analysis: It is very important to use a proper methodology and to choose descriptive statistics
that are appropriate. Weighted, biased, etc.
Interpretation: making inferences or decisions based on the design, descriptive statistics and
analysis. This is what makes statistics so important and powerful, especially in today’s
data-driven society.
Ex.
Governor’s Race
In the governor’s race, Foley leads 89 – 7 percent among Republicans while
Malloy leads 88 – 9 percent among Democrats. Independent voters shift from 50 – 41
percent for Malloy last week to 55 – 33 percent for Foley today. Another 6 percent of
likely voters are undecided and 11 percent of those who name a candidate say they might
change their mind.
Men back Foley 51 – 43 percent, while women back Malloy 48 – 43 percent.
Malloy gets a split 44 – 41 percent favorability from Connecticut likely voters,
compared to 47 – 34 percent last week. Foley’s 48 – 34 percent favorability compares to
45 – 33 percent last week.
“The late deciders are breaking for Tom Foley. There has been a big shift among
independents in the final week of the campaign toward the Republican,” Dr. Schwartz said.
“Dan Malloy’s unfavorables have risen to the point where he gets a mixed
favorability rating for the first time.
“For Foley to win, he needs to win the independent vote by a substantial margin,
which he is now doing for the first time. But this race is too close to call. With 6 percent
still undecided there is still room for movement.
“Foley has the numerical lead and the momentum but Malloy still could pull this
out.”
From October 25 – 31, Quinnipiac University surveyed 930 Connecticut likely
voters with a margin of error of +/- 3.2 percentage points.
The Quinnipiac University Poll conducts public opinion surveys in New York,
New Jersey, Connecticut, Pennsylvania, Florida, Ohio and the nation as a public service
and for research.
What type of design was used to collect these data?
How are these data described?
How were the data analyzed?
What conclusions (inferences) are drawn from the data?
The design here is a poll. No specifics are given as to how the poll participants were selected. This
can be very important and greatly affect the results and interpretations. We do know about 3
variables here: gender, party affiliation, and which candidate the respondent is likely to vote for.
The data are described by percentages (relative frequencies), and the total sample size is given.
The data analysis was very simple: they just found the percentages of responses.
The conclusions generalize these results to the population of likely voters.
Note:
Polls are not always reliable. But assume that this one is.
The margin of error is given as ±3.2 percentage points. What does this mean?
Is this good news for Foley or Malloy? Why?
Notice anything else of interest?
What is the probability that Foley wins?
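One common way to read the ±3.2 figure: if we assume a simple random sample and a 95% confidence level (the excerpt states neither), the most conservative margin of error for a proportion is 1.96 × sqrt(0.25 / n). The sketch below just carries out that arithmetic for n = 930. The function name and its defaults are illustrative, not anything the poll publishes.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample of size n. z = 1.96 corresponds to
    95% confidence, and p = 0.5 gives the largest (most conservative) margin.
    """
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(930)
print(f"{100 * moe:.1f} percentage points")  # prints "3.2 percentage points"
```

Under these assumptions the result agrees with the reported ±3.2, which suggests the poll's figure is the usual 95% margin for the full sample. Margins for subgroups (men, women, independents) would be larger because each subgroup has fewer respondents.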
Some key terms:
Subjects: entities that we wish to measure.
Population: the total set of subjects that we wish to study.
Sample: a subset of the population.
Variable: a characteristic of the subject
Design: the plan to obtain the data.
Inference: a decision or generalization based on the sample about the population.
Probability: branch of mathematics that deals with randomness and chance
Descriptive Statistics: methods for summarizing data
Inferential Statistics: methods for making decisions or generalizations about a population.
Parameter: a numerical summary of the population.
Statistic: a numerical summary of the sample.
General methodology: a researcher wants to know about a parameter. Because of limited
resources (time, money, etc.) the researcher takes a (representative) sample from the
population, calculates the statistics that will enable the estimation of the parameter, then
makes an inference about the population based on the statistics and probability. The
researcher may also include graphs, charts, or tables to help describe the findings.
Randomness helps ensure a representative sample
Random number generation
Each subject of the population has an equal chance of being included in the sample.
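A minimal sketch of a simple random sample using Python's random number generator. The population here is made up (10,000 simulated values) purely for illustration. The point is only that the sample statistic lands near the population parameter when every subject has an equal chance of selection.

```python
import random
import statistics

random.seed(526)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 subjects, each with a made-up numeric value.
population = [random.gauss(100, 15) for _ in range(10_000)]

# Simple random sample without replacement: every subject is equally
# likely to be chosen.
sample = random.sample(population, k=100)

parameter = statistics.mean(population)  # numerical summary of the population
statistic = statistics.mean(sample)      # numerical summary of the sample

print(f"parameter = {parameter:.2f}, statistic = {statistic:.2f}")
```

In practice the parameter is unknown, which is the whole reason for sampling. It is computed here only because the population is simulated.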
Ex. Polio Vaccine Trial
The trial was conducted by the National Foundation for Infantile Paralysis (NFIP). First a sample
of grade 3 children was selected, all of whose parents consented to vaccination. The sample
would be randomly divided into two groups. One group would be given the polio vaccination;
the other group would be given a placebo (three injections of inert saltwater that would appear
identical to the three injections of the real vaccine). Additionally, none of the participants would
know the group identity--not the child, not the parents, and not the examining doctors. The
results are listed in the table below. Does this provide evidence at the 1% significance level that
the polio vaccine lowers the risk of polio?
http://wps.aw.com/wps/media/objects/14/15269/projects/ch12_salk/index.html
Treatment    Sample size    Polio cases
Vaccine      200745         57
Placebo      201229         142
Subjects: grade 3 children
Population: all children (around the same age?)
Sample: the 400,000+ children given a treatment.
Variables: Vaccine or placebo and Polio or not polio
Design: randomized trial (see below).
Inference: Comparing the true / population proportion of polio for the vaccinated and
unvaccinated groups
Probability will be stated with the inference as a significance level, like 1% or 5%
Parameter: the proportion of all vaccinated children who get polio and the proportion of all
non-vaccinated children who get polio
Statistic: 57/200745 and 142/201229 are sample proportions
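The notes do not say which test is intended for the 1% question. A pooled two-proportion z-test with a one-sided alternative (vaccine rate lower than placebo rate) is one standard choice, and it is assumed in the sketch below. The counts are the ones from the table (57 of 200745 and 142 of 201229). The function name is illustrative.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test with a lower-tail p-value (H1: p1 < p2)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0: p1 = p2
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Standard normal lower tail P(Z <= z), written with the complementary
    # error function.
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))
    return p1, p2, z, p_value

p_vaccine, p_placebo, z, p_value = two_proportion_z(57, 200745, 142, 201229)
print(f"vaccine rate = {p_vaccine:.6f}, placebo rate = {p_placebo:.6f}")
print(f"z = {z:.2f}, one-sided p-value = {p_value:.2e}")
print("significant at the 1% level:", p_value < 0.01)
```

With these counts z comes out near -6, so the one-sided p-value is far below 0.01 and the data give strong evidence that the vaccine lowers the risk of polio.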
This type of design is known as a randomized control experiment. The randomization tends to
nullify all effects (confounding variables) except the treatment effect.
An experiment of this type, in which both the subjects and the evaluators are ignorant of the
treatment/control status, is known as a double-blind experiment. The randomized control,
double-blind design is considered the gold standard of statistical designs.
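A minimal sketch of the random division into two groups described above. The subject IDs, the group labels, and the `randomize` helper are hypothetical. The actual trial's assignment procedure is not described in these notes.

```python
import random

def randomize(subject_ids, seed=None):
    """Randomly split subjects into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)  # every ordering of the subjects is equally likely
    half = len(shuffled) // 2
    return {"vaccine": shuffled[:half], "placebo": shuffled[half:]}

# Hypothetical example: 20 subjects identified by the numbers 1 to 20.
groups = randomize(range(1, 21), seed=1)
print(groups["vaccine"])
print(groups["placebo"])
```

Because chance alone decides who lands in each group, characteristics such as age, health, or family income tend to balance out between the groups, which is why the randomization tends to nullify confounding variables.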
Univariate: data set with observations on a single variable
Bivariate: data set with observations on each of two variables
Multivariate: data set with observations on more than one variable
Two types of data: qualitative and quantitative
Qualitative: non-numeric, sometimes called categorical because the data can be divided up into
classes or categories. Can be further divided into nominal and ordinal data.
Ordinal: not numeric in the true sense, but the order of classes is inherent and important.
Ex. Grade in school: Freshman, Sophomore, Junior, Senior, could be coded as 1, 2, 3, 4
Nominal: the order is arbitrary. Ex. Favorite color: Blue, Red, Green, Other
Quantitative: numeric data. Can be divided into discrete and continuous.
Discrete: finite or countably infinite number of possibilities.
0, 1, 2, …
Continuous: the range of possibilities forms an interval.
(0,1)
Examples:
Population: Students at SCSU
Sample: Take a random sample of 100 students
Possible variables:
GPA: quantitative, continuous.
Height in inches: quantitative, continuous (discrete only if recorded to the nearest whole inch)
Hair color: qualitative, nominal
Home area code: qualitative, nominal (it is numeric, but the numbers do not count or measure)
Letter grade you received in Calculus I: qualitative and ordinal. (order matters)
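The type of a variable determines which summaries make sense. The sketch below uses made-up records for five students (not real SCSU data) and an assumed A to F coding for the ordinal grades.

```python
import statistics

# Hypothetical records for five students (made-up values for illustration).
gpa = [3.2, 2.8, 3.9, 3.5, 3.0]                       # quantitative, continuous
hair = ["brown", "black", "brown", "blond", "brown"]  # qualitative, nominal
calc_grade = ["B", "A", "C", "B", "A"]                # qualitative, ordinal

# Ordinal data can be coded with numbers that preserve the order.
grade_code = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}
coded = [grade_code[g] for g in calc_grade]

mean_gpa = statistics.mean(gpa)        # a mean is meaningful for quantitative data
modal_hair = statistics.mode(hair)     # for nominal data, use counts and the mode
median_code = statistics.median(coded) # the median relies only on order

print(mean_gpa, modal_hair, median_code)
```

Averaging the hair colors is meaningless, and averaging the grade codes is questionable because the gaps between codes are arbitrary. Only the order of the codes carries information.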
This is what you should have learned in a previous Statistics class. In this class we will be
learning about Regression Modeling (a single dependent variable and one or more
independent variables) and some Design issues.