MAT 107 Day 1 - Southern Connecticut State University

MAT 320
Probability is the branch of mathematics that deals with randomness and uncertainty.
It may be the most applicable of the mathematical disciplines.
“It is unlike other branches of math because they may not have Probability on Mars.”
“Statistics … the most important science in the whole world: for upon it depends the practical
application of every other science and of every art; the one science essential to all political and
social administration, all education, all organization based upon experience, for it only gives the
result of our experience.”
Florence Nightingale (1820 – 1910)
Statistics: the science that deals with the collection, description, analysis and interpretation of
data. (Mugno, 1997)
When most people hear the word statistics, they think of descriptive statistics: numbers or graphics that
summarize a data set. Examples include batting average, median income, disease prevalence,
etc. Descriptive statistics are important but usually fairly simple.
Inferential statistics use sample data to make estimates, decisions, and
predictions about a larger data set or population. This is why statistics is so important.
Four major themes to keep in mind throughout the course.
Design: How the data are collected is extremely important and will greatly affect your analysis
and interpretation. Examples: designed experiments, surveys, polls, etc.
Description: How the results are described strongly affects how the reader perceives them. Bad
practices can produce very misleading results. Examples: descriptive statistics, graphs, tables, etc.
Analysis: It is very important to use a proper methodology and to choose appropriate descriptive
statistics (weighted, biased, etc.).
Interpretation: making inferences or decisions based on the design, descriptive statistics, and
analysis. This is what makes statistics so important and powerful, especially in today’s
data-driven society.
Ex.
THE FOLLOWING RESULTS ARE BASED ON A POLL OF 664 LIKELY REPUBLICAN PRIMARY
VOTERS
1REP. If the Republican primary for Governor were being held today, would you
vote for Tom Foley, Mike Fedele, or Oz Griebel? (If undecided q1REP) As of
today, would you say that you lean a little more toward Foley, Fedele, or
Griebel? (This table includes "Leaners".)
LIKELY REP PRIMARY VOTERS
                     Tot    Men    Wom
Foley                38%    43%    33%
Fedele               30     27     33
Griebel              17     18     16
SMONE ELSE(VOL)
WLDN'T VOTE(VOL)
DK/NA                14     12     17
From August 3 - 8, Quinnipiac University surveyed 664 Connecticut Republican likely
primary voters with a margin of error of +/- 3.8 percentage points and 464 Democratic
likely primary voters with a margin of error of +/- 4.6 percentage points. These likely
voters were selected from lists of people who have voted in past elections.
The Quinnipiac University Poll conducts public opinion surveys in New York,
New Jersey, Connecticut, Pennsylvania, Florida, Ohio and the nation as a public service
and for research.
What type of design was used to collect these data?
How are these data described?
How was the data analyzed?
What conclusions (inferences) are drawn from the data?
The design here is a poll. Specifics are given as to how the poll participants were selected. This
can be very important and can greatly affect the results and interpretations. We know about 3
variables here: gender, party affiliation (R), and which candidate the respondent is likely to vote for.
The data are described by percentages (relative frequencies), and the total is given. They are
placed in a table (a two-dimensional frequency table, or contingency table).
The data analysis was very simple: the pollsters just found the percentages of responses.
The conclusions generalize these results from the sample to the population of likely Republican primary voters.
Note:
Polls are not always reliable. But assume that this one is.
The margin of error is given as ±3.8 percentage points. Is this good news for Foley or Fedele? Why?
DK/NA is 14%. Is this good news for Foley or Fedele? Why?
Notice anything else of interest?
What is the probability that Foley wins?
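The reported margin of error can be roughly reproduced. The sketch below assumes the conventional 95% worst-case formula z * sqrt(p(1 - p)/n) with z = 1.96 and p = 0.5; the poll release does not say which formula Quinnipiac used, so this is an illustration, not their method. The function name `margin_of_error` is made up for this example.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion (assumed formula)."""
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(664)  # 664 likely Republican primary voters
print(f"n = 664: +/- {100 * moe:.1f} percentage points")

# Compare the low end of Foley's range with the high end of Fedele's range.
foley, fedele = 0.38, 0.30
print(f"Foley could be as low as {foley - moe:.3f}")
print(f"Fedele could be as high as {fedele + moe:.3f}")
```

When the two ranges are this close, the lead is not comfortable, and the 14% DK/NA group could still move the result either way.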
Some key terms:
Subjects: entities that we wish to measure.
Population: the total set of subjects that we wish to study.
Sample: a subset of the population.
Variable: a characteristic of the subject
Design: the plan to obtain the data.
Inference: a decision or generalization based on the sample about the population.
Probability: branch of mathematics that deals with randomness and chance
Descriptive Statistics: methods for summarizing data
Inferential Statistics: methods for making decisions or generalizations about a population.
Parameter: a numerical summary of the population.
Statistic: a numerical summary of the sample.
General methodology: a researcher wants to know about a parameter. Because of limited
resources (time, money, etc.) the researcher takes a (representative) sample from the
population, calculates the statistics that will enable the estimation of the parameter, then
makes an inference about the population based on the statistics and probability. The
researcher may also include graphs charts or tables to help describe the findings.
Randomness helps ensure a representative sample.
Random number generation can be used to select the sample.
In a simple random sample, each subject of the population has an equal chance of being included in the sample.
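As a minimal sketch of drawing such a sample with random number generation, the code below picks 100 subjects from a hypothetical population of 1,000 ID numbers. The population, the sample size, and the seed are all assumptions made for illustration.

```python
import random

# Hypothetical population: ID numbers 1 through 1000.
population = list(range(1, 1001))

# A seeded generator makes the example repeatable; the seed value is arbitrary.
rng = random.Random(107)

# random.sample draws without replacement, so no subject is chosen twice
# and every subject has the same chance of being chosen.
sample = rng.sample(population, 100)
print(sorted(sample)[:10])
```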
Ex. Polio Vaccine Trial
The trial was conducted by the National Foundation for Infantile Paralysis (NFIP). First, a sample
of 3rd-grade children was selected, all of whose parents consented to vaccination. The sample
would be randomly divided into two groups. One group would be given the polio vaccination;
the other group would be given a placebo (three injections of inert saltwater that would appear
identical to the three injections of the real vaccine). Additionally, none of the participants would
know the group identity--not the child, not the parents, and not the examining doctors. The
results are listed in the table below. Does this provide evidence at the 1% significance level that
the polio vaccine lowers the risk of polio?
http://wps.aw.com/wps/media/objects/14/15269/projects/ch12_salk/index.html
Treatment    Sample size    Polio
Vaccine      200745         57
Placebo      201229         142
Subjects: 3rd-grade children
Population: all children
Sample: the 400,000+ children given a treatment.
Variables: Vaccine or placebo and Polio or not polio
Design: randomized trial (see below).
Inference: Comparing the true / population proportion of polio for the vaccinated and unvaccinated
groups
Probability will be stated with the inference as a significance level, like 1% or 5%
Parameter: the proportion of all vaccinated children who get polio and the proportion of all
non-vaccinated children who get polio
Statistic: 57/200745 and 142/201229 are sample proportions
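The sample proportions can be computed directly from the table. The notes do not say which test answers the 1% significance question; the sketch below assumes a one-sided two-proportion z-test with a pooled standard error, which is one standard choice and not necessarily the one used in the course.

```python
import math

n_vaccine, polio_vaccine = 200745, 57
n_placebo, polio_placebo = 201229, 142

# Sample proportions (the statistics).
p_vaccine = polio_vaccine / n_vaccine
p_placebo = polio_placebo / n_placebo

# Pooled proportion and standard error, assuming no true difference.
pooled = (polio_vaccine + polio_placebo) / (n_vaccine + n_placebo)
std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_vaccine + 1 / n_placebo))

# A negative z means the vaccine group had the lower polio rate.
z = (p_vaccine - p_placebo) / std_error

# Lower-tail probability P(Z <= z) for a standard normal variable.
p_value = 0.5 * math.erfc(-z / math.sqrt(2))

print(f"vaccine: {p_vaccine:.6f}  placebo: {p_placebo:.6f}")
print(f"z = {z:.2f}, one-sided p-value = {p_value:.2e}")
print("Evidence at the 1% level" if p_value < 0.01 else "No evidence at the 1% level")
```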
This type of design is known as a randomized control experiment. The randomization tends to
nullify all effects (confounding variables) except the treatment effect.
An experiment of this type, in which both the subjects and the evaluators are ignorant of the
treatment/control status, is known as a double-blind experiment. The randomized control,
double-blind design is considered the gold standard of statistical designs.
Univariate: data set with observations on a single variable
Bivariate: data set with observations on each of two variables
Multivariate: data set with observations on more than one variable
Concrete vs. conceptual populations
concrete population: when the population really exists.
examples: all college students, all voters, all black widow spiders, all widgets produced in a
factory.
conceptual population: when the population does not actually exist.
examples: all speeds a car can crash at, all altitudes a plane can fly at, all temperatures possible
for making widgets.
Enumerative vs. Analytical studies
unchanging and finite vs. futuristic
Enumerative: all college students in 2008, all Toyota Camrys in 2006
Analytical: All college students over the next 3 years, all hybrids until 2010
Chapter 1.2
Two types of data: qualitative and quantitative
Qualitative: non-numeric data, sometimes called categorical because the data can be divided up into
classes or categories. Can be further divided into nominal and ordinal data.
Ordinal: not numeric in the true sense, but the order of the classes is inherent and important.
Ex. Grade in school: Freshman, Sophomore, Junior, Senior, which could be coded as 1, 2, 3, 4.
Nominal: the order is arbitrary. Ex. Favorite color: Blue, Red, Green, Other.
Quantitative: numeric data. Can be divided into discrete and continuous.
Discrete: finite or countably infinite number of possibilities.
0, 1, 2, …
Continuous: range of possibilities form an interval.
(0,1)
Examples:
Population: Students at SCSU
Sample: Take a random sample of 100 students
Possible variables:
GPA: quantitative, continuous.
Height in inches: quantitative, discrete (when recorded to the nearest inch)
Hair color: qualitative, nominal
Home area code: qualitative, nominal ( it is numeric, but the numbers do not count or measure)
Letter grade you received in Calculus I: qualitative and ordinal. (order matters)
Graphical Displays (review these if you need to)
For summarizing categorical data the primary displays are Pie Charts and Bar Graphs
Pie Charts: circles where each “slice” represents a category and the size of each slice
corresponds to the proportion or percentage of observations in that category.
Bar Graphs display a vertical bar for each category. The height of the bar is the percentage or
proportion of observations in that category.
The proportion of observations in a class or category is the frequency of observations that fall
in the class divided by the total number of observations. The percentage is the proportion
times 100. Proportions and percentages are both known as relative frequencies.
A frequency table is a listing of all the classes and their corresponding frequencies. It is
necessary to create a frequency table before making a pie chart or bar graph or histogram
(later).
Examples:
Nominal: Cell phone carrier.
Ordinal: Year at college
Graphs for quantitative variables.
Dot plots: a dot for each observation is placed above the appropriate number on a number line.
Stem and Leaf Plots: each observation is represented as a stem and leaf. The stem usually
consists of all the digits of the number except for the last one, which is the leaf.
Dot plots and stem and leaf plots are only reasonable for small data sets.
Ex. Heights.
For larger Data sets a Histogram is used.
A Histogram is a graph that uses bars to portray the frequencies or relative frequencies of the
possible outcomes for a quantitative variable.
Steps for constructing a Histogram
1. Divide the range of the data into classes, non-overlapping intervals of equal length. For a
discrete set of data with a small number of values, use the actual values as the classes.
2. Count the number of observations in each class, forming a frequency table.
3. On the horizontal axis, label the values or the endpoints of the intervals. Draw a bar over
each class or value with height equal to its frequency (or percentage). The vertical axis
should be scaled and labeled with either the raw or relative frequencies. Both the
horizontal and vertical axes should be scaled so that all the classes and frequencies fit and
are distinguishable.
Histograms are one of the most misunderstood concepts in statistics. A histogram is simply a
bar graph of the frequency distribution of the data.
Histogram example.
Heights of students (inches) from a previous class.
Heights: 57, 59, 66, 60, 70, 47, 61, 55, 57, 71, 74, 48, 70, 67, 62, 62, 58, 55, 62, 68
We have 20 data points and want to break them up into 4 or 5 classes.
Guideline: if n is the total number of observations and k is the number of classes, then k ≈ √n (here √20 ≈ 4.5).
The range of the data = max – min = 74 – 47 = 27. Note that there are 28 integers between 47
and 74 if you include the endpoints.
So, 4 classes that are 7 units long will work fine.
The classes would then be [47, 53], [54, 60], [61, 67], [68, 74]. Note that all the data points are
included and no point is in more than one class.
The frequency distribution would then be:
Class       Freq
[47, 53]    2
[54, 60]    7
[61, 67]    6
[68, 74]    5
Note that the total is 20. Now just make a bar graph of the frequency distribution.
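The counting step can be checked with a short sketch. It uses the 20 heights and the four classes from this example; the text bars are only a rough stand-in for the histogram.

```python
heights = [57, 59, 66, 60, 70, 47, 61, 55, 57, 71,
           74, 48, 70, 67, 62, 62, 58, 55, 62, 68]

# Classes are closed intervals [low, high], as in the example.
classes = [(47, 53), (54, 60), (61, 67), (68, 74)]

# Count the observations that fall in each class.
frequencies = [sum(low <= h <= high for h in heights) for low, high in classes]

for (low, high), freq in zip(classes, frequencies):
    print(f"[{low}, {high}]  {freq:2d}  {'#' * freq}")

print("Total:", sum(frequencies))
```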
[Figure: Histogram of Heights. Horizontal axis: Height in inches, with classes [47, 53], [54, 60], [61, 67], [68, 74]. Vertical axis: Freq (number), scaled from 0 to 8.]
Note that the axes are labeled and the histogram is titled and there are no gaps in the bars.
For Quantitative data there are 3 kinds of plots:
Dot plots
Stem and leaf plots
Histograms
Dot plots and Stem and leaf plots are used for small data sets (under 50 observations).
Histograms are more flexible, because of classes
Histograms and dot plots and stem-and-leaf-plots allow us to see the shape of the distribution.
1. Outlier detection: rare or unusual observations.
2. The mode, or most common observation or class: unimodal vs. bimodal.
3. Symmetry of the data set.
   a. Symmetric: when you divide the histogram down the middle, the left side
      is a mirror image of the right side.
   b. Skewed left: if the left tail of the histogram is longer than the right tail. The
      small observations are more extreme than the large observations.
   c. Skewed right: if the right tail of the histogram is longer than the left tail. The
      large observations are more extreme than the small observations.
Another Histogram example:
Page 22, #20
0|123334555599
1|00122234688
2|1112344477
3|0113338
4|37
5|23778
Bin             Frequency
(0, 1000]       13
(1000, 2000]    10
(2000, 3000]    10
(3000, 4000]    7
(4000, 5000]    2
(5000, 6000]    5
23 points < 2000: 23/47 = .489
17 points between 2000 and 4000: 17/47 = .362
Positively skewed (skewed right). (Note: the problem really should have specified left or right endpoint inclusion.)
Measures of Central Tendency
Mean: average
Sample mean: the average of the sample.
This is a capital sigma: ∑
It means to take the sum.
Symbolically we write the formula for the sample mean as:
$\bar{x} = \frac{\sum x}{n}$ (read as x-bar)
x-bar is used to estimate μ = population mean
Median: the middle
Sample median: the middle of the sample
Symbolically we will write the sample median as x~.
To find the sample median:
1. Sort the n observations (in ascending order).
2. If n is odd, let k = (n + 1) / 2. Then x~ = the kth observation.
3. If n is even, let k = n / 2 and j = (n + 2) / 2.
   Then x~ is the average of the kth and the jth observations.
x~ is used to estimate μ~ = the population median.
Outlier: an observation that falls outside the overall pattern of the data.
Ex.
The following sample consists of 10 scores from a test given last semester.
75, 84, 86, 68, 93, 97, 32, 90, 80, 70
Find the mean. Find the median. Make a dot plot.
Are there any outliers? If so identify them.
∑x = 775 and n = 10, so the mean = 775 / 10 = 77.5
Mean = 77.5
Sort the data.
32, 68, 70, 75, 80, 84, 86, 90, 93, 97
n = 10, 10 / 2 = 5 and 12/2 = 6, so the median is the average of the 5th and 6th observations
= (80 + 84)/ 2 = 82.
Median = 82
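The hand calculation above can be checked with a short sketch that follows the same steps: sum and divide for the mean, then sort and pick the middle observation (or average the two middle ones) for the median. The function names are made up for this example.

```python
def sample_mean(xs):
    """x-bar: the sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_median(xs):
    """x~: the middle of the sorted observations."""
    ordered = sorted(xs)
    n = len(ordered)
    if n % 2 == 1:
        k = (n + 1) // 2              # position of the middle observation
        return ordered[k - 1]         # lists are 0-indexed, positions are 1-indexed
    k, j = n // 2, (n + 2) // 2       # positions of the two middle observations
    return (ordered[k - 1] + ordered[j - 1]) / 2

scores = [75, 84, 86, 68, 93, 97, 32, 90, 80, 70]
print("mean   =", sample_mean(scores))
print("median =", sample_median(scores))
```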
Dot plot:
[Figure: dot plot of the 10 scores above a number line labeled from 30 to 90.]
32 seems to be an outlier.
How can outliers affect the mean and the median?
Assume that the person who got the 32 drops the class, because the student got really sick. The
remaining (sorted) data look like:
68, 70, 75, 80, 84, 86, 90, 93, 97
So now n = 9 and ∑x = 743, so
Mean = 743 / 9 ≈ 82.6
Median = 5th observation = 84.
The mean increased by about 5 points, but the median only increased by 2 points.
The mean is a weighted measure, whereas the median is a resistant measure.
A measure is resistant if extreme observations have little, if any, effect on it.
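A brief sketch of this comparison, using Python's standard `statistics` module on the test scores from the example: it drops the 32 and reports how far each measure moves.

```python
from statistics import mean, median

scores = [75, 84, 86, 68, 93, 97, 32, 90, 80, 70]
without_outlier = [x for x in scores if x != 32]

# How much each measure of center changes when the outlier is removed.
mean_shift = mean(without_outlier) - mean(scores)
median_shift = median(without_outlier) - median(scores)

print(f"mean moves by   {mean_shift:.2f}")
print(f"median moves by {median_shift:.2f}")
```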
To calculate the mean, the median, and some other important statistics on the TI-83/TI-84:
1. Enter the data into a list.
   Hit [STAT].
   Choose 1: Edit.
   In L1, enter the scores. Ex. 75 [ENTER] 84 [ENTER] … 70 [ENTER]
2. Hit [STAT]. Hit the right arrow to highlight CALC. Choose
   1:1-Var Stats and hit [ENTER].
   The screen should read: 1-Var Stats. Then enter L1 by hitting [2nd] [1],
   so that the screen reads: 1-Var Stats L1.
   Hit [ENTER].
The output should look like:
1-Var Stats
x-bar = 77.5
∑x = 775
∑x² = 63183
Sx = 18.62047857
σx = 17.66493702
n = 10 (to see more hit the down arrow)
minX = 32
Q1 = 70
Med = 82
Q3 = 90
maxX = 97
Sx = sample standard deviation.
σx = population standard deviation.
Q1 is the first quartile = 25th percentile.
Q3 is the third quartile = 75th percentile.
More on these later.
Trimmed mean of p percent: removes the top p% and the bottom p% of the observations and then finds the
mean of what remains. This is a compromise between x-bar and x~. It is a weighted measure that is more
resistant to outliers than x-bar.
Ex. The 10% trimmed mean of the data below is:
Data (sorted): 32, 68, 70, 75, 80, 84, 86, 90, 93, 97
After trimming: 68, 70, 75, 80, 84, 86, 90, 93
10% trimmed mean = 646 / 8 = 80.75
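The trimming can be sketched as follows. The function name `trimmed_mean` is made up for this example, and rounding the number of trimmed observations down when n times p/100 is not a whole number is an assumption; textbooks handle that case in different ways.

```python
def trimmed_mean(xs, percent):
    """Mean after dropping `percent`% of the observations from each end."""
    ordered = sorted(xs)
    # Number of observations to drop from each end (rounded down: an assumption).
    k = int(len(ordered) * percent / 100)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

scores = [32, 68, 70, 75, 80, 84, 86, 90, 93, 97]
print(trimmed_mean(scores, 10))   # drops 32 and 97, then averages the other 8
```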
Ex.
An airline company is wondering about the number of cancellations it receives for a specific
commuter flight. The airline takes a random sample of 15 days. The data is listed below. Find
the mean and the median for the sample. Make a dot plot of the data. Are there any outliers?
Describe the symmetry of the data.
4, 24, 17, 17, 9, 12, 9, 12, 13, 14, 14, 15, 15, 16, 16.
x-bar = 13.8
x~ = 14
Another way to determine symmetry:
Data are symmetric if x-bar = x~ ( does not have to be exact, within 10%)
Data are skewed right if x-bar > x~
Data are skewed left if x-bar < x~
These data appear to be symmetric.
There are no clear outliers, but one could argue that both 4 and 24 are outliers.
Measures of Variability
First we measured the center of the data, the mean and the median.
We also looked at the shape of the data, unimodal or bimodal, symmetric or skewed.
Now we look at how spread out the data are.
The first measure is simple but does not tell us much about the spread.
The range is the difference between the largest and smallest observations.
Range = max - min
A better measure would summarize the deviations from the center of the data.
A deviation of an observation x from the mean xbar is (x - xbar), the difference.
A deviation is positive if x is bigger than xbar.
A deviation is negative if x is smaller than xbar.
Unfortunately if we sum all the deviations of any data set, we get 0, because of how xbar is
defined.
So before we sum up the deviations, we square them, which makes them all nonnegative.
The sum of these squared deviations divided by n - 1 (roughly their average) is called the variance and is denoted by s².
The formulae for s and s² are given in your text on pages 32 and 34.
The square root of s² is s, which is called the standard deviation.
The bigger the standard deviation, the more spread out the data are.
We use the standard deviation more often than the variance because the standard deviation is
in the same units as the data and the mean.
s² is used to estimate σ² = population variance.
s is used to estimate σ = population standard deviation.
We will not use the formulae much because your calculator will do it for you.
Remember that under 1-Var Stats there was Sx, which is the standard deviation. Technically this is
the sample standard deviation, which is what we want.
Ex.
A random sample of 10 grades is given below. Calculate the mean, and standard deviation of
the sample.
Grades (x)    (x - xbar)    (x - xbar)^2
95             17.8          316.84
87              9.8           96.04
45            -32.2         1036.84
76             -1.2            1.44
76             -1.2            1.44
82              4.8           23.04
68             -9.2           84.64
63            -14.2          201.64
92             14.8          219.04
88             10.8          116.64
Sum             0.000       2097.600
x-bar = 77.2
s^2 = 2097.600 / 9 = 233.067
s = 15.267
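The columns of this calculation can be reproduced with a short sketch. It assumes the sample variance divides the sum of squared deviations by n - 1, which is what the figures in the example imply (2097.6 / 9 = 233.067).

```python
import math

grades = [95, 87, 45, 76, 76, 82, 68, 63, 92, 88]
n = len(grades)

xbar = sum(grades) / n
deviations = [x - xbar for x in grades]    # (x - xbar) column
squared = [d ** 2 for d in deviations]     # (x - xbar)^2 column

variance = sum(squared) / (n - 1)          # sample variance s^2
std_dev = math.sqrt(variance)              # sample standard deviation s

print(f"x-bar = {xbar:.1f}")
print(f"sum of deviations = {sum(deviations):.3f}")
print(f"s^2 = {variance:.3f}")
print(f"s = {std_dev:.3f}")
```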
Interpreting the standard deviation.
In general, the greater the spread the greater s is.
Also, s = 0 means that there is no deviation, which only happens when all the observations are
the same.
For example, if your data set was: 20, 20, 20, 20, 20, 20.
s = 0.
Proposition:
Let x1, x2, x3, …, xn be a sample and c be any non-zero constant. Then:
a. if y1 = x1 + c, y2 = x2 + c, …, yn = xn + c, then $s_y^2 = s_x^2$, and
b. if y1 = cx1, y2 = cx2, …, yn = cxn, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$.
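A numerical check of the proposition, as a sketch only: it uses the standard `statistics` module, the grades from the earlier example, and an arbitrary constant c = -3 chosen for illustration.

```python
import math
from statistics import stdev, variance

x = [95, 87, 45, 76, 76, 82, 68, 63, 92, 88]
c = -3   # arbitrary non-zero constant (an assumption for illustration)

shifted = [xi + c for xi in x]   # part a: add c to every observation
scaled = [c * xi for xi in x]    # part b: multiply every observation by c

print("shift keeps the variance:   ", math.isclose(variance(shifted), variance(x)))
print("scale multiplies it by c^2: ", math.isclose(variance(scaled), c ** 2 * variance(x)))
print("std dev is multiplied by |c|:", math.isclose(stdev(scaled), abs(c) * stdev(x)))
```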
Measures of Relative Standing and Boxplots
The pth percentile is a value such that p percent of the observations fall below or at that value.
You have probably seen percentiles on standardized tests.
The median is a percentile, the 50th.
Three useful percentiles that we will use are the quartiles.
The first quartile is called Q1 and is the 25th percentile. It is also the median of the lower half of
the data.
The median is the second quartile, called Q2.
The third quartile is called Q3 and is the 75th percentile. It is also the median of the upper half
of the data.
The TI calculators can calculate all 3 of them for you.
Some people look at Q0 as the minimum observation and Q4 as the maximum observation.
These 5 numbers together are called the 5-number-summary of the data.
These numbers can be used to detect outliers and to create a visual display of the data called a box
plot.
First we need to calculate the interquartile range (IQR = Q3 - Q1), also called the fourth spread (fs).
Constructing a box-plot
1. Calculate the 5-Number Summary.
2. A box is drawn from Q1 to Q3. (vertical lines at the quartiles)
3. A (vertical) line is drawn at the median.
4. A whisker (horizontal line) is drawn from Q1 to the smallest observation that is bigger than
Q1 - 1.5*IQR. A whisker (horizontal line) is drawn from Q3 to the largest observation that is
smaller than Q3 + 1.5*IQR. Any observation that is outside the whiskers, either less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR, is a potential outlier.
Ex.
Grades (x): 95, 87, 35, 76, 76, 82, 68, 63, 92, 88
Sorted x (descending): 95, 92, 88, 87, 82, 76, 76, 68, 63, 35
Median = (76 + 82) / 2 = 79
Q1 = 68
Q3 = 88
Min = 35
Max = 95
IQR = 88 – 68 = 20
1.5 * IQR = 30
68 – 30 = 38
Since 35 < 38 it is a potential outlier. (35 is more extreme than 38)
88 + 30 = 118
Since 95 < 118 it is NOT a potential outlier. (95 is less extreme than 118)
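The same steps can be sketched in code. Quartiles are taken here as the medians of the lower and upper halves of the sorted data, leaving the middle observation out of both halves when n is odd; that choice is an assumption, since textbooks and calculators differ on it. The function names are made up for this example.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(xs):
    ordered = sorted(xs)
    n = len(ordered)
    lower = ordered[: n // 2]            # lower half of the data
    upper = ordered[(n + 1) // 2 :]      # upper half of the data
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

grades = [95, 87, 35, 76, 76, 82, 68, 63, 92, 88]
minimum, q1, med, q3, maximum = five_number_summary(grades)

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
potential_outliers = [x for x in grades if x < lower_fence or x > upper_fence]

print("5-number summary:", minimum, q1, med, q3, maximum)
print("fences:", lower_fence, upper_fence)
print("potential outliers:", potential_outliers)
```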
Note that your book distinguishes between outliers and extreme outliers:
Outlier: any observation x that is more than 1.5 IQR from the closest quartile (Q1, Q3).
Q1 – x > 1.5IQR or x – Q3 > 1.5IQR
Extreme Outlier: any observation x that is more than 3 IQR from the closest quartile (Q1, Q3).
Q1 – x > 3IQR or x – Q3 > 3IQR
They mark outliers with a solid circle and extreme outliers with an open circle.
Boxplots can be used to compare two sets of observations. They are usually graphed next to
each other.
Also note that Boxplots can be vertical or horizontal.
Misleading your audience with statistics.
Guidelines for Constructing Effective Graphs
1. Label both axes and provide title.
2. Compare relative sizes accurately, scale correctly! Y axis should start at 0
3. Use standard shapes and symbols.
4. Displaying more than one group on a single graph can be difficult.
5. Mean vs. Median (Baseball examples)
6. Percents vs. Frequencies
7. Simpson’s Paradox.
Don’ts
1. Do not use scale breaks in any of your axes!
2. When making a histogram, use classes and bars of the same width.
3. Do not make inferences about the population from one simple statistic like the mean,
especially when you have a small sample size.
Simpson’s Paradox.
A baseball example:
Batting average = number of hits / number of qualifying at bats.
Who has the better batting average?
               A (Hits/AB)    Avg     B (Hits/AB)    Avg
Vs Lefties     200/500        .400    5/10           .500
Vs Righties    30/100         .300    210/590        .356
Batter B is better vs. left-handed pitching and better vs. right-handed pitching.
               A (Hits/AB)    Avg     B (Hits/AB)    Avg
Vs Lefties     200/500        .400    5/10           .500
Vs Righties    30/100         .300    210/590        .356
Totals         230/600        .383    215/600        .358
But overall batter A has a better average.
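A short sketch that recomputes the averages from the hits and at-bats in the table shows the reversal. The dictionary layout and helper names are made up for this example.

```python
# (hits, at bats) for each batter against each type of pitcher.
records = {
    "A": {"lefties": (200, 500), "righties": (30, 100)},
    "B": {"lefties": (5, 10), "righties": (210, 590)},
}

def average(hits, at_bats):
    return hits / at_bats

def overall(batter):
    """Batting average over all at bats: total hits / total at bats."""
    hits = sum(h for h, _ in records[batter].values())
    at_bats = sum(ab for _, ab in records[batter].values())
    return average(hits, at_bats)

for split in ("lefties", "righties"):
    a = average(*records["A"][split])
    b = average(*records["B"][split])
    print(f"vs {split}: A = {a:.3f}, B = {b:.3f}, better: {'A' if a > b else 'B'}")

print(f"overall: A = {overall('A'):.3f}, B = {overall('B'):.3f}")
```

Batter B wins each split, yet batter A wins overall, because most of A's at bats came against lefties, where averages were higher, while most of B's came against righties.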
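Simpson’s Reversal of Inequalities:
• E.H. Simpson in 1951 noted that
$\frac{a}{b} < \frac{A}{B}$ and $\frac{c}{d} < \frac{C}{D}$
but
$\frac{a + c}{b + d} > \frac{A + C}{B + D}$
can all hold at once.
Recall that the totals in the batting example were 230/600 = .383 for batter A and 215/600 = .358 for batter B, and that in general
$\frac{a}{b} + \frac{c}{d} \neq \frac{a + c}{b + d}$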
Hiring Practices at University of California at Berkeley:
Was there a hiring preference given to males?
Dept         Men           Women
History      1/5 (20%)     2/8 (25%)
Geography    6/8 (75%)     4/5 (80%)
Total        7/13 (54%)    6/13 (46%)
Note first that the overall percentage of males hired (54%) is greater than that for
females (46%).
However, both the History department (25% to 20%) and the Geography department
(80% to 75%) hired a greater percentage of females.
Two important aspects to note:
First, the History department and the Geography department do not talk to each
other before making a hire.
Second, more females applied for the jobs that were harder to get: History only
had 3 positions available, whereas Geography had 10 positions available.
Knowing these two facts, there is no evidence of gender discrimination (in favor of
males, anyway).
These data come from a real case that was actually brought before the California
legislature, which, even though it was presented with similar evidence, concluded
that something like this could never happen and that Berkeley was guilty of gender
bias.
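If
$\frac{a}{b} = \frac{2a}{2b}$ and $\frac{c}{d} = \frac{3c}{3d}$,
does that mean that
$\frac{a + c}{b + d} = \frac{2a + 3c}{2b + 3d}$?
Example 1
$\frac{1}{2} = \frac{2}{4}$ and $\frac{1}{3} = \frac{3}{9}$
but
$\frac{1 + 1}{2 + 3} \neq \frac{2 + 3}{4 + 9}$, that is, $\frac{2}{5} \neq \frac{5}{13}$
.4 > .385
Example 2
$\frac{1}{2} = \frac{3}{6}$ and $\frac{1}{3} = \frac{2}{6}$
but
$\frac{1 + 1}{2 + 3} \neq \frac{3 + 2}{6 + 6}$, that is, $\frac{2}{5} \neq \frac{5}{12}$
.4 < .417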
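The two examples can be checked with a small sketch. Adding numerators and denominators separately is sometimes called taking the mediant; that name and the helper function are not from the notes. Fractions are kept as (numerator, denominator) pairs so that 2/4 is not automatically reduced to 1/2.

```python
def mediant(p, q):
    """Combine a/b and c/d as (a + c) / (b + d)."""
    (a, b), (c, d) = p, q
    return (a + c) / (b + d)

# The same two values written with different numerators and denominators.
base = mediant((1, 2), (1, 3))        # (1 + 1) / (2 + 3)
example_1 = mediant((2, 4), (3, 9))   # (2 + 3) / (4 + 9)
example_2 = mediant((3, 6), (2, 6))   # (3 + 2) / (6 + 6)

print(f"{base:.3f} vs {example_1:.3f} (Example 1)")
print(f"{base:.3f} vs {example_2:.3f} (Example 2)")
```

The combined value depends on how each fraction is written, that is, on how much weight each denominator carries, which is exactly what drives the reversal.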
Conclusions:
Be very careful how and when you add ratios or fractions.
Be very careful how you interpret your findings when adding ratios or fractions.
Bibliography
Simpson, E.H. (1951), “The interpretation of interaction in contingency tables.”
http://en.wikipedia.org/wiki/Simpson%27s_paradox
http://en.wikipedia.org/wiki/Low_birth_weight_paradox
http://plato.stanford.edu/entries/paradox-simpson/