1. Basic Probability
Probability provides the foundation for statistical inference. For our purposes,
probability will mean objective probability as opposed to subjective probability,
which describes an individual’s personal judgement about how likely a particular
event is to occur, and is not based on any precise computation.
Types of Objective Probability
(1) Classical or theoretical (a priori)
If an event can occur in N mutually exclusive (cannot occur simultaneously)
and equally likely ways, and if m of these possess a trait E, the probability of
the occurrence of E is $m/N$, and is written
$$P(E) = \frac{m}{N}.$$
Example. In rolling a fair die, if E is rolling an odd number,
$$P(E) = \frac{3}{6} = \frac{1}{2}.$$
(2) Relative frequency or experimental (a posteriori)
If some process is repeated a large number of times, say n times, and if some
resulting event E occurs m times, the relative frequency of occurrence of E,
$m/n$, will be approximately equal to the probability of E, i.e.,
$$P(E) \approx \frac{m}{n}.$$
Example. Not knowing whether a die is fair, we roll it 6000 times and note
that 4 occurs 970 times, so
$$P(E) \approx \frac{970}{6000} = .161667.$$
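The relative frequency idea can be tried out with a short simulation. This is only a sketch: the die is simulated as fair with Python's `random` module, and the seed and the function name `relative_frequency` are our own choices, not part of the example above.

```python
import random

def relative_frequency(face, n_rolls, rng):
    """Roll a simulated fair die n_rolls times; return the fraction of rolls showing `face`."""
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == face)
    return hits / n_rolls

rng = random.Random(2024)      # fixed seed (an arbitrary choice) so the run is repeatable
classical = 1 / 6              # a priori probability of any single face of a fair die
estimate = relative_frequency(4, 6000, rng)
print(f"classical P(4) = {classical:.6f}, relative frequency = {estimate:.6f}")
```

For a fair die the two numbers should be close but not identical. A relative frequency that stays far from 1/6 over many rolls would be evidence that the die is not fair.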
3 Properties of Probability
Given a process with n mutually exclusive outcomes (events) E1, E2, . . . , En:
(1) P (Ei) ≥ 0.
(2) P (E1) + P (E2) + · · · + P (En) = 1.
(3) P (Ei or Ej ) = P (Ei) + P (Ej ). (since the events are mutually exclusive)
Example. A random sample of 1000 12th grade students was taken from
a random sample of Tennessee High Schools. These schools were classified as
poor (P ), average (A), or superior (S). The students were then given a math
achievement test with their scores categorized as low (L), medium (M ), or high
(H). The results are summarized in the following table.
              Quality of H.S.
Score      P      A      S    Total
L        105     60     55      220
M         70    175    145      390
H         25     65    300      390
Total    200    300    500     1000
Note that P , A, and S are mutually exclusive categories, as are L, M , and H.
$$P(A) = \frac{300}{1000} = .3$$
This is called a marginal probability since the numerator is from the margin.
$$P(H \mid P) = \frac{25}{200} = .125$$
(read as the probability of H given P)
This is a conditional probability since a subset of the total group is the denominator (i.e., in the language of sample spaces, the sample space is reduced).
$$P(P \cap H) = \frac{25}{1000} = .025$$
This is called a joint probability.
Computing a probability from other probabilities:
Multiplication Rule.
$$P(A \cap B) = \begin{cases} P(B) \cdot P(A \mid B), & \text{if } P(B) \neq 0 \\ P(A) \cdot P(B \mid A), & \text{if } P(A) \neq 0 \end{cases}$$
From our example,
$$P(P) = \frac{200}{1000} = .2, \quad P(H) = \frac{390}{1000} = .39, \quad P(P \mid H) = \frac{25}{390} = .0641.$$
Therefore
$$P(P \cap H) = \begin{cases} P(H) \cdot P(P \mid H) = (.39)(.0641) = .0250 \\ P(P) \cdot P(H \mid P) = (.2)(.125) = .0250 \end{cases}$$
Alternatively, a definition of conditional probability is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
In our example,
$$P(P \mid H) = \frac{P(P \cap H)}{P(H)} = \frac{.025}{.39} = .0641.$$
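The marginal, conditional, and joint probabilities above can be recomputed from the table counts. The sketch below is one way to do it; the dictionary layout and the function names are our own choices.

```python
# Counts from the table: rows are scores (L, M, H), columns are school quality (P, A, S).
counts = {
    "L": {"P": 105, "A": 60, "S": 55},
    "M": {"P": 70, "A": 175, "S": 145},
    "H": {"P": 25, "A": 65, "S": 300},
}
total = sum(sum(row.values()) for row in counts.values())   # all students in the sample

def p_quality(q):
    """Marginal probability of a school-quality column."""
    return sum(row[q] for row in counts.values()) / total

def p_score(s):
    """Marginal probability of a score row."""
    return sum(counts[s].values()) / total

def p_joint(s, q):
    """Joint probability of score s and school quality q."""
    return counts[s][q] / total

def p_score_given_quality(s, q):
    """Conditional probability of score s within quality column q."""
    return counts[s][q] / sum(row[q] for row in counts.values())

def p_quality_given_score(q, s):
    """Conditional probability of quality q within score row s."""
    return counts[s][q] / sum(counts[s].values())

# Multiplication rule, computed both ways; each should equal the joint probability.
via_h = p_score("H") * p_quality_given_score("P", "H")
via_p = p_quality("P") * p_score_given_quality("H", "P")
print(p_quality("A"), p_score_given_quality("H", "P"), p_joint("H", "P"), via_h, via_p)
```

Working from the raw counts avoids the small rounding error that appears when .0641 is used in place of 25/390.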
Definition. Events A and B are mutually exclusive if
P (A ∩ B) = 0.
Addition Rule.
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
If A and B are mutually exclusive,
P (A ∪ B) = P (A) + P (B).
[Figure: Venn diagram of events A and B]
In our example,
P (P ∪ H) = P (P ) + P (H) − P (P ∩ H) = .2 + .39 − .025 = .565.
Definition. Events A and B are independent if
P (A|B) = P (A),
P (B|A) = P (B), or
P (A ∩ B) = P (A) · P (B),
provided P (A) ≠ 0 and P (B) ≠ 0.
Note.
(1) If any one of the above 3 conditions is true, all three are true.
(2) If two events are independent, knowing whether one has occurred gives no
information about the other.
In our example, are M and A independent?
$$P(A) = \frac{300}{1000} = .3 \quad \text{and} \quad P(A \mid M) = \frac{175}{390} = .448718.$$
Thus these events are not independent.
Example. In a high school class, 24 of 60 girls and 16 of 40 boys wear
glasses. For this class, are the events E of wearing glasses and B of being a boy
independent?
$$P(E) = \frac{24 + 16}{60 + 40} = \frac{40}{100} = .4 \quad \text{and} \quad P(E \mid B) = \frac{16}{40} = .4$$
Thus wearing glasses and being a boy are independent.
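Both independence checks compare P(A) with P(A|B), which is easy to script. The following is a minimal sketch; the helper name `independent` and the tolerance are our own choices, and with real sample data an exact match is rare, so a formal test would normally be used instead.

```python
import math

def independent(p_a, p_a_given_b, tol=1e-9):
    """Return True when P(A|B) equals P(A) up to a small numerical tolerance."""
    return math.isclose(p_a, p_a_given_b, abs_tol=tol)

# School example: A = average school, M = medium score.
p_A = 300 / 1000
p_A_given_M = 175 / 390
school_independent = independent(p_A, p_A_given_M)

# Glasses example: E = wears glasses, B = is a boy.
p_E = (24 + 16) / (60 + 40)
p_E_given_B = 16 / 40
glasses_independent = independent(p_E, p_E_given_B)

print(school_independent, glasses_independent)
```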
Definition. Two mutually exclusive events are complementary if the sum
of their probabilities is 1. We denote the complement of an event A by $\bar{A}$.
$$P(A) + P(\bar{A}) = 1 \quad \text{or} \quad P(\bar{A}) = 1 - P(A).$$
In our example,
$$P(P) + \underbrace{P(A) + P(S)}_{= P(A \cup S) \text{ since } A \cap S = \varnothing} = P(P) + P(A \cup S) = .2 + (.3 + .5) = 1,$$
so $\bar{P} = A \cup S$,
$$P(\bar{P}) = 1 - P(P) = 1 - .2 = .8, \quad \text{and}$$
$$P(\bar{P}) = P(A \cup S) = P(A) + P(S) = .3 + .5 = .8$$
2. Shape of the Data and Histograms
The following data are the numbers of cycles to failure of aluminum test coupons
subjected to repeated alternating stress at 21,000 psi, 18 cycles per second.
Aluminum test coupons are small wafers of aluminum used for testing purposes.
What can you say about this data? What questions are raised? How can we
make this data easier to understand?
Give it an order. How do we do this?
TI-83 or 84 Plus
Go to STAT>1:Edit and enter the data in L1. Then hit STAT>2:SortA(
followed by 2nd 1 ) to get SortA(L1) and then hit ENTER to get the ordered
list.
Excel
Open the file coupon.xls which contains the data in Column A. Copy this
data to Column B and then select Column B. Choose Data>Sort... If you
get a Sort warning, choose Continue with the current selection and
then choose Sort by Column B Ascending. Column B is now sorted. Save
this file as coupons.xls. The sorted data is at the top of the next page.
Now what can we say about this data? Are any new questions raised? How
can we make this data yet easier to understand?
Histograms
We group the data into bins and create a histogram. Although the individual
items lose their identity and are given the value of the bin (the midpoint of the
endpoints of the bin), we are able to get a better idea of the overall shape of
the data.
TI-83 or 84 Plus
Go to Y= and clear any functions there, and then choose STAT PLOT by hitting
2nd Y=. Choose 1:Plot1 and make sure it is on. Choose the histogram
for type, L1 for Xlist, and 1 for Freq.
Then hit GRAPH to get the histogram above. The shape of the histogram
depends on the settings for Xmin, Xmax, and Xscl (the width of the bins).
The first bin goes from 300-500, the second from 500-700, etc. If you TRACE
through the graph, you are shown the minimum and maximum values for each
bin along with the number n of data points in the bin. Note that a data point
that falls on a border is always put in the bin on the right.
Now what can we say about our data?
(1) The value in the far left bin appears to be an outlier. How should it be
considered?
(2) Except for the outlier, the values appear to be contiguous.
(3) The data appears to be skewed to the left. (Is it really?) What is likely
larger here, the mean or the median?
(4) The area under the histogram is 200 × 70 = 14000. Each of the 70 elements
contributes an area of 200. The area of the fourth bin from the right (or bin 7)
is 14 × 200 = 2800. This means that the probability of an element being in bin
7 is
$$P(x \text{ in bin 7}) = \frac{2800}{14000} = .2.$$
(5) The value of each element in a bin is the midpoint of the bin boundaries.
For this histogram, the bin values are 400, 600, 800, 1000, 1200, 1400, 1600,
1800, 2000, and 2200.
(6) What else?
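The binning rules described above (bins of width 200 starting at 300, boundary values going to the bin on the right, and bin values equal to the midpoints) can be sketched in a few lines. The data list here is hypothetical and is not the coupon data; the function name `bin_counts` is also our own.

```python
def bin_counts(data, lower, width, n_bins):
    """Count values per bin; a value on a boundary goes to the bin on its right."""
    counts = [0] * n_bins
    for x in data:
        k = int((x - lower) // width)   # index of the half-open bin [lower + k*width, lower + (k+1)*width)
        if 0 <= k < n_bins:             # values outside the window are ignored
            counts[k] += 1
    return counts

# Hypothetical values chosen to include boundary cases (500, 700, 900, 1100).
sample = [375, 500, 640, 700, 880, 900, 1010, 1100, 1250, 1499]
lower, width, n_bins = 300, 200, 10

counts = bin_counts(sample, lower, width, n_bins)
midpoints = [lower + width * (k + 0.5) for k in range(n_bins)]   # the value assigned to each bin
total_area = width * len(sample)                                 # each element contributes an area of `width`
probabilities = [width * c / total_area for c in counts]         # bin area divided by total area

print(counts)
print(midpoints)
print(probabilities)
```

The probability of a bin is its area divided by the total area, which reduces to the bin count divided by the number of elements.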
Now let’s reset our window values to the values below, and view the resulting
histogram.
In this view of the data, an outlier is not obvious. The graph clearly appears
skewed to the left. What else do you see?
Excel
From the top of Column C enter the increasing boundary values of our
bins: 300, 500, 700, 900, 1100, 1300, 1500, 1700, 1900, 2100, and 2300. Choose
Tools>Data Analysis... If Data Analysis is not under the Tools menu,
choose Tools>Add-Ins..., select Analysis ToolPak, and hit OK. When
the Data Analysis window opens, scroll down and choose Histogram. Fill
in the window that opens as below and then hit OK.
Along with a table giving the number of entries in each bin, we get the following
graph.
But this is not a histogram. It is a bar chart since the bars do not touch. It is
also missing the area concepts discussed earlier that give rise to probabilities. I
suggest reading the article at the following URL:
http://www.stat.uiowa.edu/~jcryer/JSMTalk2001.pdf
The article ends with “Friends don’t let friends use Excel for statistics.”
Shodor Interactivate
Shodor is a national resource for computational science education. Its web site
is located at
http://www.shodor.org/
Under the Activities and Lessons menu, choose Interactivate, which
leads you to
http://www.shodor.org/interactivate/
This site contains a wealth of Java-based courseware. Click on Tools under
Learners, and then click on Statistics. Then choose the applet named Histogram.
For Select a data set, choose My Data. Click Clear to clear previous
data. Copy the data from Column A of coupons.xls, and paste it into the
open window. Then click Update Data. Now enter 200 for Interval Size
and then click Update Interval.
But there are two issues with this applet. We are not able to choose the lower
boundary of the first bin. Also, data elements landing on the boundary of two
bins are placed in the bin on the left. You can see this here by clicking on Show
Frequency Table.
Rice Virtual Lab in Statistics (http://onlinestatbook.com/rvls.html)
The Rice Virtual Lab in Statistics is another online treasure trove for teaching
statistics. Go to
http://onlinestatbook.com/stat_sim/histogram/index.html
From the drop-down menu, choose Enter Data. Again, copy our data into
the window that opens. Then click Accept Data. Enter 200 for Bin Width
and click OK. Enter 300 for Lower Limit of First Bin and click OK.
Notice that the numbers along the horizontal axis are now the bin midpoints
rather than the bin boundaries.
3. Center, Spread and Box Plots
Now that we have studied the shape of the data, we find statistics to describe
the center and spread of the data.
TI-83 or 84 Plus
With our data in L1, choose STAT>CALC>1:1-Var Stats and press ENTER.
You get the window on the left below. The mean $\bar{x}$ is used to describe the center
of the distribution, while the sample standard deviation Sx or the population
standard deviation σx is used to describe the spread.
Pressing the down arrow several times then gives the window on the right with
the five number summary
minX - Q1 - Med - Q3 - maxX,
which describes the center (the median) and the spread (the other 4 numbers).
Since we have an even number of elements, 70, the median Med is the average
of the 35th and 36th elements. The 1st-quartile Q1 is the median of the 35
elements less than the median and the 3rd-quartile Q3 is the median of the
35 elements greater than the median. Note that there are several methods
commonly in use for computing the quartiles. So the five-number summary
here is
375 - 1102 - 1436.5 - 1730 - 2265.
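The quartile rule described above (Q1 and Q3 as the medians of the values below and above the median) can be sketched as follows. The data list is hypothetical, not the coupon data, and the function name `five_number_summary` is our own. Other software may use a different quartile method and give slightly different values.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), taking Q1 and Q3 as medians of the lower and upper halves."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]               # values below the median position
    upper = xs[half + (n % 2):]     # skip the middle element when n is odd
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

sample = [2, 4, 4, 5, 7, 8, 9, 11]   # hypothetical data with an even number of elements
summary = five_number_summary(sample)
print(summary)
```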
We view these numbers graphically with a box plot.
We return to Plot1 under STAT PLOT and choose Boxplot under Type.
This is the middle entry on the second row. Then hit GRAPH to get the box
plot above.
Going to TRACE, we see that the bar in the middle of the box represents the
median, the center of this distribution. The box itself goes from Q1 to Q3 and
illustrates the interquartile range IQR = Q3 − Q1, the distance between the 1st
and 3rd quartiles, which contains the middle 50% of the data. This is a measure
of spread which is not affected by outliers (values more than 1.5 × IQR from
the box) or extreme values (values more than 3 × IQR from the box). A second
measure of spread is the range = maxX − minX.
The TIs also have a modified box plot, ModBoxplot, where the whiskers extend
to the last element a maximum of 1.5 × IQR from the box and all outliers and
extreme values are marked with your choice of a box, plus, or point. Often, in
other software, separate marks are used for extreme values. To illustrate the
modified boxplot, we add the value 100 to our data set so as to get an outlier in
this sense. Redo the 1-Var Stats and then press GRAPH and TRACE through
the box plot.
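The 1.5 × IQR and 3 × IQR cutoffs can be computed directly from the quartiles. The sketch below uses Q1 = 1102 and Q3 = 1730 from the five-number summary above; the function name `fences` is our own. Note that adding the value 100 to the data shifts the quartiles slightly, so the fences after re-running 1-Var Stats will differ a little from these.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs lying k * IQR outside the box."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

q1, q3 = 1102, 1730                       # quartiles from the five-number summary
low, high = fences(q1, q3)                # outlier cutoffs (1.5 x IQR)
ext_low, ext_high = fences(q1, q3, 3.0)   # extreme-value cutoffs (3 x IQR)

def is_outlier(x):
    """True when x lies more than 1.5 x IQR from the box."""
    return x < low or x > high

print(low, high, ext_low, ext_high)
print(is_outlier(375), is_outlier(100))
```

With these quartiles the original minimum of 375 lies inside the fences, while the added value 100 falls below the lower fence.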
Excel
Choose Tools>Data Analysis... When the Data Analysis window opens,
scroll down and choose Descriptive Statistics. Fill in the window that
opens as below and then hit OK to get the statistics.
Note that the only standard deviation given is the sample standard deviation.
To get the quartiles for the five number summary, type First Quartile in
D19 and Third Quartile in D20. Then type =QUARTILE(A1:A70,1) followed by Enter in E19 and =QUARTILE(A1:A70,3) followed by Enter in
E20 to get:
Shodor Interactivate
Go to
http://www.shodor.org/interactivate/activities/BoxPlot/
For Select a data set, choose My Data. Click Clear to clear previous
data. Copy the data from Column A of coupons.xls, and paste it into the
open window. Then click Update Boxplot. You can use the horizontal
slider to isolate any outliers.
4. Sampling Distribution of the Sample Proportion
Reese’s Pieces Simulation
Reese’s Pieces come in 3 colors – yellow, orange, and brown. The proportions of
these colors are evidently a trade secret. Suppose the actual proportion of orange
pieces is .45. The following applet simulates taking same-sized random samples
of Reese’s Pieces, counting the number of orange ones in each, and then plotting
the proportion on a number line. The applet is located at
http://statweb.calpoly.edu/chance/applets/Reeses/ReesesPieces.html
This is one of several applets at
http://www.rossmanchance.com/applets/
designed by Allan Rossman and Beth Chance, two prominent statistics educators.
The current applet opens with the following screen.
Hit the handle of the candy machine to turn it. 25 candies come out and are
sorted into bins by color. The number $\hat{p}$ is the proportion of orange pieces. The
number π = .45 is indicated on the graph.
Now press Reset, deselect Animate, and change num samples to 100. Then
click on Draw Samples nine times for a total of 900 samples. Click Plot
Normal Curve and you should see that our distribution approximates a normal
distribution. Click Count Samples, fill in as below, and click OK.
You get the following graph.
We see that, since the standard deviation for the distribution here is roughly
.1, 68.4% of the samples have proportions within 1 standard deviation of the
mean. Click Count Samples off and on, and now choose between .25 and
.65. For these drawings, we have that 94.8% of our samples have proportions
within two standard deviations of the mean. Again, click Count Samples off
and on, and now choose between .15 and .75. For these drawings, we have
that 99.8% of our samples have proportions within three standard deviations
of the mean. Notice how these values of
68.4 — 94.8 — 99.8
match
68.3 — 95.4 — 99.7,
the corresponding percents for the normal curve.
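The applet's experiment can be imitated in a few lines. This is a sketch under the assumption that each candy is orange independently with probability .45; the seed and the names `simulate_proportions` and `share_within` are our own, and the percentages from any one run will differ somewhat from the applet's.

```python
import math
import random

def simulate_proportions(pi, n, n_samples, rng):
    """Return n_samples sample proportions, each computed from n independent draws."""
    return [sum(rng.random() < pi for _ in range(n)) / n for _ in range(n_samples)]

rng = random.Random(7)                 # arbitrary fixed seed for repeatability
pi, n = 0.45, 25
props = simulate_proportions(pi, n, 900, rng)

sd = math.sqrt(pi * (1 - pi) / n)      # theoretical standard deviation of the sample proportion
mean_prop = sum(props) / len(props)

def share_within(k):
    """Fraction of simulated proportions within k standard deviations of pi."""
    return sum(abs(p - pi) <= k * sd for p in props) / len(props)

print(round(sd, 4), round(mean_prop, 4))
print([round(share_within(k), 3) for k in (1, 2, 3)])
```

The three shares should land near the normal-curve percentages, though not exactly on them, since the sample proportion only takes values in steps of 1/25.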
5. The Normal Distribution
We have been using normal curves. Let’s take a closer look at them.
Definition. A normal curve with mean µ and standard deviation σ is the graph of the function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Thus normal curves are completely determined by their mean and standard
deviation.
For the three normal curves above, one has a mean of 70 and a standard
deviation of 5, another has a mean of 70 and a standard deviation of 10, and
the third has a mean of 50 and a standard deviation of 10. Which is which?
Every normal curve extends from −∞ to ∞ on the horizontal axis with the
area under the curve always equal to one.
Consider the normal curve below with mean µ and standard deviation σ or,
in the case of the standard normal curve, with mean 0 and standard deviation
1. Moving from the peak to the right, every normal curve changes from
concave down to concave up exactly 1 standard deviation from the mean.
[Figure: normal curve with the horizontal axis marked from -3 to 3]
From the above, we can also see that every normal curve with mean µ and
standard deviation σ and variable x can be transformed into a standard normal
curve with variable z by the formula
$$z = \frac{x - \mu}{\sigma}.$$
Similarly, every standard normal curve with variable z can be transformed into
a normal curve with mean µ and standard deviation σ with variable x by the
formula
x = µ + σz.
Suppose we wish to graph the normal curve with mean 20 and standard deviation 5 on the TI. Go to Y1=, clear any functions and turn off any plots and
hit 2nd DISTR>1:normalpdf( for normal probability density function, and
fill in to get Y1=normalpdf(X,20,5), set a window of Xmin=5, Xmax=35,
Xscl=5, Ymin=0, Ymax=.1, and Yscl=.02. Then press GRAPH to get the
graph on the top left of the next page.
If you go to TRACE, you can see the values of this function at different places.
For the standard normal curve, listing the mean and standard deviation is
optional.
Suppose we wish to find P (X ≤ 30). On the HOME screen, press 2nd DISTR>
2:normalcdf( and complete it to
normalcdf(−1 E 99, 30, 20, 5)
to get .977249938. This value is also the area under the curve from −∞ to
30. -1 E 99 is what we enter for −∞, and we use 1 E 99 for ∞. We get the E
(EE on the keyboard) by pressing 2nd EE. To find P (X ≥ 12), use
normalcdf(12, 1 E 99, 20, 5)
to get a probability of .9452007106, which is also the area under the curve
from 12 to ∞. To find P (12 ≤ X ≤ 30), we enter
normalcdf(12, 30, 20, 5)
to get .9224506486. To see that the area under the curve is truly 1, so that
probability = area, enter
normalcdf(−1 E 99, 1 E 99, 20, 5).
For all of the above, we can replace normalcdf( with ShadeNorm( to get a
rounded off version of the same answer. You get ShadeNorm( by pressing
2nd DISTR>DRAW>1:ShadeNorm(. For instance,
ShadeNorm(12, 30, 20, 5)
gives the graph on the right above.
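For readers without a TI, the same normal probabilities can be obtained with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch of an alternative tool, not part of the calculator procedure; the variable names are our own.

```python
from statistics import NormalDist

X = NormalDist(mu=20, sigma=5)           # normal curve with mean 20 and standard deviation 5

p_le_30 = X.cdf(30)                      # P(X <= 30), area from -infinity to 30
p_ge_12 = 1 - X.cdf(12)                  # P(X >= 12), area from 12 to infinity
p_between = X.cdf(30) - X.cdf(12)        # P(12 <= X <= 30)
density_at_mean = X.pdf(20)              # height of the curve at its peak

print(round(p_le_30, 4), round(p_ge_12, 4), round(p_between, 4), round(density_at_mean, 4))
```

No stand-in for ±∞ is needed here, because `cdf` already gives the area from −∞ up to its argument.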
6. Sampling Distribution of the Mean
Researchers take samples from a population to gather information about that
population when it is impossible or impractical to study the population in its
entirety. An applet from the Rice Virtual Laboratory for Statistics at
http://onlinestatbook.com/stat_sim/sampling_dist/index.html
helps to illustrate this.
To begin the applet, press Begin. In the window which opens, fill in the
choices as below.
Click on Animated several times. What is happening here? Click on Clear
Lower 3 and then click on 10,000 to process 10,000 samples at once. What
do we see here?
Now try the other distributions and use your mouse to create your own distribution. What do you see?
Two things seem to be true at this point:
(1) Averages are less variable than individual observations.
(2) Averages are more normal than individual observations.
This leads to the important
Theorem (Central Limit Theorem). Draw a simple random sample of size
n from any population whatsoever with mean µ and standard deviation σ.
When n is large, the sampling distribution of the sample mean $\bar{x}$ is close to
the normal distribution with mean µ and standard deviation $\sigma/\sqrt{n}$.
This theorem is very important in statistical inference, the process of drawing
conclusions about a population from a sample.
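The theorem's claim about the spread of sample means can be checked by simulation. The sketch below assumes an exponential parent population with mean 1 and standard deviation 1, chosen only because it is clearly not normal; the sample size, the number of samples, and the seed are likewise arbitrary.

```python
import math
import random
import statistics

rng = random.Random(11)          # arbitrary fixed seed for repeatability
n, n_samples = 25, 4000          # sample size and number of samples drawn

# Each entry is the mean of one simple random sample of size n from the parent population.
means = [statistics.fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(n_samples)]

observed_mean = statistics.fmean(means)   # should be near the population mean, 1
observed_sd = statistics.stdev(means)     # should be near sigma / sqrt(n)
predicted_sd = 1.0 / math.sqrt(n)         # sigma / sqrt(n) with sigma = 1

print(round(observed_mean, 3), round(observed_sd, 3), predicted_sd)
```

A histogram of `means` would also look far more symmetric than the skewed parent population, which is the second observation made above.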
7. Estimating the Mean of a Population
Suppose a high school has 300 seniors who have taken the SAT and their scores,
along with their gender, are on the next two pages. Assume that the verbal
scores are normally distributed with a standard deviation of 76.59 and that the
scores are ordered according to the Math scores.
To estimate the mean of the verbal scores, we take a simple random sample
(SRS) of 12 of the 300 verbal scores. This is a sample where not only does each
score have an equal chance of being chosen, but each group of 12 scores has an
equal chance of being chosen.
We use the TI to choose 12 random numbers. Press MATH>PRB>5:randInt(
and complete the command to
randInt(1, 300)
and press ENTER 12 times.
We get the numbers
221, 14, 102, 299, 61, 240, 286, 67, 111, 3, 281, 33.
Throw out any numbers that are repeats since 12 distinct numbers are needed.
These numbers correspond to verbal scores of
540, 660, 400, 410, 690, 410, 500, 420, 520, 570, 630, 510.
The mean of this sample is $\bar{x} = 521.67$. This is also our estimate of the
population mean µ. Just how good is this estimate?
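The selection step (12 distinct positions between 1 and 300) can also be done in one call that never produces repeats. This is a sketch using Python's `random.sample`; the seed is arbitrary, and the positions drawn will not be the ones listed above.

```python
import random

rng = random.Random(300)                            # arbitrary fixed seed for repeatability
positions = sorted(rng.sample(range(1, 301), 12))   # 12 distinct positions from 1 to 300
print(positions)
```

Because `sample` draws without replacement, every set of 12 positions is equally likely, which is the defining property of a simple random sample.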
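Consider the following. From the standard normal curve,
$$P(-1.96 \le z \le 1.96) = .95.$$
We denote the standard error of the mean by $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. Then, prior to taking
the sample,
$$P(\mu - 1.96\sigma_{\bar{x}} \le \bar{x} \le \mu + 1.96\sigma_{\bar{x}}) = .95.$$
Subtracting $\mu + \bar{x}$ from each part of the inequality and then multiplying by −1 gives
$$-\bar{x} - 1.96\sigma_{\bar{x}} \le -\mu \le -\bar{x} + 1.96\sigma_{\bar{x}}$$
$$\bar{x} - 1.96\sigma_{\bar{x}} \le \mu \le \bar{x} + 1.96\sigma_{\bar{x}},$$
so after the sample is drawn,
$$P(\bar{x} - 1.96\sigma_{\bar{x}} \le \mu \le \bar{x} + 1.96\sigma_{\bar{x}}) = .95.$$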
We say
$$\bar{x} - 1.96\sigma_{\bar{x}} \le \mu \le \bar{x} + 1.96\sigma_{\bar{x}}$$
is a 95% confidence interval for the population mean µ. This means that for
every 100 times we take such a simple random sample, we have that
$$\bar{x} - 1.96\sigma_{\bar{x}} \le \mu \le \bar{x} + 1.96\sigma_{\bar{x}}$$
is true an average of 95 times.
Similarly, a 99% confidence interval for the population mean µ is
$$\bar{x} - 2.5758\sigma_{\bar{x}} \le \mu \le \bar{x} + 2.5758\sigma_{\bar{x}}$$
Now let’s return to our example. We have
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{76.59}{\sqrt{12}} = 22.11,$$
so for the 95% confidence interval we have
$$521.67 - 1.96(22.11) \le \mu \le 521.67 + 1.96(22.11) \quad \text{or} \quad 478.33 \le \mu \le 565.01,$$
and for the 99% confidence interval we have
$$521.67 - 2.5758(22.11) \le \mu \le 521.67 + 2.5758(22.11) \quad \text{or} \quad 464.72 \le \mu \le 578.62.$$
Obviously, larger samples will give tighter confidence intervals since σx will be
smaller.
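The interval arithmetic can be checked with a small function. This is a sketch using `statistics.NormalDist` from the Python standard library; the function name `z_interval` is our own. It uses unrounded values of the critical value and the standard error, so its endpoints may differ from the hand computation in the second decimal place.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, level):
    """Confidence interval for the mean when the population standard deviation is known."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for level = .95
    margin = z_star * sigma / math.sqrt(n)           # critical value times the standard error
    return xbar - margin, xbar + margin

ci95 = z_interval(521.67, 76.59, 12, 0.95)
ci99 = z_interval(521.67, 76.59, 12, 0.99)
wider_sample = z_interval(521.67, 76.59, 48, 0.95)   # hypothetical: same mean, four times the sample size

print([round(v, 2) for v in ci95])
print([round(v, 2) for v in ci99])
print([round(v, 2) for v in wider_sample])
```

The last call illustrates the closing remark: quadrupling n halves the standard error, so the interval is half as wide.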
These computations are easy to do on the TI-83 and 84 Plus calculators. Press
STAT>TESTS>7:ZInterval. In the window that opens, choose Stats for
Inpt, put in 76.59 for σ, 521.67 for $\bar{x}$, 12 for n, .95 for C-Level, and
with the cursor on Calculate, press ENTER to get a window with the confidence interval.
For a 99% confidence interval, just change C-Level to .99.
If your sample is in a list, say L1, choose Data instead of Stats, put in L1
for List, and 1 for Freq.
There is a good applet illustrating all of this at
http://www.stat.tamu.edu/~west/ph/meanci.html,
one of many at
http://www.stat.tamu.edu/~west/ph/.
The applet we are using is number 14. Put in 12 in the box for n, Normal for
Distribution, 521.67 for Mean, 76.59 for Std.Dev., and then press
Simulate. If you have trouble entering the data, double-click on the number
you wish to replace, then enter the data.
For 100 such samples taken here, we see that 3 of the green bars (changed to
red) do not cover our mean (for the 95% confidence interval), but only one
of these has the extended blue not touch the mean (for the 99% confidence
interval).
If you continue to click on Simulate, the numbers for the cumulative Prop.
contained should approach .95 and .99, respectively, for 95% C and 99%
C.
Finally, the true population mean for our 300 verbal scores is
µ = 513.98.
8. Hypothesis Testing
A hypothesis is a statement about one or more populations, and usually deals
with population parameters, such as means or standard deviations.
A research hypothesis is a conjecture that motivates research.
A statistical hypothesis is stated in such a way that it can be evaluated by
appropriate statistical techniques.
A 6-Step Process for Testing a Hypothesis
(1) Understand the data. We will use the SAT Verbal scores.
(2) Make clear your assumptions. In this case, we are assuming that SAT Verbal
scores are normally distributed with a standard deviation of 76.59.
(3) State the hypotheses. Suppose that we believe that the population mean µ
of the verbal scores is 600.
The null hypothesis H0 is the hypothesis to be tested. It is the hypothesis of
“no change,” and we assess evidence against this hypothesis in an attempt to
discredit it. It is often the hypothesis for which erroneous rejection has the
more serious consequences. In a research situation, it may be the complement
of the conclusion the researcher is trying to make. Our null hypothesis is
H0 : µ = 600.
The alternative hypothesis H1 is what we accept as true if H0 is rejected, since
it is the complement of H0. Our alternative hypothesis is
H1 : µ ≠ 600.
Without knowing which is true, we use sample evidence to accept or reject H0.
            accept H0            reject H0
H0 true     correct decision     Type I error (α)
H0 false    Type II error (β)    correct decision
α = P (Type I error) = P (reject H0|H0 true)
β = P (Type II error) = P (accept H0|H0 false)
Typically, a Type I error is the most serious, and so
α = significance level.
(4) Get the test statistic, where
$$\text{test statistic} = \frac{\text{relevant statistic} - \text{hypothesized parameter}}{\text{standard error of the relevant statistic}}$$
Since we are taking a random sample of size 12 from a normal distribution with
a known standard deviation, we have
$$z = \frac{\bar{x} - 600}{76.59/\sqrt{12}} = \frac{\bar{x} - 600}{22.11}$$
To select our simple random test sample, we again generate 12 different random
integers between 1 and 300, inclusively. We get the numbers
2, 165, 257, 294, 84, 83, 37, 16, 217, 4, 127, 93.
These numbers correspond to verbal scores of
540, 640, 400, 450, 620, 540, 450, 640, 530, 660, 600, 540.
The mean of this sample is $\bar{x} = 550.83$.
Thus
$$z = \frac{550.83 - 600}{22.11} = -2.2239$$
(5) Make a statistical decision. The typical a priori levels of significance are
α = .05 and α = .01. These are the most commonly used significance levels
in published research. Since our alternative hypothesis is two-sided because
it contains ≠, these significance levels are divided equally on both sides of
the mean. The diagram below shows the case for α = .05, with probability
$\alpha/2 = .05/2 = .025$ in each tail. We show the acceptance and rejection regions
below using both the $\bar{x}$ scores and the standardized z scores. The α = .05
acceptance region for the $\bar{x}$ scores is just the 95% confidence interval about
the hypothesized mean µ = 600. Press STAT>TESTS>7:ZInterval and fill
in the window as on the left. Then move the cursor to Calculate and press
ENTER. For the z score acceptance region, do the same except fill in as for the
window on the right.
Since our test statistics of $\bar{x} = 550.83$ and z = −2.2239 are in the rejection
region, we reject the null hypothesis of µ = 600.
The next case we show is for α = .01, with probability $\alpha/2 = .01/2 = .005$ in each
tail.
In this case the test statistics of $\bar{x} = 550.83$ and z = −2.2239 are in the
acceptance region, so there is insufficient evidence to reject the null hypothesis.
(6) Find the p-value, the probability that the test statistic would take a value
as extreme or more extreme than that observed if H0 were true. The smaller
p is, the more evidence there is against H0. In the past p-values were difficult
to compute. That is why researchers used set significance values like α = .05
and α = .01 which did not require them to find p-values. But p-values are now
readily available to us by calculator and computer.
For our problem, take the TI and enter STAT>TESTS>1:Z-Test, fill in the
window as on the left below, move the cursor to DRAW, and press ENTER.
We see that p = .0262. Note that this is smaller than .05 and greater than .01,
causing us to reject the null hypothesis at the α = .05 significance level but not
at the α = .01 significance level.
A more modern approach to hypothesis testing is, instead of using preset significance levels, to find the p-value and then make decisions based on it.
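The two-sided test can be reproduced from the sample itself. This is a sketch using `statistics.NormalDist` from the Python standard library rather than the TI; the variable names are our own, and the unrounded arithmetic may differ from the hand computation in the last digit.

```python
import math
from statistics import NormalDist, fmean

scores = [540, 640, 400, 450, 620, 540, 450, 640, 530, 660, 600, 540]   # the test sample
mu0, sigma = 600, 76.59            # hypothesized mean and known standard deviation

xbar = fmean(scores)
z = (xbar - mu0) / (sigma / math.sqrt(len(scores)))   # test statistic
p_two_sided = 2 * NormalDist().cdf(-abs(z))           # area in both tails beyond |z|

reject_05 = p_two_sided < 0.05
reject_01 = p_two_sided < 0.01
print(round(xbar, 2), round(z, 4), round(p_two_sided, 4), reject_05, reject_01)
```

Comparing the p-value with α gives the same decisions as the acceptance and rejection regions, without having to find the region boundaries first.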
(3′) Suppose we believe that there is no way the population mean could be
above 600 for the SAT Verbal scores. Then we would use a one-sided test.
H0 : µ ≥ 600
H1 : µ < 600
(4′) To find the boundaries of the one-sided rejection region for an α = .05
significance level, we note that we want the boundary to be the point where
the area (probability) to the left of it is .05. On the TI for the $\bar{x}$-score, we can
find this by 2nd DISTR>3:invNorm and then complete this command to
invNorm(.05, 600, 76.59/√(12)).
Upon hitting ENTER, we get $\bar{x} = 563.63$. If we enter
invNorm(.05),
we get z = −1.6449.
(5′) Since our test statistics of $\bar{x} = 550.83$ and z = −2.2239 are in the rejection
region, we reject the null hypothesis of µ ≥ 600.
For α = .01, we use
invNorm(.01, 600, 76.59/√(12))
to get $\bar{x} = 548.57$ and
invNorm(.01)
to get z = −2.3263.
Since our test statistics of $\bar{x} = 550.83$ and z = −2.2239 are in the acceptance
region, we accept the null hypothesis of µ ≥ 600.
(6′) Again, for the p-value, take the TI and enter STAT>TESTS>1:Z-Test,
fill in the window as on the left below, move the cursor to DRAW, and press
ENTER. Notice that we are taking the option < µ0 for µ :.
We see that the p-value p = .0131 is one-half what it was for the two-sided test.
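The one-sided boundaries and p-value can be checked the same way. In this sketch `NormalDist.inv_cdf` from the Python standard library plays the role of invNorm; the function name `lower_tail_cutoffs` and the other variable names are our own.

```python
import math
from statistics import NormalDist

mu0, sigma, n = 600, 76.59, 12
xbar = 550.83                         # sample mean from the test sample
se = sigma / math.sqrt(n)             # standard error of the mean

def lower_tail_cutoffs(alpha):
    """Boundary of the left-tail rejection region, in sample-mean units and in z units."""
    return NormalDist(mu0, se).inv_cdf(alpha), NormalDist().inv_cdf(alpha)

x_crit_05, z_crit_05 = lower_tail_cutoffs(0.05)
x_crit_01, z_crit_01 = lower_tail_cutoffs(0.01)

z = (xbar - mu0) / se
p_one_sided = NormalDist().cdf(z)     # area to the left of the observed z

print(round(x_crit_05, 2), round(z_crit_05, 4))
print(round(x_crit_01, 2), round(z_crit_01, 4))
print(round(z, 4), round(p_one_sided, 4))
```

The sample mean falls below the α = .05 boundary but above the α = .01 boundary, which matches the two decisions above.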
9. Proper Sampling
Statistical methods assume random samples. Ideally, we want the sampled
population to be like the target population.
Types of Sampling
(a) The simple random sample (SRS) is a sample where each set of n individuals has an equal chance of being selected. In general, this is the best method.
Example.
(1) The two SRSs we did earlier.
(b) For a stratified sample, we divide the population into strata by a major
characteristic to ensure representation from each group (male-female, race, etc.).
Example.
(2) Take Verbal scores randomly from 6 males and 6 females.
(3) Take 2 Verbal scores randomly from each column, each of the 6 columns
a stratum based on the SAT Math score.
(4) Take 6 Verbal scores randomly from the first page and 6 from the second
page.
If all the strata have the same number of members, then each individual has
the same chance of having their score chosen.
(c) A multistage random sample is constructed by taking a series of simple
random samples in stages. This type of sampling is often more practical than
simple random sampling for studies requiring “on location” analysis, such as
door-to-door surveys. In a multistage random sample, a large area, such as
a country, is first divided into smaller regions (such as states), and a random
sample of these regions is collected. In the second stage, a random sample of
smaller areas (such as counties) is taken from within each of the regions chosen
in the first stage. Then, in the third stage, a random sample of even smaller
areas (such as neighborhoods) is taken from within each of the areas chosen
in the second stage. If these areas are sufficiently small for the purposes of
the study, then the researcher might stop at the third stage. If not, he or she
may continue to sample from the areas chosen in the third stage, etc., until
appropriately small areas have been chosen.
Example.
(5) Placing the two pages of scores side-by-side, use SRS to choose 6 of the
50 rows of 6 Verbal scores, and then use SRS to pick two Verbal scores from each
of the 6 rows.
(d) Convenience sampling is to choose members of the target population by
taking those most readily available. This is not good sampling, though often
biologists and medical researchers are forced to use this kind of sampling.
Example.
(6) Take the first 12 verbal scores on the sheets.
(7) Choose a first score by SRS, and then also take the next 11. Every
individual has an equal chance of being taken, but groups of 12 do not.
(e) The sampled population is not similar to the target population.
Example.
(8) Take 12 verbal scores by SRS from the first sheet.
(9) Take 12 verbal scores by SRS from the second sheet.
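The first three sampling schemes can be sketched in code. The roster below is hypothetical: 300 records with an id and an alternating gender label standing in for the score sheets, laid out in 50 rows of 6. The function names and the seed are also our own.

```python
import random

rng = random.Random(9)   # arbitrary fixed seed for repeatability

# Hypothetical roster of (id, gender) records, not the actual score sheets.
roster = [(i, "M" if i % 2 else "F") for i in range(1, 301)]

def simple_random_sample(items, k):
    """Every set of k items is equally likely."""
    return rng.sample(items, k)

def stratified_sample(items, key, per_stratum):
    """Split items into strata by key, then take an SRS of fixed size from each stratum."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    return [x for group in strata.values() for x in rng.sample(group, per_stratum)]

def multistage_sample(rows, n_rows, per_row):
    """Stage 1: SRS of rows. Stage 2: SRS of items within each chosen row."""
    chosen_rows = rng.sample(rows, n_rows)
    return [x for row in chosen_rows for x in rng.sample(row, per_row)]

srs = simple_random_sample(roster, 12)
stratified = stratified_sample(roster, key=lambda rec: rec[1], per_stratum=6)
rows = [roster[i:i + 6] for i in range(0, 300, 6)]           # 50 rows of 6 records
multistage = multistage_sample(rows, n_rows=6, per_row=2)

print(len(srs), len(stratified), len(multistage))
```

All three schemes return 12 records, but only the first makes every group of 12 equally likely. The stratified sample always contains exactly 6 of each gender, and the multistage sample can never contain three records from the same row.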