ELEMENTARY STATISTICS Study Guide

Dr. Shinemin Lin
Table of Contents
1. Introduction to Statistics
2. Descriptive Statistics
3. Probabilities and Standard Normal Distribution
4. Estimates and Sample Sizes
5. Hypothesis Testing
6. Correlations and Regression
7. Analysis of Variance
8. Statistical Process Control
Project 1
Collecting Data
Many factors influence the complexity of written text, such as subject matter,
overall length of discussion, choice of vocabulary, and sentence structure. To simplify
the question, I propose to look at the length of sentences in articles written for a
national newspaper and a local paper. You are going to collect sentences from local and
national newspapers and record the length of each sentence (and the complexity of the
sentence).
This project requires you to use random sampling or pseudo-random sampling to
obtain your sample. Concentrate on the following questions:
1. How will you accomplish this?
2. How will you measure your variables with as much reliability and as little bias as
possible?
3. How will you collect your data?
4. Is your data collection plan unbiased?
5. What descriptive statistics do you plan to compute for what variables?
6. What graphs and tables do you plan to display?
7. What inferences do you want to make about your population from the sample you
observe?
Chapter 1 Introduction to Statistics
The word STATISTICS has two basic meanings. We sometimes use this word when
referring to actual numbers derived from data. A second meaning refers to statistics as a
method of analysis.
A statistical study usually consists of data collection, data presentation, data analysis,
and decision-making.
In statistics, we commonly use the terms population and sample. We investigate a sample
to draw conclusions about the population.
A population is the complete collection of elements to be studied.
A sample is a subcollection of elements drawn from the population.
A parameter is a numerical measurement describing some characteristic of a population.
A statistic is a numerical measurement describing some characteristic of a sample.
Nature of Data
• Qualitative data can be separated into different categories that are distinguished by
some nonnumeric characteristic.
• Quantitative data consist of numbers representing counts and measurements.
♦ Discrete data result from either a finite number of possible values or a countable number
of possible values. When data represent counts, they are discrete.
♦ Continuous numerical data result from infinitely many possible values that can be
associated with points on a continuous scale in such a way that there are no gaps
or interruptions. When data represent measurements, they are continuous.
Levels of measurement of Data
The nominal level of measurement is characterized by data that consist of names, labels,
or categories only. The data cannot be arranged in an ordering scheme.
Example. Voter distribution: 45 Democrats, 80 Republicans, 90 Independents.
The ordinal level of measurement involves data that may be arranged in some order, but
differences between data values either cannot be determined or are meaningless.
Example. Voter distribution: 45 low-income voters, 80 middle-income voters, 90 upper-income
voters. An order is determined by 'low, middle, upper'.
The interval level of measurement is like the ordinal level, with the additional property
that meaningful amounts of differences between data can be determined. However, there
is no natural zero starting point.
Example. Temperatures of steel rods: 45 F, 80 F, and 90 F. 90 F is not twice as hot as 45
F.
The ratio level of measurement is the interval level modified to include the natural zero
starting point. For values at this level, differences and ratios are meaningful.
Example. Length of steel rods: 45 cm, 60 cm, and 90 cm. 90 cm is twice as long as 45
cm.
Methods of Sampling
Guidelines of data collection
1. Ensure that sample size is large enough for required purpose.
2. If you are obtaining measurements of some characteristic from people, you will get
better results if you do the measuring instead of asking the subject for the value.
3. When conducting a survey, consider the medium to be used.
4. Ensure that the method used to collect data actually results in a sample that is
representative of the population.
Sampling methods
1. Random sampling: members of the population are selected in such a way that each
has an equal chance of being selected.
2. Stratified sampling: we subdivide the population into at least two different
subpopulations (strata) whose members share the same characteristic (such as gender),
and then we draw a sample from each subpopulation.
3. Systematic sampling: we choose some starting point and then select every kth
element in the population.
4. Cluster sampling: we divide the population area into sections (or clusters), randomly
select a few of those sections, and then choose all the members from the selected
sections.
5. Convenience sampling: we simply use results that are readily available.
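The first four sampling methods above can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered sentences, the two strata, and the ten "pages" used as clusters are all hypothetical, and the fixed seed is there just so the run can be repeated.

```python
import random

def simple_random_sample(population, n, rng):
    """Random sampling: every member has the same chance of selection."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Stratified sampling: draw a random sample from each subpopulation."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def systematic_sample(population, k, start):
    """Systematic sampling: pick a starting point, then every kth element."""
    return population[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """Cluster sampling: pick whole clusters at random, keep all their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(1)                     # fixed seed (assumption, for repeatability)
sentences = list(range(1, 101))            # hypothetical sentence IDs 1..100
strata = {"local": sentences[:50], "national": sentences[50:]}
pages = {f"page{p}": sentences[p * 10:(p + 1) * 10] for p in range(10)}

srs = simple_random_sample(sentences, 10, rng)
strat = stratified_sample(strata, 5, rng)
syst = systematic_sample(sentences, 10, start=3)
clus = cluster_sample(pages, 2, rng)
```

Convenience sampling is left out on purpose, since it involves no selection rule at all.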
Homework:
Chapter 2 Descriptive Statistics
Summarizing Data
When beginning an analysis of a large set of values, we must often organize and
summarize the data by developing tables and graphs. We begin with a frequency table.
A frequency table lists categories (or classes) of scores, along with counts (or
frequencies) of number of scores that fall into each category. The construction of a
frequency table is not very difficult, and many statistics software packages can do it
automatically.
• Lower class limits are the smallest numbers that can actually belong to the
different classes.
• Upper class limits are the largest numbers that can actually belong to the
different classes.
• Class boundaries are the numbers used to separate classes, but without the gaps
created by class limits.
• Class marks are the midpoints of the classes.
• Class width is the difference between two consecutive lower class limits.
Example 1. Construct a frequency table (with 4 classes) for the data
80, 68, 84, 86, 85, 77, 64, 81, 93, 94, 97, 93, 89, 82, 76, 75, 83, 90, 83, 84, 92, 94,
90, 92, 91, 84, 81, 84, 79, 80, 80.
Data | Frequency | Cumulative frequency | Relative frequency (%) | Cumulative percent
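One way to fill in such a table is sketched below. It assumes whole-number scores and classes that start at the smallest score; many textbooks would instead pick a rounder lower limit such as 60, so treat the class limits here as one reasonable choice and not the only answer.

```python
scores = [80, 68, 84, 86, 85, 77, 64, 81, 93, 94, 97, 93, 89, 82, 76, 75,
          83, 90, 83, 84, 92, 94, 90, 92, 91, 84, 81, 84, 79, 80, 80]

def frequency_table(data, num_classes):
    """Rows of (class, frequency, cumulative freq., relative %, cumulative %)."""
    low, high = min(data), max(data)
    # Smallest whole-number class width whose classes cover every score.
    width = (high - low) // num_classes + 1
    rows, cumulative = [], 0
    for i in range(num_classes):
        lower = low + i * width        # lower class limit
        upper = lower + width - 1      # upper class limit
        freq = sum(1 for x in data if lower <= x <= upper)
        cumulative += freq
        rows.append((f"{lower}-{upper}", freq, cumulative,
                     round(100 * freq / len(data), 1),
                     round(100 * cumulative / len(data), 1)))
    return rows

table = frequency_table(scores, 4)
for row in table:
    print(*row, sep=" | ")
```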
Pictures of Data
1. Histograms, built from a frequency table
2. Pie charts, built from a relative frequency table
3. Maps
4. Stem and Leaf Plot
Another interesting way of summarizing data is to use what is called a stem and leaf plot.
To illustrate this procedure, let's consider the grades obtained by two classes as follows.
Class I: 56, 64, 73, 72, 84, 98, 80, 86, 75, 68, 46, 78, 75, 91, 63, 84, 79, 69, 76, 58.
Class II: 99, 81, 50, 64, 76, 63, 71, 78, 81, 92, 87, 79, 74, 60, 68, 92, 84, 86, 65, 78.
The first digit serves as the stem, and the second digit as the leaf. For example, the stem
of 46 in Class I is 4, and the leaf is 6. Likewise, 56, and 58 have stems of 5 and leaves of
6 and 8, respectively.
Stem and Leaf Plots
Class I
Stems | Leaves
4 | 6
5 | 6, 8
6 | 4, 8, 3, 9
7 | 2, 3, 5, 8, 5, 9, 6
8 | 4, 0, 6, 4
9 | 8, 1
Class II
Stems | Leaves
Complete the stem and leaf plot for Class II.
Steps in Making a Stem and Leaf Plot
1. Decide on the number of digits in the data to be listed under stems (one-digit,
two-digit, ...). Usually only one digit is given under leaves and the other digits are listed
under stems.
2. List the stems in a column, from least to greatest.
3. List the remaining digits in each data entry as leaves. (You may wish to order these
data from smallest to largest.)
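The three steps above can be sketched in Python using the Class I grades. The function name is made up for this illustration, and it assumes two-digit scores so that the tens digit is the stem and the units digit is the leaf; leaves are sorted within each stem, which the steps allow but do not require.

```python
from collections import defaultdict

class_1 = [56, 64, 73, 72, 84, 98, 80, 86, 75, 68,
           46, 78, 75, 91, 63, 84, 79, 69, 76, 58]

def stem_and_leaf(data):
    """Group scores by tens digit (stem); the units digit is the leaf."""
    plot = defaultdict(list)
    for value in data:
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    # Stems from least to greatest, leaves sorted within each stem.
    return {stem: sorted(plot[stem]) for stem in sorted(plot)}

plot = stem_and_leaf(class_1)
for stem, leaves in plot.items():
    print(stem, "|", ", ".join(str(leaf) for leaf in leaves))
```

A stem with no scores is simply omitted by this sketch; a hand-drawn plot would normally still list it.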
Example. Construct the stem-and-leaf plot for the data
80, 68, 84, 86, 85, 77, 64, 81, 93, 94, 97, 93, 89, 82, 76, 75, 83, 90, 83, 84, 92, 94,
90, 92, 91, 84, 81, 84, 79, 80, 80.
Measures of Central Tendency
A measure of central tendency is a value at the center or middle of a data set. The
measures we use most are the mean, weighted mean, median, mode, and midrange; we
also look at skewness.
Mean = sum / count: $\bar{X} = \frac{\sum x_i}{n}$
Weighted mean = $\frac{\sum w_i x_i}{\sum w_i}$
Median: The median of a set of scores is the middle value of the sorted data.
Mode: The mode of a data set is the score that occurs most frequently.
Midrange = (highest + lowest)/2
Examples
1. Given the data list 5, 5, 5, 3, 1, 1, 5, 4, 3, and 5, find the mean, mode, and median.
2. How do you find the mean, mode, and median if you are given a frequency table?
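As a rough check on both questions, the sketch below uses Python's standard `statistics` module on the data list, then repeats the calculation from a frequency table. The table is simply the same ten values tallied by hand (an assumption made for the illustration), so both routes should agree.

```python
import statistics

data = [5, 5, 5, 3, 1, 1, 5, 4, 3, 5]

mean = statistics.mean(data)
median = statistics.median(data)        # average of the two middle sorted values
mode = statistics.mode(data)            # most frequent score
midrange = (max(data) + min(data)) / 2

# Same data as a frequency table: score -> frequency.
table = {1: 2, 3: 2, 4: 1, 5: 5}
n = sum(table.values())
table_mean = sum(x * f for x, f in table.items()) / n
table_mode = max(table, key=table.get)  # score with the largest frequency
expanded = [x for x, f in sorted(table.items()) for _ in range(f)]
table_median = statistics.median(expanded)

print(mean, median, mode, midrange)
```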
Skewness: A distribution of data is skewed if it is not symmetric and extends more to
one side than the other.
Skewed to the left (mean < median < mode)
Symmetric (mean = mode = median)
Skewed to the right (mode < median < mean)
Measures of Variation
Range = largest - smallest
Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
SD: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
The deviation of a score is the difference between the score and the mean.
Example
Find the variance and SD of the data 6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7.
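A minimal sketch of this calculation, following the sample formulas with n - 1 in the denominator and cross-checking against Python's `statistics` module:

```python
import math
import statistics

data = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]

n = len(data)
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]
variance = sum(squared_deviations) / (n - 1)   # sample variance s^2
sd = math.sqrt(variance)                       # sample standard deviation s

# The library functions use the same n - 1 definition.
library_variance = statistics.variance(data)
library_sd = statistics.stdev(data)

print(round(mean, 2), round(variance, 4), round(sd, 4))
```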
Calculate SD from a frequency table
Example.
Find the (a) mean and (b) standard deviation of the data described below.
X | Frequency
------------------------------
4 | 10
5 | 7
6 | 3
7 | 2
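With a frequency table, each value is weighted by its frequency. The sketch below applies the sample variance formula in that weighted form; the dictionary layout is just a convenient assumption for holding the table.

```python
import math

table = {4: 10, 5: 7, 6: 3, 7: 2}   # value x -> frequency f

n = sum(table.values())                                   # total number of scores
mean = sum(x * f for x, f in table.items()) / n           # sum(f*x) / n
variance = sum(f * (x - mean) ** 2 for x, f in table.items()) / (n - 1)
sd = math.sqrt(variance)

print(n, round(mean, 4), round(sd, 4))
```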
Range Rule of Thumb
The range is close to 4s, and hence s can be approximated by range/4.
Interquartile Range (IQR)
IQR = Q3 - Q1
Example, The following are 16 grades received on a test, arranged in increasing order.
Find the mean, Q1, Q3, and IQR.
Boxplots display Q1, Median and Q3.
Outlier
An outlier is any data point farther than 1.5 IQRs above Q3 or farther than 1.5 IQRs below
Q1.
Measure of position
The standard score, or z score, is the number of standard deviations that a given value x is
above or below the mean.
z score: $z = \frac{x - \mu}{\sigma}$
Percentile = cumulative percent
Example, Two equivalent IQ tests are given to similar groups, but the tests are designed
with different scales. The statistics for the tests are listed below. Which is better: a
score of 130 on test A or a score of 52 on test B?
Test A: mean = 100, s = 15;
Test B: mean = 40, s = 5.
Solution.
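One way to approach this comparison is to convert each score to a z score, so both are on the same scale. The helper name below is made up for this sketch, and the sample standard deviation s is used in place of σ.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_a = z_score(130, mean=100, sd=15)   # test A
z_b = z_score(52, mean=40, sd=5)      # test B

better = "B" if z_b > z_a else "A"
print(z_a, z_b, better)
```

The score with the larger z score sits farther above its own group's mean.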
Chapter 3 Probability and Standard Normal Distribution
Probability of a single event.
Pr(E) = k/n = number of successes / number of possible outcomes.
Example.
1) If we draw a ball from a bag containing 4 white balls and 6 black balls, what is the
probability of a) getting a white ball?
b) Getting a black ball?
c) Not getting a white ball?
2) A die is rolled. What is the probability that
a) A 4 will result?
b) An odd number will result?
c) A number bigger than 4 will result?
Sample space. A set that contains all possible outcomes of an experiment is called a
sample space. Each element of the sample space is called a sample point, and an event is
a subset of the sample space.
Examples.
1) Write the sample space and all events of the example 2 above.
2) Suppose a coin is tossed 3 times. Construct the sample space for the experiment
and the event of getting at least 2 heads.
3) Ten blank cards are marked with the numbers 1 to 10. An experiment consists of
shuffling the cards and then drawing one card.
a) Determine the sample space for the experiment.
b) How many sample points are in the sample space?
c) What is the event getting a card with an even number?
Pr(event) = #event / # sample points
Addition rule: the probability of obtaining any one of several different and distinct
outcomes equals the sum of their separate probabilities. This form of the addition rule
assumes that the outcomes being considered are mutually exclusive.
Multiplication Rule: the probability of obtaining a combination of independent
outcomes equals the product of their separate probabilities.
Examples
Flip two fair coins. What is the probability of getting
a) Two heads
b) One head and one tail
Example 1.
Draw one card at random from a standard deck of cards. The sample space S is the
collection of the 52 cards. Assume that the probability set function assigns 1/52 to each
of these 52 outcomes. Let A = { x: x is a jack, queen, or king}
B = {x: x is a 9, 10, or jack and x is red},
C = {x: x is a club},
D = {x: x is a diamond, a heart, or a spade}
Find a) P(A), b) P(A and B), c) P(A or B), d) P(C or D), and e) P(C and D)
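Because every card is equally likely, each probability in Example 1 is just a count divided by 52. The sketch below builds the deck as a set of (rank, suit) pairs and uses set operations for "and" and "or"; the rank and suit labels are naming choices made for this illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))            # sample space S, 52 sample points

def prob(event):
    """Pr(event) = number of points in the event / number of sample points."""
    return Fraction(len(event), len(deck))

A = {c for c in deck if c[0] in {"J", "Q", "K"}}
B = {c for c in deck if c[0] in {"9", "10", "J"} and c[1] in {"diamonds", "hearts"}}
C = {c for c in deck if c[1] == "clubs"}
D = deck - C                                 # diamonds, hearts, and spades

p_a = prob(A)
p_a_and_b = prob(A & B)
p_a_or_b = prob(A | B)
p_c_or_d = prob(C | D)
p_c_and_d = prob(C & D)
```

Since A and B overlap, P(A or B) equals P(A) + P(B) - P(A and B), while C and D are mutually exclusive, so their probabilities simply add.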
Example 2.
If P(A) = 0.4, P(B) = 0.5, and P(A and B) = 0.3, find P(A or B), and P(A and B').
Independent and Dependent events
If the occurrence of one event affects the occurrence of the other, the events are said to be
dependent. If the occurrence of one event does not affect the occurrence of the other, the
events are called independent.
If E and F are any two events, then the probability that both events occur, denoted
Pr(E and F), is given by Pr(E and F) = Pr(E)*Pr(F|E), where Pr(F|E) is the probability that F
occurs, given that E has occurred. We call Pr(F|E) a conditional probability.
Examples.
1) Two cards are drawn from a regular deck of cards (without replacement).
a) The probability of first card king is 4/52
b) The probability of second card king given first card king is 3/51
c) The probability of both cards king is 4/52 * 3/51 = 1/221.
d) The probability of both hearts.
e) The probability of a heart at first draw, a club on the second draw.
f) The probability of a heart on the first draw; an ace on the second draw.
2) From a deck of 52 cards two cards are drawn, one after another without replacement.
What is the probability that
(a) the first will be king and the second will be a jack?
(b) the first will be king and the second will be jack in the same suit?
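The multiplication rule for dependent events can be checked by brute force: list every ordered pair of distinct cards and count. The sketch below verifies parts (a) to (c) of the first example; the `prob` helper and the card labels are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import permutations, product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))
draws = list(permutations(deck, 2))      # ordered draws without replacement

def prob(condition):
    """Fraction of equally likely two-card draws that satisfy the condition."""
    hits = sum(1 for first, second in draws if condition(first, second))
    return Fraction(hits, len(draws))

first_king = prob(lambda a, b: a[0] == "K")
both_kings = prob(lambda a, b: a[0] == "K" and b[0] == "K")
# Conditional probability: Pr(F|E) = Pr(E and F) / Pr(E)
second_king_given_first = both_kings / first_king

print(first_king, second_king_given_first, both_kings)
```

The same `prob` helper can be pointed at the remaining parts by changing the condition.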
• Suppose that we are given 20 tulip bulbs that are very similar in appearance and told
that 8 tulips will bloom early, 12 will bloom late, 13 will be red, and 7 will be yellow. If a
bulb is selected at random, find a) the probability that it will produce a red tulip, b)
the probability that it will be red and that it will bloom early.
The normal distribution
If we modify some line graphs to indicate probability rather than frequency, the resulting
graphs closely approximate a smooth, bell-shaped curve called the normal probability
curve.
1. The area under a normal curve is equal to 1.
2. The normal curve is symmetric about a vertical line through the mean of the set of
data.
3. The interval extending from 2 SDs to the left of the mean to 2 SDs to the right of the
mean contains approximately 95% of all data.
4. If x is a data value from a normally distributed set of data, then the probability that x
is greater than a and less than b is the area under the normal curve between a and b.
Finding Probabilities when given z scores
Prob(a < z < b) = the probability that the z score is between a and b.
Prob(z > a) = the probability that the z score is greater than a.
Prob(z < a) = the probability that z score is less than a.
Using Z-distribution Table
Examples
1. Assume that IQ scores are normally distributed with a mean of 100 and a standard
deviation of 15. An IQ score is randomly selected from this population. Find the
indicated probability.
a) P(100 < x < 130)
b) P( x < 125)
c) P( x > 85)
d) P( 85 < x < 115)
2. If IQ scores are normally distributed with a mean of 100 and a standard deviation of
15, find the probability of randomly selecting a person with an IQ score between 100 and
130.
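These areas can also be computed without a table. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later), whose `cdf` method gives the area to the left of a value; the results should agree with the z table up to table rounding.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

p_a = iq.cdf(130) - iq.cdf(100)   # P(100 < x < 130)
p_b = iq.cdf(125)                 # P(x < 125)
p_c = 1 - iq.cdf(85)              # P(x > 85)
p_d = iq.cdf(115) - iq.cdf(85)    # P(85 < x < 115)

for label, p in zip("abcd", (p_a, p_b, p_c, p_d)):
    print(label, round(p, 4))
```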
Finding z scores when given Probabilities
Examples
1. Use the same thermometers with temperature readings that are normally distributed
with a mean of 0 C and a standard deviation of 1 C. Find the temperatures corresponding to
the 95th percentile and the 90th percentile.
2. The Chemco Company, which manufactures car tires, finds that the distances its tires
last are normally distributed with a mean of 35600 miles and a standard deviation of
4275 miles. The manufacturer wants to guarantee the tires so that only 3% will be
replaced because of failure before the guaranteed number of miles. For how many miles
should the tires be guaranteed?
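Going from a probability back to a value is the inverse of the table lookup. A minimal sketch with `statistics.NormalDist.inv_cdf` (standard library, Python 3.8 or later) is shown below; the answers may differ slightly from table-based answers because the table rounds z to two decimals.

```python
from statistics import NormalDist

# Example 1: thermometer readings, mean 0 and standard deviation 1.
thermometer = NormalDist(mu=0, sigma=1)
t95 = thermometer.inv_cdf(0.95)    # value with 95% of the area to its left
t90 = thermometer.inv_cdf(0.90)

# Example 2: tire life; only 3% of tires should fail before the guarantee.
tires = NormalDist(mu=35600, sigma=4275)
guarantee = tires.inv_cdf(0.03)    # mileage with 3% of the area to its left

print(round(t95, 3), round(t90, 3), round(guarantee))
```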
The Central Limit Theorem
As the sample size increases, the sampling distribution of sample means approaches a
normal distribution; that is, the mean of the sample means equals the population mean
($\mu_{\bar{X}} = \mu$), and the standard deviation of the sample means is
$\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
Central Limit Theorem
Given:
1. The random variable x has a distribution with mean µ and standard deviation σ .
2. Samples of size n are randomly selected from this population.
Conclusion:
1. The distribution of the sample means $\bar{X}$ will, as the sample size n increases,
approach a normal distribution.
2. The mean of the sample means will be the population mean µ.
3. The standard deviation of the sample means will be $\sigma/\sqrt{n}$.
Practical Rules Commonly used:
1. For sample size n larger than 30, the distribution of the sample means can be
approximated reasonably well by a normal distribution. The approximation gets
better as the sample size n becomes larger.
2. If the original population itself is normally distributed, the sample means will be
normally distributed for any sample size n.
Example
Assume that the population of human body temperatures has a mean of
98.6 F, as is commonly believed. Also assume that the population standard deviation is
0.62 F. If a sample of size 106 is randomly selected, find the probability of getting a
mean of 98.2 F or lower.
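Solution.
$\mu_{\bar{x}} = \mu = 98.6$
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.62}{\sqrt{106}} = 0.060$
z = (98.2 - 98.6)/0.060 = -6.67
P(z < -6.67) is less than 0.0001, so a sample mean this low would be extremely unlikely.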
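The same calculation can be sketched in Python. Carrying the unrounded standard error gives a z score of about -6.64 instead of -6.67, because the solution above rounds the standard error to 0.060 before dividing; either way the probability is far below 0.0001.

```python
import math
from statistics import NormalDist

mu, sigma, n = 98.6, 0.62, 106
sample_mean = 98.2

standard_error = sigma / math.sqrt(n)        # standard deviation of sample means
z = (sample_mean - mu) / standard_error
p = NormalDist().cdf(z)                      # P(sample mean <= 98.2)

print(round(standard_error, 4), round(z, 2), p)
```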
Chapter 4 Estimates and Sample sizes
A point estimate is a single value used to approximate a population parameter. As an
example, the sample mean $\bar{X}$ is the best point estimate of the population mean µ.
A confidence interval (or interval estimate) is a range (or an interval) of values that is likely
to contain the true value of the population parameter.
A confidence interval is associated with a degree of confidence. The degree of
confidence is the probability 1 - α that the population parameter is contained in the
confidence interval. This probability is often expressed as the equivalent percentage
value. The degree of confidence is also referred to as the level of confidence or the
confidence coefficient.
• Common choices for the degree of confidence are 95% (α = 0.05) and 99% (α =
0.01).
Notation: $z_{\alpha/2}$ = the positive z value that is at the vertical boundary for the area of
$\alpha/2$ in the right tail of the standard normal distribution.
A critical value is the number on the borderline separating sample statistics that are likely
to occur from those that are unlikely to occur. The number $z_{\alpha/2}$ is a critical value with
the property that the area under the curve bounded by $-z_{\alpha/2}$ and $z_{\alpha/2}$ is $1 - \alpha$.
Example
Given a 95% degree of confidence, find the critical value $z_{\alpha/2}$.
When sample data are used to estimate a population mean µ, the margin of error,
denoted by E, is the maximum likely (with probability 1 - α) difference between the
observed sample mean and the true value of the population mean µ:
$E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$
If n > 30, we can replace σ by the sample standard deviation s.
The confidence interval for the population mean µ is $(\bar{X} - E, \bar{X} + E)$.
Example
For a 95% degree of confidence, find the confidence interval for the
population mean given the statistics n = 106, $\bar{X}$ = 98.2, and s = 0.62.
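A sketch of this large-sample interval in Python follows. The critical value comes from `statistics.NormalDist.inv_cdf` instead of the printed table, and s stands in for σ because n > 30; the variable names are chosen for this illustration only.

```python
import math
from statistics import NormalDist

n, sample_mean, s = 106, 98.2, 0.62
confidence = 0.95
alpha = 1 - confidence

z_crit = NormalDist().inv_cdf(1 - alpha / 2)     # z_{alpha/2}, about 1.96
margin = z_crit * s / math.sqrt(n)               # E = z_{alpha/2} * s / sqrt(n)
interval = (sample_mean - margin, sample_mean + margin)

print(round(z_crit, 2), round(margin, 3), [round(v, 2) for v in interval])
```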
Small Sample Cases and the Student t distribution
If n ≤ 30 and the population standard deviation is unknown, then we cannot use the
previous formula to find the confidence interval for the population mean. If this is the
case, we can apply the Student t-distribution.
Student t-distribution
If the distribution of a population is essentially normal, then the distribution of
$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$
is essentially a Student t-distribution for all samples of size n.
Using t-distribution Table
The number of degrees of freedom for a data set corresponds to the number of scores that
can vary after certain restrictions have been imposed on all scores. DF = n - 1.
Important facts of the t-distribution
1. The t-distribution is different for different sample sizes.
2. The t-distribution has the same general symmetric bell shape as the normal
distribution, but it reflects the greater variability that is expected with small samples.
3. The t-distribution has a mean of t = 0.
4. The standard deviation of the t-distribution varies with the sample size, but it is
greater than 1.
5. As n gets larger, the t-distribution gets closer to normal distribution.
When do we use the t-distribution?
(1) The sample is small (n ≤ 30);
(2) σ is unknown; and
(3) The population has a distribution that is essentially normal.
The confidence interval will be $(\bar{X} - E, \bar{X} + E)$, where $E = t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$.
Example. Suppose that we have only the following 10 randomly selected body
temperatures.
98.6, 98.6, 98.0, 99.0, 98.4, 98.4, 98.4, 98.6, 98.4, 98.0
Construct the 95% confidence interval for the mean of all body temperatures. (Assume
that body temperatures are normally distributed.)
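The interval for this small sample can be sketched as follows. Python's standard library has no inverse t-distribution, so the critical value 2.262 is typed in from a standard t table (9 degrees of freedom, 0.05 in two tails); that hard-coded number is the one assumption in the code.

```python
import math
import statistics

temps = [98.6, 98.6, 98.0, 99.0, 98.4, 98.4, 98.4, 98.6, 98.4, 98.0]

n = len(temps)
df = n - 1                         # degrees of freedom
sample_mean = statistics.mean(temps)
s = statistics.stdev(temps)        # sample standard deviation (n - 1)

t_crit = 2.262                     # from a t table: df = 9, 95% confidence (assumption)
margin = t_crit * s / math.sqrt(n) # E = t_{alpha/2} * s / sqrt(n)
interval = (sample_mean - margin, sample_mean + margin)

print(round(sample_mean, 2), round(s, 3), [round(v, 2) for v in interval])
```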
What is the appropriate sample size?
$n = \left[ \frac{z_{\alpha/2}\,\sigma}{E} \right]^2$, rounded up. If we don't know the
population standard deviation, we can use s instead of σ.
Example. We want to estimate the mean weight of plastic discarded by households in
one week. How many households must we randomly select if we want to be 99% sure
that the sample mean is within 0.25 lb of the true population mean? Assume that
σ = 1.10 lb.
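A short sketch of the sample size formula applied to this example is given below. The function name is hypothetical, and the critical value is computed with `statistics.NormalDist` instead of being read from the table.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, margin, confidence):
    """n = (z_{alpha/2} * sigma / E)^2, rounded up to the next whole number."""
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z_crit * sigma / margin) ** 2)

n_households = required_sample_size(sigma=1.10, margin=0.25, confidence=0.99)
print(n_households)
```

Rounding up, never down, keeps the margin of error at or below the requested 0.25 lb.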
Estimating a Population Variance
In a normally distributed population with variance $\sigma^2$, we randomly select independent
samples of size n and compute the sample variance $s^2$ for each sample. The sample
statistic $\chi^2 = (n - 1)s^2/\sigma^2$ has a distribution called the chi-square distribution with DF
= n - 1.
Properties of Chi-square distribution.
1. The Chi-square distribution is not symmetric.
2. The value of chi-square can be zero or positive, but cannot be negative.
3. The chi-square distribution is different for each number of degrees of freedom. As
the number of DF increases, the chi-square distribution approaches a normal
distribution.
Example. Find the critical values of $\chi^2$ that determine critical regions containing an area
of 0.025 in each tail. Assume that the relevant sample size is 10.
• The sample variance $s^2$ is the best point estimate of the population variance $\sigma^2$.
The confidence interval of the population variance is
$\left( \frac{(n-1)s^2}{\chi_R^2}, \frac{(n-1)s^2}{\chi_L^2} \right)$.
Question: What is the confidence interval of population standard deviation?
Example. The following IQ scores are obtained from a randomly selected sample.
85, 91, 93, 99, 103, 111, 115, 122, 92
a) Find the best point estimate of the population variance.
b) Construct a 95% confidence interval estimate of the population standard deviation.
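A sketch of both parts in Python follows. The standard library has no inverse chi-square function, so the two critical values are typed in from a standard chi-square table for 8 degrees of freedom (2.180 on the left, 17.535 on the right); those table values are the assumption here. The interval for the standard deviation is obtained by taking square roots of the variance limits.

```python
import math
import statistics

iq_scores = [85, 91, 93, 99, 103, 111, 115, 122, 92]

n = len(iq_scores)
s2 = statistics.variance(iq_scores)     # best point estimate of sigma^2

# Chi-square table values for df = n - 1 = 8, 95% confidence (assumption).
chi2_left, chi2_right = 2.180, 17.535

variance_interval = ((n - 1) * s2 / chi2_right, (n - 1) * s2 / chi2_left)
sd_interval = tuple(math.sqrt(v) for v in variance_interval)

print(round(s2, 2), [round(v, 2) for v in sd_interval])
```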
Chapter 5 Hypothesis Testing
In the previous chapter we studied how to use sample statistics to estimate values of
population parameters. In this chapter we study how to use sample statistics to test
hypotheses made about population parameters.
In statistics, a hypothesis is a statement that something is true.
Components of a Formal Hypothesis Test
1. The null hypothesis (H0) is a statement about the value of a population parameter,
and it must contain the condition of equality.
2. The alternative hypothesis (H1) is the statement that must be true if the null
hypothesis is false.
Hypothesis testing is not simply a matter of being right or wrong. Different types of
errors can have dramatically different consequences.
• Type I error: The mistake of rejecting the null hypothesis when it is true. This type
of error is not a miscalculation or procedural misstep; it is an actual error that can occur
when a rare event happens by chance. The probability of rejecting the null hypothesis
when it is true is called the significance level; that is, the significance level is the
probability of a type I error. The symbol α is used to represent the significance level.
The values 0.05 and 0.01 are commonly used.
• Type II error: The mistake of failing to reject the null hypothesis when it is false.
The symbol β is used to represent the probability of a type II error.
True State of Nature
Decision | H0 is true | H0 is false
Reject H0 | Type I error | Correct decision
Fail to reject H0 | Correct decision | Type II error
3. Test Statistic: A sample statistic or a value based on the sample data. A test statistic
is used in making decision about rejection of the null hypothesis.
4. Critical Region: The set of all values of the test statistic that would cause us to reject
the null hypothesis.
5. Critical value: The value or values that separate the critical region from the values
of the test statistic that would not lead to rejection of the null hypothesis. The critical
values depend on the nature of the null hypothesis, the relevant sampling
distribution, and the level of significance.
6. Conclusion:
a) Fail to reject the null hypothesis H0
b) Reject the null hypothesis
Example.
• Original claim: A medical researcher claims that the mean body temperature
of healthy adults is not equal to 98.6 F.
• Hypotheses: H0: µ = 98.6; H1: µ ≠ 98.6.
• Significance level: α = 0.05
• Test statistic: $z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma/\sqrt{n}} = \frac{98.2 - 98.6}{0.62/\sqrt{106}} = -6.64$
• Critical region: It consists of values of the test statistic less than z = -1.96 or
greater than z = 1.96.
• Critical values: The critical values are z = -1.96 and z = 1.96.
The following practical considerations may be relevant:
1) For any fixed α, an increase in the sample size n will cause a decrease in β. That is,
a larger sample will lessen the chance that you fail to reject a false null hypothesis.
2) For any fixed size n, a decrease in α will cause an increase in β . Conversely, an
increase in α will cause a decrease in β .
3) To decrease both α and β , increase the sample size.
Summary
Start => Does the original claim contain the condition of equality?
  If the answer is yes => the original claim becomes H0.
    Do you reject H0?
      If yes => There is sufficient evidence to warrant rejection of the claim.
      If no => There is not sufficient evidence to warrant rejection of the claim.
  If the answer is no => the original claim becomes H1.
    Do you reject H0?
      If yes => The sample data support the claim.
      If no => There is not sufficient sample evidence to support the claim.
Two-tailed test: H1 contains ≠
Left-tailed test: H1 contains <
Right-tailed test: H1 contains >
Example. After analyzing 106 body temperatures of healthy adults, a medical researcher
makes a claim that the mean body temperature is less than 98.6 degrees F.
a) Express the claim in symbolic form:
b) Identify the null hypothesis:
c) Identify the alternative hypothesis:
d) Identify this test as being two-tailed, left-tailed, or right-tailed:
e) Identify the type I error:
f) Identify the type II error:
g) Assume that the conclusion is to reject the null hypothesis. State the conclusion
in nontechnical terms.
h) Assume that the conclusion is failure to reject the null hypothesis. State the
conclusion in nontechnical terms.
Testing a claim about a mean: Large Samples
Test statistic for claims about µ when n > 30: $z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma/\sqrt{n}}$
Traditional Method of Hypothesis Testing:
1. Identify the specific claim or hypothesis to be tested and put it in symbolic form
2. Give the symbolic form that must be true when the original claim is false.
3. Of the two symbolic expressions obtained so far, let the null hypothesis H0 be the one
that contains the condition of equality; H1 is the other statement.
4. Select the significance level α based on the seriousness of a type I error. Make α
small if the consequences of rejecting a true H0 are severe. The value 0.05 or 0.01 is
very common.
5. Identify the statistic that is relevant to this test and its sampling distribution.
6. Determine the test statistic, the critical values, and the critical region. Draw a graph
and include the test statistic, critical value(s), and critical region.
7. Reject H0 if the test statistic is in the critical region. Fail to reject H0 if the test
statistic is not in the critical region.
8. Restate this previous decision in simple nontechnical terms.
• Failing to reject H0 is not equivalent to supporting H0.
Example. Using the sample data given at the beginning of the chapter (n = 106, $\bar{X}$ =
98.2, s = 0.62) and a 0.05 significance level, test the claim that the mean body
temperature of healthy adults is equal to 98.6 F. Use the traditional method by following
the procedure outlined above.
The p-value method of testing hypothesis
Many professional articles and software packages use another approach to hypothesis
testing that is based on the calculation of a probability value, or p-value.
A p-value is the probability of getting a value of the sample test statistic that is at least as
extreme as the one found from the sample data, assuming that the null hypothesis is true.
The p-value can be found using Table A3.
P-values measure how confident we are in rejecting a null hypothesis. For example, a
P-value of 0.0002 would lead us to reject the null hypothesis, but it would also suggest that the
sample results are extremely unusual if the claimed value of µ is in fact correct.
The P-value approach uses most of the same basic procedures as the traditional approach, but
steps 6 and 7 are different:
Step 6: Find p-value
Step 7: Report p-value. Some statisticians prefer to simply report the p-value and leave
the conclusion to the reader. Others prefer to use the following decision criterion:
• Reject H0 if the p-value is less than or equal to the significance level.
• Fail to reject H0 if the p-value is greater than the significance level.
If the conclusion is based on the p-value alone, the following guide may be helpful:
• Less than 0.01: Highly statistically significant; very strong evidence against the null
hypothesis.
• 0.01 to 0.05: Statistically significant; adequate evidence against the null hypothesis.
• Greater than 0.05: Insufficient evidence against the null hypothesis.
Example. Use the p-value method to test the claim that the mean body temperature of
healthy adults is equal to 98.6 F. As before, use a 0.05 significance level and the sample
data from previous example.
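Both the traditional method and the p-value method can be sketched for this body temperature example. The code assumes the sample data quoted earlier in the chapter (n = 106, sample mean 98.2, s = 0.62), uses s in place of σ because n > 30, and gets normal areas from `statistics.NormalDist` instead of the table.

```python
import math
from statistics import NormalDist

n, sample_mean, s = 106, 98.2, 0.62
mu_0 = 98.6            # value claimed in H0: mu = 98.6
alpha = 0.05           # significance level; H1: mu != 98.6, so two-tailed

standard_normal = NormalDist()
z = (sample_mean - mu_0) / (s / math.sqrt(n))       # test statistic

# Traditional method: compare z with the critical values -z_crit and +z_crit.
z_crit = standard_normal.inv_cdf(1 - alpha / 2)
reject_traditional = abs(z) > z_crit

# P-value method: area at least as extreme as z, doubled for two tails.
p_value = 2 * standard_normal.cdf(-abs(z))
reject_p_value = p_value <= alpha

print(round(z, 2), round(z_crit, 2), p_value, reject_traditional, reject_p_value)
```

The two decision rules always agree, since the test statistic falls in the critical region exactly when the p-value is at most α.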
Testing a Claim about a Mean: Small samples
If the sample size is small (n ≤ 30), the population standard deviation is unknown, and the
population is essentially normally distributed, then we use the t-distribution to test our
hypothesis.
Test statistic: $t = \frac{\bar{X} - \mu_{\bar{X}}}{s/\sqrt{n}}$
Example. In one part of a test developed by a psychologist, the test subject is asked to
form a word by unscrambling the letters 'ciiatttsss'. Given below are the times (in
seconds) required by 15 randomly selected persons to unscramble the letters. Test the
claim that the mean time is equal to 60 seconds at the 0.05 level of significance.
68.7, 27.4, 26.0, 60.5, 34.6, 61.1, 68.6, 48.4, 43.6, 39.5, 85.3, 26.3, 43.4, 83.7, 68.9.
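A sketch of this small-sample test is shown below. Because the standard library has no t-distribution, the critical value 2.145 is taken from a standard t table (14 degrees of freedom, 0.05 in two tails) and hard-coded; that number is the one assumption in the code.

```python
import math
import statistics

times = [68.7, 27.4, 26.0, 60.5, 34.6, 61.1, 68.6, 48.4,
         43.6, 39.5, 85.3, 26.3, 43.4, 83.7, 68.9]

mu_0 = 60.0                      # H0: mu = 60 seconds; H1: mu != 60
n = len(times)
df = n - 1
sample_mean = statistics.mean(times)
s = statistics.stdev(times)

t = (sample_mean - mu_0) / (s / math.sqrt(n))   # test statistic

t_crit = 2.145                   # t table: df = 14, alpha = 0.05, two tails (assumption)
reject_h0 = abs(t) > t_crit

print(round(sample_mean, 1), round(s, 2), round(t, 3), reject_h0)
```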
Testing a Claim about a Standard Deviation or Variance.
In testing a hypothesis made about a population standard deviation and variance, we
assume that the population has values that are normally distributed.
Test Statistic for testing hypothesis about standard deviation or variance
$\chi^2 = (n - 1)s^2/\sigma^2$, where n = sample size; $s^2$ = sample variance;
and $\sigma^2$ = population variance (given in H0).
Example. With individual lines at its various windows, the Jefferson Bank found that the
standard deviation for normally distributed waiting times on Friday afternoon was 6.2
min. The bank experimented with a single main waiting line and found that for a random
sample of 25 customers, the waiting times have a standard deviation of 3.8 min. Based on
previous studies, we can assume that the waiting times are normally distributed. At the
α = 0.05 significance level, test the claim that a single line causes lower variation among
the waiting times.
Chapter 6 Correlation and Regression
This chapter also involves estimating parameters and testing hypotheses, but the methods
we will use are different because of the very different issue we will be considering: given
paired data, we want to investigate the relationship between the two variables. Specifically,
we want to determine whether there is a relationship between the two variables and, if so,
identify what the relationship is. We begin by considering the concept of correlation.
We also investigate regression analysis.
A correlation exists between two variables when one of them is related to the other in
some way.
Minitab provides a scatter diagram, which is a plot of paired (x, y) data with a
horizontal x-axis and a vertical y-axis. A scatter diagram can sometimes reveal the general
pattern of the paired data.
The linear correlation coefficient r measures the strength of the linear relationship
between the paired x and y values in a sample. Its value is computed by using the
formula
$r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\,s_x s_y}$
r is a sample statistic. We might think of r as a point estimate of the population
parameter, which is the linear correlation coefficient for all pairs of data in the
population.
Example. Using Table 6.1, find the value of the linear correlation coefficient r. (r = 0.842)
Table 6.1
Data from the Garbage Project
x Plastic (lb)   | 0.27  1.41  2.19  2.83  2.19  1.81  0.85  3.05
y Household size | 2     3     3     6     4     2     1     5
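The value r = 0.842 can be checked with a short Python sketch of the first form of the formula for r. The function name `linear_correlation` and the list names are illustrative only.

```python
import math

# Table 6.1: weight of discarded plastic (lb) and household size.
plastic = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]
household = [2, 3, 3, 6, 4, 2, 1, 5]

def linear_correlation(x, y):
    """Linear correlation coefficient r from the sums in the formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = linear_correlation(plastic, household)
print(round(r, 3))
```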
After calculating r, how do we interpret the result?
If r is close to zero, we conclude that there is no significant linear correlation between x
and y. If r is close to -1 or 1, we conclude that there is a significant linear correlation.
Properties of r
1. r is always between -1 and 1.
2. r does not change if all values of either variable are converted to a different scale.
3. r is not affected by the choice of x or y; interchanging the two variables leaves r unchanged.
4. r measures the strength of a linear relationship.
Hypothesis Test of the Significance of r
H0: ρ = 0
H1: ρ ≠ 0
For the test statistic, we use one of the following methods.
Method 1: Test Statistic is t
$t = \dfrac{r - \mu_r}{s_r} = \dfrac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$
Since we assume that ρ = 0, it follows that $\mu_r = 0$. Also, it can be shown that $s_r$, the
standard deviation of linear correlation coefficients, can be expressed as
$\sqrt{(1 - r^2)/(n - 2)}$.
Critical value: Use Table A-3 with degrees of freedom = n-2.
Method 2: Test Statistic is r
Critical values: refer to Table A-6.
Example. Using the sample data in Table 6.1, test the claim that there is a linear
correlation between weights of discarded plastic and household sizes. Use Method 1.
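The t test for this example can be sketched in Python as follows. The sketch starts from the value r = 0.842 reported for Table 6.1 and assumes a critical value of 2.447 (two-tailed, α = 0.05, n - 2 = 6 degrees of freedom) read from a t table; the significance level is also an assumption, since the example does not state one.

```python
import math

r = 0.842   # linear correlation coefficient reported for Table 6.1
n = 8       # number of (x, y) pairs in Table 6.1

# Test statistic under H0 (rho = 0): t = r / sqrt((1 - r^2) / (n - 2))
t_stat = r / math.sqrt((1 - r ** 2) / (n - 2))

# Assumed table value: two-tailed test, alpha = 0.05, df = n - 2 = 6.
t_crit = 2.447
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Under these assumptions the test statistic exceeds the critical value, which supports the claim of a linear correlation.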
Common Errors Involving Correlation
1. We must be careful to avoid concluding that a significant linear correlation between
two variables is proof that there is a cause-effect relationship between them.
2. Another source of potential error arises with data based on rates or averages. Averages
suppress the variation among individuals, which may lead to an inflated correlation coefficient.
3. A third error involves the property of linearity. The conclusion that there is no
significant linear correlation does not mean that x and y are not related in
any way.
Regression
Our goal in this section is to identify the relationship between variables so that we can
predict the value of one variable, given the value of the other variable.
Given a collection of paired sample data, the regression equation
$\hat{Y}_i = b_0 + b_1 X_i$
describes the relationship between the two variables. The graph is
called the regression line, line of best fit, or least-squares line.
$b_1 = \dfrac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$
$b_0 = \bar{Y} - b_1 \bar{X}$
Notation of Regression Equation
                               | Population parameter | Point estimate
-------------------------------------------------------------------------
y-intercept of regression line | β0                   | b0
Slope of regression line       | β1                   | b1
Equation of the line           | y = β0 + β1 x        | ŷ = b0 + b1 x
Example. Using the data in Table 6.1, find the regression equation of the straight line
that relates x and y. (y = 0.549 + 1.48x)
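The coefficients in this example can be reproduced with a short Python sketch of the least-squares formulas for the slope and the y-intercept. The variable names are illustrative.

```python
# Table 6.1: weight of discarded plastic (lb) and household size.
x = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]
y = [2, 3, 3, 6, 4, 2, 1, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: b1 = (sum of x*y - n * x_bar * y_bar) / (sum of x^2 - n * x_bar^2)
b1 = ((sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar)
      / (sum(xi ** 2 for xi in x) - n * x_bar ** 2))

# Intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(f"y = {b0:.3f} + {b1:.2f}x")
```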
Predictions
In predicting a value of y based on some given value of x:
1. If there is not a significant linear correlation, the best predicted y value is $\bar{y}$.
2. If there is a significant linear correlation, the best predicted y value is found by
substituting the x value into the regression equation.
Example. Use the previous regression equation y = 0.549 + 1.48x to predict the size of a
household that discards 2.50 lb of plastic in a week.
Solution.
y = 0.549 + 1.48(2.50) = 4.25
Guidelines for Using the Regression Equation
• If there is no significant linear correlation, don't use the regression equation to make
predictions.
• When using the regression equation for predictions, stay within the scope of the
available sample data.
• A regression equation based on old data is not necessarily valid now.
• Don't make predictions about a population that is different from the population from
which the sample data were drawn.
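The first two guidelines can be built into a small prediction helper. The Python sketch below is one possible way to do it, not a prescribed method: the function name `predict_household_size`, its parameters, and the choice to raise an error outside the sample range are all assumptions made for illustration. The coefficients, the range 0.27 to 3.05 lb, and the mean household size 3.25 come from Table 6.1.

```python
def predict_household_size(plastic_lb, significant=True):
    """Predict household size from discarded plastic (lb), following the guidelines."""
    b0, b1 = 0.549, 1.48        # regression equation y = 0.549 + 1.48x
    x_min, x_max = 0.27, 3.05   # range of x values in Table 6.1
    y_bar = 3.25                # mean household size in Table 6.1

    # Without a significant linear correlation, the best prediction is y_bar.
    if not significant:
        return y_bar
    # Stay within the scope of the available sample data.
    if not (x_min <= plastic_lb <= x_max):
        raise ValueError("x is outside the range of the sample data")
    return b0 + b1 * plastic_lb

print(round(predict_household_size(2.50), 2))
```

Calling the function with 2.50 lb reproduces the worked example, and a value such as 10 lb is refused because it lies far outside the sample data.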
Chapter 7 Analysis of Variance
In Chapter 5 we developed procedures for testing the hypothesis that two population
means are equal. In this chapter we will develop a procedure for testing the hypothesis
that three or more population means are equal.
Analysis of Variance (ANOVA) is a method of testing the equality of three or more
population means by analyzing sample variances.
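ANOVA methods use the F distribution.
Assume that two populations are independent of each other, are normally distributed,
and have equal variances. If $s_1^2$ and $s_2^2$ are the variances of samples of sizes n and m
drawn from them, then
$F = \dfrac{s_1^2}{s_2^2}$ follows an F distribution with degrees of freedom n - 1 and m - 1.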
Properties of F-distributions
1. The F distribution is not symmetric; it is skewed to the right.
2. Values of F can be zero or positive, but they cannot be negative.
3. There is a different F distribution for each pair of degrees of freedom for the
numerator and denominator.
In this chapter we assume that
1. The populations have normal distributions.
2. The populations have the same variance.
3. The samples are random and independent of each other.
One-Way ANOVA with Equal Sample Sizes.
Notation for One-Way ANOVA with Equal Sample Sizes
n = size of each sample
k = number of samples
$s_{\bar{x}}^2$ = variance of the sample means
$s_p^2$ = pooled variance obtained by calculating the mean of the sample variances
H0: µ1 = µ2 = µ3
H1: At least one of the means is different from the others.
The variance between samples (variation due to treatment) is an estimate of $\sigma^2$ based on
the sample means.
Variance between samples = $n s_{\bar{x}}^2$, where $s_{\bar{x}}^2$ = variance of the sample means
The variance within samples (variation due to error) is an estimate of $\sigma^2$ based on the
sample variances. With all samples of the same size n,
Variance within samples = $s_p^2$
= pooled variance obtained by finding the mean of the sample variances.
Test Statistic for One-Way ANOVA with Equal Sample Sizes
$F = \dfrac{n s_{\bar{x}}^2}{s_p^2}$
numerator degrees of freedom = k - 1
denominator degrees of freedom = k(n - 1)
The critical value of F is F(k - 1, k(n - 1)).
Example
Do different age groups have different body temperatures? Table 7-1 lists
the body temperatures of 5 randomly selected subjects from each of 3 different age
groups. Informal examination of the 3 sample means (97.940, 98.580, 97.800) seems to
suggest that the 3 samples come from populations with means that are not significantly
different. In addition to the values of the 3 sample means, however, we should consider
their standard deviations and the sample sizes. We need to conduct a formal hypothesis
test to determine whether the sample means are significantly different. Using a
significance level of 0.05, we will test the claim that the 3 age-group populations have the
same mean body temperature.
Table 7-1
Body Temperature (Categorized by Age)
18 - 20                | 21 - 29                | 30 and older
98.0                   | 99.6                   | 98.6
98.4                   | 98.2                   | 98.6
97.7                   | 99.0                   | 97.0
98.5                   | 98.2                   | 97.5
97.1                   | 97.9                   | 97.3
n1 = 5                 | n2 = 5                 | n3 = 5
$\bar{x}_1$ = 97.940   | $\bar{x}_2$ = 98.580   | $\bar{x}_3$ = 97.800
s1 = 0.568             | s2 = 0.701             | s3 = 0.752
Solution.
Step 1 and Step 2 (omit)
Step 3: H0: µ1 = µ2 = µ3; H1: The three means are not all equal.
Step 4: Significance level = 0.05
Step 5: Because we test the claim that 3 or more population means are equal, we use
ANOVA with an F test statistic.
Step 6: For one-way ANOVA with equal sample sizes, the test statistic is
calculated as follows: $F = n s_{\bar{x}}^2 / s_p^2 = 1.8803$.
The critical value of F = 3.8853 is found by referring to the table for which α =
0.05. The degrees of freedom are as follows:
numerator degrees of freedom = k - 1 = 3 - 1 = 2
denominator degrees of freedom = k(n - 1) = 3(5 - 1) = 12
Step 7: Because the test statistic of F = 1.8803 does not fall in the critical region bounded
by F = 3.8853, we fail to reject the null hypothesis that the 3 means are equal.
Step 8: There is not sufficient evidence to warrant rejection of the claim that the 3
populations of different age groups have the same mean body temperature.
Perhaps there really is a difference, but the sample size is too small and/or the
sample differences are not large enough to justify that conclusion.
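The test statistic in Step 6 can be checked with a short Python sketch that works from the raw body temperatures. The variable names are illustrative, and the critical value 3.8853 is the table value quoted in the solution. Because the sketch uses unrounded sample variances, its F value can differ from 1.8803 in the third or fourth decimal place.

```python
import statistics

# Body temperatures by age group (Table 7-1).
groups = [
    [98.0, 98.4, 97.7, 98.5, 97.1],   # 18 - 20
    [99.6, 98.2, 99.0, 98.2, 97.9],   # 21 - 29
    [98.6, 98.6, 97.0, 97.5, 97.3],   # 30 and older
]

k = len(groups)       # number of samples
n = len(groups[0])    # common sample size

sample_means = [statistics.mean(g) for g in groups]
sample_variances = [statistics.variance(g) for g in groups]

# Variance between samples: n times the variance of the sample means.
variance_between = n * statistics.variance(sample_means)
# Variance within samples: pooled variance = mean of the sample variances.
variance_within = statistics.mean(sample_variances)

f_stat = variance_between / variance_within
f_crit = 3.8853       # table value for alpha = 0.05, df = (k - 1, k(n - 1)) = (2, 12)
reject_h0 = f_stat > f_crit

print(f"F = {f_stat:.4f}, reject H0: {reject_h0}")
```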
One-Way ANOVA with Unequal Sample Sizes
Notation
$\bar{\bar{x}}$ = overall mean (sum of all sample scores divided by the total number of
scores)
k = number of population means being compared
$n_i$ = number of values in the ith sample
N = total number of values in all samples combined ($N = n_1 + n_2 + \cdots + n_k$)
$\bar{x}_i$ = mean of values in the ith sample
$s_i^2$ = variance of values in the ith sample
Using the preceding notation, we can now express the test statistic as follows:
F = (variance between samples) / (variance within samples)
$= \dfrac{\sum n_i(\bar{x}_i - \bar{\bar{x}})^2 / (k - 1)}{\sum (n_i - 1)s_i^2 / \sum (n_i - 1)}$
The numerator is really a form of the formula $s^2 = \sum (x - \bar{x})^2 / (n - 1)$ for a sample variance.
Key components in our ANOVA method are listed below.
SS(total) = total sum of squares
= a measure of the total variation (around the overall mean) in all of the
sample data combined
$= \sum (x - \bar{\bar{x}})^2$ = SS(treatment) + SS(error)
SS(treatment) = a measure of the variation between the sample means
= SS(between groups)
$= \sum n_i(\bar{x}_i - \bar{\bar{x}})^2$
SS(error) = sum of squares representing the variability that is assumed to be
common to all the populations being considered
$= \sum (n_i - 1)s_i^2$
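The three sums of squares can be sketched in Python using their standard definitions. The function name `anova_sums_of_squares` is illustrative. The sketch is run on the body temperature data of Table 7-1 only to show that SS(total) = SS(treatment) + SS(error); the same code works when the samples have different sizes.

```python
def anova_sums_of_squares(groups):
    """Return (SS(treatment), SS(error), SS(total)) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)   # overall mean

    ss_treatment = 0.0
    ss_error = 0.0
    for g in groups:
        mean_i = sum(g) / len(g)
        # Variation of this sample mean around the overall mean, weighted by n_i.
        ss_treatment += len(g) * (mean_i - grand_mean) ** 2
        # Within-sample squared deviations; this equals (n_i - 1) * s_i^2.
        ss_error += sum((v - mean_i) ** 2 for v in g)

    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    return ss_treatment, ss_error, ss_total

# Body temperatures from Table 7-1.
body_temps = [
    [98.0, 98.4, 97.7, 98.5, 97.1],
    [99.6, 98.2, 99.0, 98.2, 97.9],
    [98.6, 98.6, 97.0, 97.5, 97.3],
]
ss_treatment, ss_error, ss_total = anova_sums_of_squares(body_temps)
print(round(ss_treatment, 4), round(ss_error, 4), round(ss_total, 4))
```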
Example. Table 7-2 includes sample data with movie lengths arranged according to the
numbers of stars the movies were given. Use the data in Table 7-2 to find the values of
SS(treatment), SS(error), and SS(total).
Table 7-2 Lengths (in minutes) of Movies Categorized by Star Ratings
Poor (0.0 - 1.5 Stars): 105 108 96 91
Fair (2.0 - 2.5 Stars): 110 114 98 100 96 123 101 92 155 92 155 92 99 100 108
Good (3.0 - 3.5 Stars): 93 123 115 97 133 104 94 82 94 98 106 107 93 95 129 94 102
  117 90 104 102 117 90 104 104 119 105 96 139 134 111 100 111
Excellent (4.0 Stars): 72 120 120 104 159 125 103 160 193 168 193 168 88 121 144 90
Solution
k = 4 (number of samples)
mean of all 60 sample scores = 6630/60 = 110.5000
SS(treatment) $= \sum n_i(\bar{x}_i - \bar{\bar{x}})^2$ = 4113.1122
SS(error) $= \sum (n_i - 1)s_i^2$ = 25466.0514
SS(total) $= \sum (x - \bar{\bar{x}})^2$ = 29579.1636
SS(treatment) and SS(error) are both sums of squares, and if we divide each by its
corresponding number of degrees of freedom, we get mean squares, as defined below.
MS(treatment) is a mean square for treatment, obtained as follows:
MS(treatment) = SS(treatment)/(k-1)
MS(error) is a mean square for error, obtained as follows:
MS(error) = SS(error)/(N - k)
MS(total) is a mean square for the total variation, obtained as follows:
MS(total) = SS(total)/(N-1)
Example.
Use the sample in Table 7-2 to find the values of MS(treatment),
MS(error), and MS(total).
Solution.
MS(treatment ) = SS(treatment)/(k-1) = 4113.1122/(4 - 1) = 1371.0374
MS(error) = SS(error) / (N - k) = 25466.0514/(60 - 4) = 454.7509
MS(total) = SS(total) /(N - 1) = 29579.1636/(60 - 1) = 501.3418.
Test Statistic for ANOVA with Unequal Sample Sizes
H0: All means are equal.
H1: The means are not all equal.
The test statistic is
F = MS(treatment)/MS(error)
The critical value = F(k - 1, N - k)
Example. Are bad movies as long as good movies, or does it just seem that way? Refer
to the sample data given in Table 7-2. Examination of summary statistics seems to
suggest that there are differences in the mean length of movies, with movies rated as
excellent tending to be longer. But are those differences significant? Test the claim that
the 4 categories of movies have the same mean length. That is, test the claim that
µ1 = µ2 = µ3 = µ4.
Solution
H0: µ1 = µ2 = µ3 = µ4
H1: The preceding means are not all equal.
Significance level = 0.05.
Use the F distribution.
ANOVA Table
Source of Variation | SS         | Degrees of Freedom | MS        | F
Treatments          | 4113.1122  | 3                  | 1371.0374 | 3.0149
Error               | 25466.0514 | 56                 | 454.7509  |
Total               | 29579.1636 | 59                 |           |
Critical Value = 2.7581
Because the test statistic of F = 3.0149 exceeds the critical value F = 2.7581,
we reject the null hypothesis that the means are equal.
There is sufficient sample evidence to warrant rejection of the claim that the 4
population means are equal. It appears the mean movie length is not the same for poor,
fair, good, and excellent movies. It seems that movies rated with 4 stars are longer than
other movies, but we need other methods to formally justify this conclusion.
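The arithmetic behind the ANOVA table can be retraced with a few lines of Python. The sketch starts from the sums of squares reported in the solution instead of recomputing them from the raw movie lengths, and it uses the critical value 2.7581 quoted above; the variable names are illustrative.

```python
# Sums of squares reported for the movie-length data.
ss_treatment = 4113.1122
ss_error = 25466.0514

k = 4     # number of samples (star-rating categories)
N = 60    # total number of values in all samples combined

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_treatment = ss_treatment / (k - 1)
ms_error = ss_error / (N - k)

# Test statistic and decision.
f_stat = ms_treatment / ms_error
f_crit = 2.7581   # critical value quoted in the solution
reject_h0 = f_stat > f_crit

print(f"MS(treatment) = {ms_treatment:.4f}, MS(error) = {ms_error:.4f}, "
      f"F = {f_stat:.4f}, reject H0: {reject_h0}")
```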
Chapter 8 Nonparametric Statistics