CHE357: Experimental Data Analysis
Dr. A.Y Omari Sasu
Kwame Nkrumah University of Science and Technology
Department of Statistics and Actuarial Science
STATISTICS
Statistics is the art of learning from data. It is concerned with the collection of data, their
subsequent description, and their analysis, which often leads to the drawing of conclusions.
Key Concept
In this section we begin with a few very basic definitions, and then we consider an overview of
the process involved in conducting a statistical study. This process consists of “prepare, analyze,
and conclude.” “Preparation” involves consideration of the context, the source of data, and
sampling method. In future chapters we construct suitable graphs, explore the data, and execute
computations required for the statistical method being used. In future chapters we also form
conclusions by determining whether results have statistical significance and practical significance.
Statistical thinking involves critical thinking and the ability to make sense of results. Statistical
thinking demands so much more than the ability to execute complicated calculations. Through
numerous examples, exercises, and discussions, this text will help you develop the statistical
thinking skills that are so important in today’s world.
Basic definitions
A variable is a characteristic or attribute that can assume different values.
Data are collections of observations, such as measurements, genders, or survey responses. (A
single data value is called a datum, a term rarely used.)
Data are the values (measurements or observations) that the variables can assume.
Variables whose values are determined by chance are called random variables.
Statistics is the science of planning studies and experiments; obtaining data; and organizing,
summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions
based on them.
A population is the complete collection of all measurements or data that are being considered.
Typically, a population is the complete collection of data that we would like to make inferences
about.
A census is the collection of data from every member of the population.
A sample is a subcollection of members selected from a population.
Because populations are often very large, a common objective of the use of statistics is to obtain
data from a sample and then use those data to form a conclusion about the population.
Example
In the journal article “Residential Carbon Monoxide Detector Failure Rates in the United States”
(by Ryan and Arnold, American Journal of Public Health, Vol. 101, No. 10), it was stated that
there are 38 million carbon monoxide detectors installed in the United States. When 30 of them
were randomly selected and tested, it was found that 12 of them failed to provide an alarm in
hazardous carbon monoxide conditions. In this case, the population and sample are as follows:
Population: All 38 million carbon monoxide detectors in the United States
Sample: The 30 carbon monoxide detectors that were selected and tested
The objective is to use the sample data as a basis for drawing a conclusion about the population of
all carbon monoxide detectors, and methods of statistics are helpful in drawing such conclusions.
The body of knowledge called statistics is sometimes divided into two main areas, depending on
how data are used. The two areas are
• Descriptive statistics
• Inferential statistics
Descriptive statistics consists of the collection, organization, summarization, and presentation of
data.
In descriptive statistics the statistician tries to describe a situation and present the data in some
meaningful form, such as charts, graphs, or tables.
Inferential statistics consists of generalizing from samples to populations, performing estimations
and hypothesis tests, determining relationships among variables, and making predictions.
The statistician tries to make inferences from samples to populations. Inferential statistics uses
probability, i.e., the chance of an event occurring.
TRY
Determine whether descriptive or inferential statistics were used.
a. The average price of a 30-second ad for the Academy Awards show in a recent year was
1.90 million dollars.
b. The Department of Economic and Social Affairs predicts that the population of Mexico
City, Mexico, in 2030 will be 238,647,000 people.
c. A medical report stated that taking statins is proven to lower heart attacks, but some people
are at a slightly higher risk of developing diabetes when taking statins.
d. A survey of 2234 people conducted by the Harris Poll found that 55% of the respondents
said that excessive complaining by adults was the most annoying social media habit.
Types of Data
Key Concept
A major use of statistics is to collect and use sample data to make conclusions about populations.
We should know and understand the meanings of the terms statistic and parameter, as defined
below.
Basic Types of Data
Definitions
A parameter is a numerical measurement describing some characteristic of a population.
A statistic is a numerical measurement describing some characteristic of a sample.
Example
There are 17,246,372 high school students in the United States. In a study of 8505 U.S. high school
students 16 years of age or older, 44.5% of them said that they texted while driving at least once
during the previous 30 days.
1. Parameter: The population size of 17,246,372 high school students is a parameter,
because it is the entire population of all high school students in the United States. If we
somehow knew the percentage of all 17,246,372 high school students who reported they
had texted while driving, that percentage would also be a parameter.
2. Statistic: The sample size of 8505 surveyed high school students is a statistic, because it
is based on a sample, not the entire population of all high school students in the United
States. The value of 44.5% is another statistic, because it is also based on the sample, not
on the entire population.
Quantitative and Qualitative data
Definitions
Quantitative (or numerical) data consist of numbers representing counts or measurements.
Qualitative (or Categorical or attribute) data consist of names or labels (not numbers that
represent counts or measurements).
CAUTION
Categorical data are sometimes coded with numbers, with those numbers replacing names.
Although such numbers might appear to be quantitative, they are actually categorical data.
Example
1. Quantitative Data: The ages (in years) of subjects enrolled in a clinical trial
2. Categorical Data as Labels: The genders (male/female) of subjects enrolled in a clinical
trial
3. Categorical Data as Numbers: The identification numbers 1, 2, 3, ..., 25 are assigned
randomly to the 25 subjects in a clinical trial. Those numbers are substitutes for names.
They don't measure or count anything, so they are categorical data.
Discrete / Continuous
Quantitative data can be further described by distinguishing between discrete and continuous
types.
Discrete data result when the data values are quantitative and the number of values is finite, or
“countable.” (If there are infinitely many values, the collection of values is countable if it is
possible to count them individually, such as the number of tosses of a coin before getting tails.)
Continuous (numerical) data result from infinitely many possible quantitative values, where the
collection of values is not countable. (That is, it is impossible to count the individual items because
at least some of them are on a continuous scale, such as the lengths of distances from 0 cm to 12
cm.)
Example
1. Discrete Data of the Finite Type: Each of several physicians plans to count the number of
physical examinations given during the next full week. The data are discrete data because
they are finite numbers, such as 27 and 46, that result from a counting process.
2. Discrete Data of the Infinite Type: Casino employees plan to roll a fair die until the number
5 turns up, and they count the number of rolls required to get a 5. It is possible that the rolls
could go on forever without ever getting a 5, but the numbers of rolls can be counted, even
though the counting might go on forever. The collection of the numbers of rolls is therefore
countable.
3. Continuous Data: When the typical patient has blood drawn as part of a routine
examination, the volume of blood drawn is between 0 mL and 50 mL. There are infinitely
many values between 0 mL and 50 mL. Because it is impossible to count the number of
different possible values on such a continuous scale, these amounts are continuous data.
The classification of variables can be summarized as follows:
[Diagram: a variable is either qualitative or quantitative; a quantitative variable is either discrete or continuous.]
Try
Classify each variable as a discrete or continuous variable.
a. The number of hours during a week that children ages 12 to 15 reported that they watched
television.
b. The number of touchdowns a quarterback scored each year in his college football career.
c. The amount of money a person earns per week working at a fast-food restaurant.
d. The weights of the football players on the teams that play in the NFL this year.
Levels of Measurement
Another common way of classifying data is to use four levels of measurement: nominal, ordinal,
interval, and ratio, all defined below.
Level of Measurement | Brief Description | Example
Ratio | There is a natural zero starting point and ratios make sense | Heights, lengths, distances, volumes
Interval | Differences are meaningful, but there is no natural zero starting point and ratios are meaningless | Body temperatures in degrees Fahrenheit or Celsius
Ordinal | Data can be arranged in order, but differences either can't be found or are meaningless | Ranks of colleges in U.S. News & World Report
Nominal | Categories only. Data cannot be arranged in order | Eye colors
The nominal level of measurement is characterized by data that consist of names, labels, or
categories only. The data cannot be arranged in some order (such as low to high). An example of
this could be state names, or names of the individuals, or courses by name, gender, race, religion,
or sport. These do not need to be placed in any order. The logical operators are ∪,∩.
Data are at the ordinal level of measurement if they can be arranged in some order, but
differences (obtained by subtraction) between data values either cannot be determined or are
meaningless. The order may be either increasing or decreasing. One example would be income
levels. The data could have numeric values such as 1, 2, 3, or values such as high, medium, or
low. Also, categorical variables that judge size (small, medium, large, etc.) are ordinal variables.
The logical operators are ∪,∩, <, >, = .
Data are at the interval level of measurement if they can be arranged in order, and differences
between data values can be found and are meaningful. Data at this level do not have a natural
zero starting point at which none of the quantity is present. The logical operators for interval
scale are ∪,∩, <, >, +, −, =
Data are at the ratio level of measurement if they can be arranged in order, differences can be
found and are meaningful, and there is a natural zero starting point (where zero indicates that
none of the quantity is present). For data at this level, differences and ratios are both meaningful.
The logical operators for ratio scale are ∪, ∩, <, >, +, −,÷, =
Figure 1: Summary of data types and scale measures
Exercise
Indicate the scale of measurement for the following set of data.
a. January, February, ..., June
b. Single, Married, Divorced
c. 30, 45, 15, 12
d. 10th Feb, 12th Aug, 25th Sep
e. First class honors, Second class honors, Pass
f. 5 cm, 25 cm, 10 cm, 15 cm
g. 30°C, 37°C, 19°C
h. Christian, Muslim, Hindu, Jewish
i. $559, $870, $170
j. 3 km, 8 km, 4 km, 10 km
k. Breakfast, Lunch, Dinner
l. CPP, PNDC, NPP, NDC
Numerical Descriptive Measure
• Measure of central tendencies
• Measure of dispersion
• Measure of position
• Measure of shape
Measure of Central Tendencies
These are the averages that locate the center or middle of the data.
The Arithmetic Mean
This is the best known and most commonly used average. Let $x_1, x_2, x_3, \dots, x_n$ be a data set. The
mean is given as
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{1}$$
When the data are grouped in a frequency distribution, $x_i$ becomes the class mark (midpoint)
of the $i$th class, which has frequency $f_i$, and the mean becomes $\bar{x} = \sum f_i x_i / \sum f_i$.
Assumed Mean
The mean may also be computed using
$$\bar{x} = A + \frac{\sum f_i d_i}{\sum f_i} \tag{2}$$
where
$A$ = assumed mean and $d_i = x_i - A$ is the deviation of the $i$th class mark from the assumed mean.
Weighted Mean
We sometimes associate with each observation a certain weighting factor or weight, depending on
the significance attached to the observation. Let the values $x_1, x_2, x_3, \dots, x_n$ be the set of data
with weights $w_1, w_2, w_3, \dots, w_n$ respectively. Then the weighted mean is given as
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \tag{3}$$
Median
The median is the middle-ranked value of an ordered array of data. It divides the data set into two
equal parts after the observations have been arranged in order of magnitude. Let $x_1, x_2, x_3, \dots, x_n$
be the observations arranged in increasing order of magnitude. The median, denoted $M$, is
defined as
$$M = \begin{cases} x_{(n+1)/2}, & \text{if } n \text{ is odd} \\ \dfrac{x_{n/2} + x_{n/2+1}}{2}, & \text{if } n \text{ is even} \end{cases} \tag{4}$$
For grouped data, the median is given as
$$M = L_m + \left(\frac{\frac{n}{2} - f_{cm}}{f_m}\right) C_m \tag{5}$$
where
$L_m$ = lower class boundary of the median class
$f_{cm}$ = cumulative frequency just before the median class
$f_m$ = frequency of the median class
$C_m$ = class width of the median class
$n$ = total number of observations (total frequency)
Mode
The mode is defined as the most frequently observed value of a given set of observations. For
grouped data, the mode is given as
$$\text{Mode} = LCB + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) C \tag{6}$$
where,
LCB = Lower Class Boundary of the modal class
$\Delta_1$ = Absolute difference between the frequency of the modal class and the pre-modal class
$\Delta_2$ = Absolute difference between the frequency of the modal class and the post-modal class
C = width of the modal class
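The ungrouped definitions above can be checked with a short Python sketch. It uses the standard `statistics` module with the exam scores and the course marks quoted in the examples of this section; the variable names are illustrative only, and the weighted mean is written out by hand as the sum of weight times value divided by the sum of the weights.

```python
from statistics import mean, median, multimode

# Exam scores from Example 1 of this section.
scores = [87, 63, 91, 72, 80, 77, 93, 69, 75, 79, 70, 83, 94, 75, 88]

mean_score = mean(scores)      # sum of the values divided by n
median_score = median(scores)  # middle value of the sorted data (n = 15 is odd)
modes = multimode(scores)      # list of the most frequent value(s)

# Weighted mean: examination marks from Example 5, weighted by course credits.
marks = [56, 68, 65, 70, 78, 80]
credit_hours = [4, 3, 3, 4, 3, 2]
weighted_mean = sum(w * x for w, x in zip(credit_hours, marks)) / sum(credit_hours)

print(round(mean_score, 2), median_score, modes, round(weighted_mean, 2))
```

`multimode` returns a list because a data set can have more than one mode.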
Example
1. The following data represent the scores on a statistics examination of a sample of students: 87,
63, 91, 72, 80, 77, 93, 69, 75, 79, 70, 83, 94 , 75, 88. Find the mean and median mark.
2. The average fuel efficiencies, in miles per gallon, of cars sold in the United States in the years
1999 to 2003 were 28.2, 28.3, 28.4, 28.5, 29.0 Find the sample mean of this set of data.
3. The following is a frequency table of the ages of a sample of members of a symphony for young
adults.
Age Value | Frequency
16 | 9
17 | 12
18 | 15
19 | 8
20 | 10
a. Find the sample mean of the given ages
4. A company runs two manufacturing plants. A sample of 30 engineers at plant 1 yielded a
sample mean salary of $33,600. A sample of 20 engineers at plant 2 yielded a sample mean
salary of $42,400. What is the sample mean salary for all 50 engineers?
5. A student’s final end of semester examination marks in six courses are: 56, 68, 65, 70, 78, 80.
If the credits for the courses are 4, 3, 3, 4, 3, 2 respectively, determine the approximate average
mark.
6. The distribution below gives measurements on 40 different subjects.
Class Interval | No. of Subjects (fi)
110-119 | 4
120-129 | 6
130-139 | 3
140-149 | 5
150-159 | 10
160-169 | 4
170-179 | 8
Using an assumed mean of A = 144.5, compute the mean, mode and median.
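One possible way to work the grouped-data question is sketched below in Python. It assumes that the class boundaries lie 0.5 below and above the stated integer class limits, and that the modal class is neither the first nor the last class (true for this table). All names are illustrative.

```python
# Class limits and frequencies from the 40-subject table above.
classes = [(110, 119), (120, 129), (130, 139), (140, 149),
           (150, 159), (160, 169), (170, 179)]
freqs = [4, 6, 3, 5, 10, 4, 8]

n = sum(freqs)                                  # total frequency
width = classes[1][0] - classes[0][0]           # class width C
marks = [(lo + hi) / 2 for lo, hi in classes]   # class marks x_i
lower_bounds = [lo - 0.5 for lo, _ in classes]  # lower class boundaries (assumed)

# Mean by the assumed-mean method: A + sum(f_i * d_i) / sum(f_i), with d_i = x_i - A.
A = 144.5
grouped_mean = A + sum(f * (x - A) for f, x in zip(freqs, marks)) / n

# Median: the median class is the first class whose cumulative frequency reaches n/2.
cum = 0
for i, f in enumerate(freqs):
    if cum + f >= n / 2:
        grouped_median = lower_bounds[i] + (n / 2 - cum) / f * width
        break
    cum += f

# Mode: the modal class has the highest frequency (assumed to be an interior class).
m = freqs.index(max(freqs))
d1 = abs(freqs[m] - freqs[m - 1])  # difference from the pre-modal class
d2 = abs(freqs[m] - freqs[m + 1])  # difference from the post-modal class
grouped_mode = lower_bounds[m] + d1 / (d1 + d2) * width

print(grouped_mean, grouped_median, round(grouped_mode, 2))
```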
Measure of Dispersion
The degree to which the numerical data tend to spread about an average is the dispersion or
variation of the data. Numerous measures of dispersion exist, the most common being the
range, mean deviation, variance (or standard deviation), quartile deviation and coefficient of
variation.
Range
The range is the simplest measure of dispersion. The range of a set of measurements $x_1, x_2, x_3, \dots, x_n$
is defined as the difference between the largest and smallest measurements. In the case of
grouped data, the range is defined as the difference between the last and the first class marks.
Mean Deviation
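The mean deviation (MD) is a measure of the average amount by which the observations
$x_1, x_2, x_3, \dots, x_n$ differ from the arithmetic mean, $\bar{x}$. It is given as
$$MD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}| \tag{7}$$
and, for grouped data with class marks $x_i$ and frequencies $f_i$,
$$MD = \frac{\sum f_i |x_i - \bar{x}|}{\sum f_i} \tag{8}$$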
Variance and Standard Deviation
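The variance of a set of observations $x_1, x_2, x_3, \dots, x_n$ is the average of the squared deviations from
the arithmetic mean. It is denoted by $\sigma^2$ for population data and by $S^2$ for sample data. The
variance is given as
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \tag{9}$$
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{10}$$
where $\mu$ is the population mean and $N$ is the population size.
For grouped data the variance is given as
$$\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{\sum f_i} \tag{11}$$
$$S^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i - 1} \tag{12}$$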
Note
The standard deviation is defined as the positive square root of the variance.
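As a small illustration of these measures, the Python sketch below applies them to the fuel-efficiency figures from the earlier example (28.2, 28.3, 28.4, 28.5, 29.0). It assumes the usual conventions of a divisor of N for the population variance and n − 1 for the sample variance, which is what `statistics.pvariance` and `statistics.variance` implement. The mean deviation is written out by hand.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Fuel efficiencies (miles per gallon) from the earlier example.
mpg = [28.2, 28.3, 28.4, 28.5, 29.0]

x_bar = mean(mpg)
data_range = max(mpg) - min(mpg)                              # largest minus smallest
mean_deviation = sum(abs(x - x_bar) for x in mpg) / len(mpg)  # average absolute deviation

pop_var = pvariance(mpg)   # treats the data as a population (divisor N)
samp_var = variance(mpg)   # treats the data as a sample (divisor n - 1)
pop_sd = pstdev(mpg)       # positive square root of pop_var
samp_sd = stdev(mpg)       # positive square root of samp_var

print(round(mean_deviation, 3), round(pop_var, 4), round(samp_var, 4))
```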
Co-efficient of Variation
The standard deviation is useful as a measure of dispersion within a given set of data.
Sometimes, we may be interested in comparing variations between two or more sets of data. The
standard deviation or the variance can be used for this purpose when the variables are given in
the same units and are such that their means are approximately equal. This is not the case when,
for instance, comparing the distributions of annual incomes and absenteeism for a group of employees. In order to make
a meaningful comparison of the dispersion in incomes and absenteeism, we need to convert each
of these standard deviations to a relative value. This relative measure of dispersion is called the
co-efficient of variation (CV). The co-efficient of variation is defined as
$$CV = \frac{S}{\bar{x}} \times 100\% \tag{13}$$
Example
1. During the past few months, one runner averaged 12 miles per week with a standard
deviation of 2 miles, while another runner averaged 24 miles per week with a standard
deviation of 3 miles. Which of the two runners is relatively more consistent in his weekly
running habits?
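The comparison in this example can be sketched in Python as follows, taking the coefficient of variation to be the standard deviation divided by the mean, expressed as a percentage. The function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Relative dispersion: standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cv_first = coefficient_of_variation(2, 12)    # runner averaging 12 miles per week
cv_second = coefficient_of_variation(3, 24)   # runner averaging 24 miles per week

# The runner with the smaller CV is relatively more consistent.
more_consistent = "second" if cv_second < cv_first else "first"
print(round(cv_first, 2), round(cv_second, 2), more_consistent)
```

Although the second runner has the larger standard deviation in miles, the relative spread is smaller, which is the point of using a relative measure.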
Measures of Position
1. Quartiles
2. Deciles
3. Percentiles
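The general formula for a measure of position is given as
$$\text{Value at position } P = LCB + \left(\frac{P - f_{cm}}{f}\right) C \tag{14}$$
where
LCB is the lower class boundary of the class containing the position
P is the position (for example, $kn/4$ for the $k$th quartile, $kn/10$ for the $k$th decile, and $kn/100$ for the $k$th percentile, where $n$ is the total frequency)
C is the width of the class containing the position
fcm is the cumulative frequency just before the class containing the position
f is the frequency of the class containing the position.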
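A minimal Python sketch of a grouped-data position measure is given below. It assumes the interpolation form LCB + ((P − fcm) / f) × C, where f is the frequency of the class containing the position, and it takes the position of the kth quartile to be kn/4. The function name and these conventions are assumptions for illustration; the data are the 40-subject table from the central tendency examples.

```python
def grouped_position(lower_bounds, freqs, width, position):
    """Interpolated value at a cumulative position P in a grouped frequency table."""
    cum = 0  # cumulative frequency before the current class (fcm)
    for lcb, f in zip(lower_bounds, freqs):
        if f > 0 and cum + f >= position:
            return lcb + (position - cum) / f * width
        cum += f
    raise ValueError("position exceeds the total frequency")

# Lower class boundaries and frequencies for classes 110-119, ..., 170-179.
lower_bounds = [109.5, 119.5, 129.5, 139.5, 149.5, 159.5, 169.5]
freqs = [4, 6, 3, 5, 10, 4, 8]
n = sum(freqs)

q1 = grouped_position(lower_bounds, freqs, 10, n / 4)      # first quartile
q2 = grouped_position(lower_bounds, freqs, 10, n / 2)      # median
q3 = grouped_position(lower_bounds, freqs, 10, 3 * n / 4)  # third quartile
print(q1, q2, q3)
```

With P = n/2 the function reduces to the grouped median formula.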
Measure of shape
Measures of shape determine whether the distribution of data exhibits a symmetric pattern or
stretches out in a particular direction. Two such measures of shape are the skewness and
kurtosis.
Skewness
The skewness of a distribution indicates its degree of symmetry or nonsymmetry. It is measured
by the Pearson co-efficient of skewness (Sk), defined by
$$Sk = \frac{3(\bar{x} - M)}{S.D} \tag{15}$$
where
$\bar{x}$ is the mean
M is the median
S.D is the standard deviation
Interpretation
If Sk = 0, the distribution is said to be symmetric.
If Sk > 0, the distribution is said to be skewed to the right.
If Sk < 0, the distribution is said to be skewed to the left.
Figure 2: Graph of symmetrical and Non-symmetrical distributions
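Co-efficient of Peakedness
The degree of peakedness or kurtosis of a distribution is described by the coefficient of kurtosis, k,
defined by
$$k = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4}{\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^2} \tag{16}$$
which is compared to the value 3.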
Interpretation
If k = 3, the distribution is said to be normal.
If k < 3, the distribution is less peaked than the normal distribution
If k > 3, the distribution is more peaked than the normal distribution
Figure 3: Graphs of distributions indicating their peakedness
Example
The following data gives the total number of fires in Ontario, Canada, for 11 months in the year
2002: 6, 13, 5, 7, 7, 3, 7, 2, 5, 9, 8. Compute the co-efficient of skewness.
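A possible way to work this example in Python is sketched below. It assumes Pearson's coefficient in the form 3 × (mean − median) / standard deviation with the sample standard deviation (the notes do not say whether the sample or population version is intended). For comparison it also computes a moment-based kurtosis coefficient, the fourth central moment divided by the squared second central moment, which equals 3 for a normal distribution. Names are illustrative.

```python
from statistics import mean, median, stdev

# Monthly numbers of fires from the example above.
fires = [6, 13, 5, 7, 7, 3, 7, 2, 5, 9, 8]

x_bar = mean(fires)
M = median(fires)
s = stdev(fires)  # sample standard deviation (assumption)

# Pearson coefficient of skewness.
sk = 3 * (x_bar - M) / s

# Moment-based kurtosis (assumed form): m4 / m2**2, compared with 3.
n = len(fires)
m2 = sum((x - x_bar) ** 2 for x in fires) / n
m4 = sum((x - x_bar) ** 4 for x in fires) / n
k = m4 / m2 ** 2

print(round(sk, 3), round(k, 3))
```

Here the mean lies below the median, so the coefficient is negative, which the interpretation above reads as skewed to the left.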
EXERCISE
1. Here are grouped data for heights of 100 randomly selected male students;
Height (Inches) | Class Mark (x) | Frequency (f)
59.5-62.5 | 61 | 5
62.5-65.5 | 64 | 18
65.5-68.5 | 67 | 42
68.5-71.5 | 70 | 27
71.5-74.5 | 73 | 8
a. Determine the coefficient of
i. Skewness
ii. Peakedness
b. Interpret the results.
2. A study of the test scores for a course in Principles of Management and years of service of
the employees enrolled in a Business programme resulted in a mean score of 200 with
standard deviation, 40 and mean number of years of service of 20 with standard deviation
of 2. Compare the relative dispersion in the two distributions using the coefficient of
variation.
3. The variation in the annual incomes of executives is to be compared with the variation in
incomes of unskilled employees. For a sample of executives, the mean income is $500,000 with
a standard deviation of $50,000, while a sample of unskilled employees has a mean income of
$22,000 with a standard deviation of $2,200. Compute the coefficients of variation for a
meaningful comparison of variation in annual incomes.
4. Calculate the mean deviation for the following data
Class | Frequency
0-9 | 6
10-19 | 7
20-29 | 15
30-39 | 16
40-49 | 4
50-59 | 2
Compute the
i. Mode
ii. Median
iii. Variance
iv. Standard deviation.
Graphical display of data and descriptive statistics
When conducting a statistical study, the researcher must gather data for the particular variable
under study. For example, if a researcher wishes to study the number of people who were bitten
by poisonous snakes in a specific geographic area over the past several years, he or she has to
gather the data from various doctors, hospitals, or health departments.
To describe situations, draw conclusions, or make inferences about events, the researcher must
organize the data in some meaningful way. The most convenient method of organizing data is to
construct a frequency distribution.
After organizing the data, the researcher must present them so they can be understood by those
who will benefit from reading the study. The most useful method of presenting the data is by
constructing statistical charts and graphs. There are many different types of charts and graphs,
and each one has a specific purpose.
Frequency distribution
When working with large data sets, a frequency distribution (or frequency table) is often helpful
in organizing and summarizing data. A frequency distribution helps us to understand the nature
of the distribution of a data set.
A frequency distribution (or frequency table) shows how data are partitioned among several
categories (or classes) by listing the categories along with the number (frequency) of data values
in each of them.
Time (seconds) | Class Boundary | Frequency
75-124 | 74.5-124.5 | 11
125-174 | 124.5-174.5 | 24
175-224 | 174.5-224.5 | 10
225-274 | 224.5-274.5 | 3
275-324 | 274.5-324.5 | 2
The frequency for a particular class is the number of original values that fall into that class. For
example, the first class has a frequency of 11, so 11 of the service times are between 75 seconds
and 124 seconds, inclusive.
Lower class limits are the smallest numbers that can belong to each of the different classes. (The
above table has lower class limits of 75, 125, 175, 225, and 275.)
Upper class limits are the largest numbers that can belong to each of the different classes. (The
above table has upper class limits of 124, 174, 224, 274, and 324.)
Class boundaries are the numbers used to separate the classes, but without the gaps created by
class limits.
Class midpoints are the values in the middle of the classes. The above table has class midpoints
of 99.5, 149.5, 199.5, 249.5, and 299.5. Each class midpoint can be found by adding the lower
class limit to the upper class limit and dividing the sum by 2.
Class width is the difference between two consecutive lower class limits (or two consecutive
lower class boundaries) in a frequency distribution. The above table uses a class width of 50.
(The first two lower class limits are 75 and 125, and their difference is 50.)
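The definitions above can be illustrated with a short Python sketch that derives the class boundaries, midpoints, and width from the class limits in the table. It assumes that each boundary lies halfway across the gap between adjacent classes (a gap of 1 second here). The variable names are illustrative.

```python
# Class limits (lower, upper) from the service-time table above.
limits = [(75, 124), (125, 174), (175, 224), (225, 274), (275, 324)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = limits[1][0] - limits[0][1]

# Boundaries close the gap by extending each limit by half of it.
boundaries = [(lo - gap / 2, hi + gap / 2) for lo, hi in limits]

# Midpoint: (lower limit + upper limit) / 2.
midpoints = [(lo + hi) / 2 for lo, hi in limits]

# Width: difference between two consecutive lower class limits.
width = limits[1][0] - limits[0][0]

print(boundaries[0], midpoints, width)
```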
Graphical representation of data
After you have organized the data into a frequency distribution, you can present them in
graphical form. The purpose of graphs in statistics is to convey the data to the viewers in
pictorial form. It is easier for most people to comprehend the meaning of data presented
graphically than data presented numerically in tables or frequency distributions. This is
especially true if the users have little or no statistical knowledge.
Histogram
While a frequency distribution is a useful tool for summarizing data and investigating the
distribution of data, an even better tool is a histogram, which is a graph that is easier to interpret
than a table of numbers.
A histogram is a graph consisting of bars of equal width drawn adjacent to each other (unless
there are gaps in the data). The horizontal scale represents classes of quantitative data values, and
the vertical scale represents frequencies. The heights of the bars correspond to frequency values.
Important Uses of a Histogram
• Visually displays the shape of the distribution of the data
• Shows the location of the center of the data
• Shows the spread of the data
• Identifies outliers
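From the table above, the histogram is as follows.
[Figure: Histogram of the service times. Horizontal axis: time (seconds), with classes 74.5-124.5, 124.5-174.5, 174.5-224.5, 224.5-274.5, and 274.5-324.5. Vertical axis: frequency, scaled from 0 to 30.]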
The Ogive
Another graph that can be used represents the cumulative frequencies for the classes. This type of graph
is called the cumulative frequency graph, or ogive. The cumulative frequency is the sum of the
frequencies accumulated up to the upper boundary of a class in the distribution.
The ogive is a graph that represents the cumulative frequencies for the classes in a frequency
distribution.
With the same table above,
Time (seconds) | Time less than | Cumulative frequency
(none) | Less than 74.5 | 0
75-124 | Less than 124.5 | 11
125-174 | Less than 174.5 | 35
175-224 | Less than 224.5 | 45
225-274 | Less than 274.5 | 48
275-324 | Less than 324.5 | 50
[Figure: Ogive of the service times. Horizontal axis: times less than (seconds), scaled from 0 to 350. Vertical axis: cumulative frequency, scaled from 0 to 60.]
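The cumulative frequencies plotted in an ogive can be produced with a running sum. The Python sketch below uses `itertools.accumulate` on the class frequencies of the service-time table and pairs each total with the upper class boundary it belongs to. The variable names are illustrative.

```python
from itertools import accumulate

# Upper class boundaries and class frequencies from the service-time table.
upper_boundaries = [124.5, 174.5, 224.5, 274.5, 324.5]
freqs = [11, 24, 10, 3, 2]

# Running total of the frequencies up to each upper boundary.
cumulative = list(accumulate(freqs))

# Points of the ogive, starting at zero at the first lower boundary.
ogive_points = [(74.5, 0)] + list(zip(upper_boundaries, cumulative))

for boundary, total in ogive_points:
    print(f"Less than {boundary}: {total}")
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the table.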
The Pie Graph
The purpose of the pie graph is to show the relationship of the parts to the whole by visually
comparing the sizes of the sections. Percentages or proportions can be used. The variable is
nominal or categorical.
A pie graph is a circle that is divided into sections or wedges according to the percentage of
frequencies in each category of the distribution.
Example
This frequency distribution shows the calls received during each shift by a local municipality for a recent
year. Construct a pie graph for the data.
Shift | Frequency | Angle of sector | Percentage
Day | 2594 | (2594/7830) × 360° = 119° | (2594/7830) × 100% = 33%
Evening | 2800 | 129° | 36%
Night | 2436 | 112° | 31%
The frequency for each class must be converted to a proportional part of the circle. This
conversion is done by using the formula
$$\text{Degrees} = \frac{f}{n} \cdot 360^\circ$$
where f = frequency for each class and n = sum of the frequencies. Here the total frequency is
n = 7830, which gives the conversions in the table above. The degrees should sum to 360°.
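The conversion can be sketched in Python as follows, using the shift frequencies from the table and rounding to the nearest whole degree and whole percent as the table does. The dictionary names are illustrative.

```python
# Calls received per shift, from the table above.
calls = {"Day": 2594, "Evening": 2800, "Night": 2436}
total = sum(calls.values())  # n, the sum of the frequencies

# Degrees = f / n * 360 and Percentage = f / n * 100, rounded to whole numbers.
degrees = {shift: round(f / total * 360) for shift, f in calls.items()}
percent = {shift: round(f / total * 100) for shift, f in calls.items()}

print(total, degrees, percent)
```

With other data, rounding each sector separately can leave the angles summing to 359° or 361°, so the total is worth checking before drawing the graph.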
Using a protractor, graph each section and write its name and corresponding percentage as
shown.
[Figure: Pie graph of call frequency by shift: Day 33%, Evening 36%, Night 31%.]
Introduction to Probability Theory
Terminologies and Notations
Experiment: An experiment is any process that generates well-defined outcomes. There are two
types of experiments, namely; deterministic and random (or chance) experiment.
In deterministic experiments, the observed results are not subject to chance, while the
outcomes of random experiments cannot be predicted with certainty.
Trial: A trial is a single performance of an experiment (that is, a repetition of an experiment).
Outcome: The possible result of each trial of an experiment is called outcome.
Sample Space: It is the set of all possible outcomes of an experiment. It is denoted by the letter,
S.
Event: An event is a collection of one or more outcomes from an experiment which is a subset of
a sample space. It is denoted by a capital letter.
(17)
Axioms of Probability
• Axiom 1: For every event A, 0 ≤ P(A) ≤ 1
• Axiom 2: For every event A, P(A) ≥ 0
• Axiom 3: For the sure or certain event S, P(S) = 1
• Axiom 4: For any number of mutually exclusive events A1,A2,··· , P(A1 ∪ A2 ∪ ···) = P(A1)
+ P(A2) + ···
Some Important Definitions in Probability
• Events A and B are mutually exclusive if both cannot occur at the same time, i.e., A ∩ B = ∅
and P(A ∩ B) = 0
• A and B are independent events if and only if P(A∩B) = P(A)×P(B).
• If ∅ is an empty set, then P(∅) = 0
• If A′ is the complement of an event A, then P(A′) = 1 − P(A)
• If A and B are any two events, then P(A∪B) = P(A)+P(B)−P(A∩B)
• Conditional Probability: Let A and B be two events in the sample space, S with P(B) > 0.
The probability that an event A occurs given that event B has already occurred, denoted
P(A|B), is called the conditional probability of A given B. The conditional probability of A
given B is defined as
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{18}$$
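These definitions can be tried out on a small sample space. The Python sketch below is an illustration only: the experiment (one roll of a fair die) and the events A and B are assumptions chosen for the example and are not taken from the notes. Probabilities are kept exact with `fractions.Fraction`.

```python
from fractions import Fraction

# Hypothetical experiment: one roll of a fair die, all outcomes equally likely.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Classical probability: favourable outcomes over total outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

p_union = prob(A) + prob(B) - prob(A & B)       # addition rule
p_complement = 1 - prob(A)                      # complement rule
p_a_given_b = prob(A & B) / prob(B)             # conditional probability
independent = prob(A & B) == prob(A) * prob(B)  # independence check

print(p_union, p_complement, p_a_given_b, independent)
```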
EXERCISE
1. A box contains three balls. One red, one blue, and one yellow. Consider an experiment that
consists of withdrawing a ball from the box, replacing it, and withdrawing a second ball.
a. What is the sample space of this experiment?
b. What is the event that the first ball drawn is yellow?
c. What is the event that the same ball is drawn twice?
2. An experiment consists of flipping a coin three times and each time noting whether it lands
heads or tails.
a. What is the sample space of this experiment?
b. What is the event that tails occur more often than heads?
3. Suppose a coin is flipped twice. Assume that all four possibilities are equally likely to occur.
Find the conditional probability that both coins land heads given that the first one does.
Application of Counting Techniques
In a sample space with a large number of outcomes, determining the number of outcomes
associated with the events through direct enumeration could be tedious. In this section, we
develop some counting techniques and use them in probability computations. We shall examine
three basic counting techniques, namely the Multiplication Principle, Permutation and
Combination.
The Multiplication Principle
If an operation can be performed in n1 ways, a second operation can be performed in n2 ways, and
so on up to the kth operation, which can be performed in nk ways, then the combined experiment or
operations can be performed in n1 × n2 × n3 × ··· × nk ways.
Example
• How many different 7-place license plates are possible if the first 3 places are to be occupied
by letters and the final 4 by numbers?
Solution
By the generalized version of the basic principle, the answer is 26 · 26 · 26 · 10 · 10 · 10 ·
10 = 175,760,000.
• How many license plates would be possible if repetition among letters or numbers were
prohibited?
Solution.
In this case, there would be 26·25·24·10·9·8·7 = 78,624,000 possible license plates.
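Both license-plate counts can be reproduced in a few lines of Python. The sketch uses `math.perm` (available from Python 3.8) for the case without repetition, since choosing 3 different letters in order from 26 is the product 26 × 25 × 24. The variable names are illustrative.

```python
from math import perm

# Repetition allowed: 26 choices for each letter place, 10 for each digit place.
with_repetition = 26 ** 3 * 10 ** 4

# Repetition prohibited: ordered choices without replacement.
without_repetition = perm(26, 3) * perm(10, 4)

print(f"{with_repetition:,}", f"{without_repetition:,}")
```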
Permutation
An ordered arrangement of objects is called a permutation. The number of permutations of
• n distinct objects taken all together is $n! = n(n-1)(n-2) \times \cdots \times 3 \times 2 \times 1$
• n distinct objects taken k at a time is
$${}^{n}P_{k} = \frac{n!}{(n-k)!}, \quad \text{where } k \le n$$
• n objects consisting of groups of which n1 of the first group are alike, n2 of the second group
are alike, and so on up to the kth group with nk objects which are alike is
$$\frac{n!}{n_1!\, n_2! \cdots n_k!}$$
• n distinct objects arranged in a circle, called circular permutations, is given by
$$(n-1)! \tag{20}$$
Example
• How many distinct three-digit numbers can be formed using the digits
2, 4, 6, and 8 if no digit can be repeated?
Solution
The number of distinct three-digit numbers will be 4P3 = 4!/(4 − 3)! = 24.
• How many different letter arrangements can be formed from the letters PEPPER?
Solution
The 6-letter word PEPPER has 3 P's, 2 E's and 1 R. Hence, there are 6!/(3! 2! 1!) = 60
possible letter arrangements of the letters PEPPER.
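The permutation formulas can be checked in Python with the standard `math` module. In the sketch below the helper `arrangements_with_repeats` and the five-object circular example are illustrative assumptions, not part of the notes.

```python
from collections import Counter
from math import factorial, perm

# Three-digit numbers from the digits 2, 4, 6, 8 without repetition: 4P3.
three_digit_numbers = perm(4, 3)

def arrangements_with_repeats(word):
    """n! divided by the factorial of each group of identical letters."""
    total = factorial(len(word))
    for count in Counter(word).values():
        total //= factorial(count)
    return total

pepper = arrangements_with_repeats("PEPPER")

# Circular permutations of n distinct objects: (n - 1)!. Here n = 5 (hypothetical).
circular_five = factorial(5 - 1)

print(three_digit_numbers, pepper, circular_five)
```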
Combination
A combination is a selection of objects in which the order of selection does not matter. The
number of ways in which k objects can be selected from n distinct objects, irrespective of their
order, is defined by
$${}^{n}C_{k} = \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \tag{21}$$
Example
• The number of ways of choosing a committee of 5 from 9 persons is 9C5 = 9!/(5! 4!) = 126.
• In a tank containing 10 fishes, there are three yellow and seven black fishes. We select three
fishes at random.
a. What is the probability that exactly one yellow fish gets selected?
b. What is the probability that at least one yellow fish gets selected?
Solution
Let A be the event that exactly one yellow fish gets selected, and B be the event that at least
one yellow fish gets selected. There are 10C3 = 120 ways to select three fishes from 10.
a. There are 3C1 = 3 ways to select a yellow fish and 7C2 = 21 ways to select two black fishes.
By the multiplication rule, the probability of selecting exactly one yellow fish is
P(A) = (3 × 21)/120 = 63/120 = 0.525.
b. The probability that at least one yellow fish gets selected is the same as
1 − P(none), which is 1 − 0.292 = 0.708.
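The counts in these examples can be verified with `math.comb` (Python 3.8 or later). The sketch below recomputes the committee count and both fish probabilities; the variable names are illustrative.

```python
from math import comb

# Committees of 5 chosen from 9 persons.
committees = comb(9, 5)

# Three fishes selected from 10 (3 yellow, 7 black).
total_selections = comb(10, 3)

# Exactly one yellow: choose 1 of 3 yellow and 2 of 7 black.
p_exactly_one = comb(3, 1) * comb(7, 2) / total_selections

# At least one yellow: complement of selecting no yellow fish at all.
p_none = comb(7, 3) / total_selections
p_at_least_one = 1 - p_none

print(committees, total_selections, p_exactly_one, round(p_at_least_one, 3))
```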
Probability Distributions
Basic Concepts of a Probability Distribution
A random variable is a variable (typically represented by x) that has a single numerical value,
determined by chance, for each outcome of a procedure.
A probability distribution is a description that gives the probability for each value of the
random variable. It is often expressed in the format of a table, formula, or graph.
Random variables may be discrete or continuous.
A discrete random variable has a collection of values that is finite or countable. (If there are
infinitely many values, the number of values is countable if it is possible to count them
individually, such as the number of tosses of a coin before getting heads.)
A continuous random variable has infinitely many values, and the collection of values is not
countable. (That is, it is impossible to count the individual items because at least some of them
are on a continuous scale, such as body temperatures.)
Probability Distribution: Requirements
Every probability distribution must satisfy each of the following three requirements.
1. There is a numerical (not categorical) random variable x, and its numerical values are associated with corresponding probabilities.
2. ∑ P(x) = 1, where x assumes all possible values. (The sum of all probabilities must be 1, but sums such as 0.999 or 1.001 are acceptable because they result from rounding errors.)
3. 0 ≤ P(x) ≤ 1 for every individual value of the random variable x. (That is, each probability value must be between 0 and 1 inclusive.)
The second requirement comes from the simple fact that the random variable x represents all
possible events in the entire sample space, so we are certain (with probability 1) that one of the
events will occur. The third requirement comes from the basic principle that any probability
value must be 0 or 1 or a value between 0 and 1.
Example
Construct probability distributions for the following random variables:
I. The number of heads when four fair coins are tossed.
II. The difference between the results of two fair dice rolled together.
Solution
I. The sample space for tossing four fair coins is S = {HHHH, HHHT, HHTH, HHTT, HTHH, HTHT, HTTH, HTTT, THHH, THHT, THTH, THTT, TTHH, TTHT, TTTH, TTTT}.
The random variable X is the number of heads occurring in that experiment, which assumes the values X = 0, 1, 2, 3, 4. The required probability distribution is
x       0      1      2      3      4
P(x)    1/16   4/16   6/16   4/16   1/16
II. The table below shows the (absolute) difference between the two results for all possible pairs of outcomes (rows: Dice 1, columns: Dice 2).
Dice 1 \ Dice 2    1    2    3    4    5    6
1                  0    1    2    3    4    5
2                  1    0    1    2    3    4
3                  2    1    0    1    2    3
4                  3    2    1    0    1    2
5                  4    3    2    1    0    1
6                  5    4    3    2    1    0
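The random variable X is the absolute difference occurring in that experiment, which assumes the values X = 0, 1, 2, 3, 4, 5. Counting how often each difference appears among the 36 equally likely outcomes, the required probability distribution is
X       0      1       2      3      4      5
P(x)    6/36   10/36   8/36   6/36   4/36   2/36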
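Both distributions in this example can be built by listing every equally likely outcome and counting. The sketch below is an illustration only; the helper name `distribution` is hypothetical, and the difference of the two dice is taken in absolute value.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def distribution(outcomes, value):
    """Return {value: probability} when all outcomes are equally likely."""
    outcomes = list(outcomes)
    counts = Counter(value(o) for o in outcomes)
    return {v: Fraction(c, len(outcomes)) for v, c in sorted(counts.items())}

# I. Number of heads in four tosses of a fair coin (16 outcomes).
heads = distribution(product("HT", repeat=4), lambda toss: toss.count("H"))

# II. Absolute difference between two fair dice (36 outcomes).
diff = distribution(product(range(1, 7), repeat=2), lambda d: abs(d[0] - d[1]))

for name, dist in (("heads", heads), ("difference", diff)):
    print(name, {v: str(p) for v, p in dist.items()})
```

`Fraction` reduces each probability to lowest terms, so 6/16 is reported as 3/8 and 10/36 as 5/18.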
Example
Let’s consider tossing two coins, with the following random variable:
x = number of heads when two coins are tossed
The above x is a random variable because its numerical values depend on chance.
With two coins tossed, the number of heads can be 0, 1, or 2, and the Table below is a
probability distribution because it gives the probability for each value of the random variable x
and it satisfies the three requirements listed earlier:
1. The variable x is a numerical random variable, and its values are associated with probabilities, as in the table below.
2. ∑ P(x) = 0.25 + 0.50 + 0.25 = 1
3. Each value of P(x) is between 0 and 1. (Specifically, 0.25 and 0.50 and 0.25 are each between 0 and 1 inclusive.)
The random variable x in the Table below is a discrete random variable, because it has three
possible values (0, 1, 2), and three is a finite number, so this satisfies the requirement of
being finite or countable.
Probability Distribution for the Number of Heads in Two Coin Tosses
x: Number of Heads When Two Coins Are Tossed    P(x)
0                                               0.25
1                                               0.50
2                                               0.25
EXERCISE
1. The daily demand of cake at a bakery at the beginning of the day has the probability
function given by
X       0      1      2      3      4      5
P(x)    0.15   0.20   0.35   0.15   0.10   0.05
Let X denote the number of cakes demanded.
I. Verify that it is a probability mass function.
II. Find the probability that there will be at most 3 orders.
2. If P(x) is a probability mass function, find k
,
elsewhere
3. The random variable k has the probability function;
, k = 1,2,3,...
Compute the value of b, the expected value and variance of k.
Continuous Probability Distribution
A function f(x) is said to be the probability density function of the continuous random variable x if, for an interval of real numbers [a, b], the following properties are satisfied:
• f(x) ≥ 0 for any value of x
• $\int_{-\infty}^{\infty} f(x)\,dx = 1$
• $P(a \le x \le b) = \int_{a}^{b} f(x)\,dx$, where −∞ ≤ a ≤ x ≤ b ≤ ∞
Example
a. Let x be a continuous random variable with probability density function:
, elsewhere
Determine the value of k.
b. Determine the value of k and hence compute the probabilities P(1 ≤ x ≤ 2) and P(x > 2), where
$f(x) = \begin{cases} kx, & 0 \le x \le 3,\ k > 0 \\ 3k(4 - x), & 3 < x \le 4 \\ 0, & \text{elsewhere} \end{cases}$
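For part (b), k follows from the requirement that the total area under f(x) equals 1. The sketch below is one way to carry out the arithmetic exactly, assuming the density f(x) = kx on [0, 3] and 3k(4 − x) on (3, 4]; the helper names are hypothetical and the antiderivatives are written out by hand.

```python
from fractions import Fraction

# Areas under each piece with the factor k taken out:
#   integral of x from a to b         = (b^2 - a^2) / 2
#   integral of 3(4 - x) from a to b  = 3[(4b - b^2/2) - (4a - a^2/2)]
def area_first(a, b):
    a, b = Fraction(a), Fraction(b)
    return (b * b - a * a) / 2

def area_second(a, b):
    a, b = Fraction(a), Fraction(b)
    return 3 * ((4 * b - b * b / 2) - (4 * a - a * a / 2))

# Total area k * (area_first + area_second) must equal 1.
k = 1 / (area_first(0, 3) + area_second(3, 4))

p_1_to_2 = k * area_first(1, 2)
p_gt_2 = k * (area_first(2, 3) + area_second(3, 4))
print(k, p_1_to_2, p_gt_2)
```

Under these assumptions the two pieces contribute areas 9k/2 and 3k/2, so 6k = 1.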
Cumulative Distribution Function
The cumulative distribution function (cdf) for a random variable x, denoted F(x), is defined by F(x) = P(X ≤ x). If x is a discrete random variable with probability mass function P(x), then $F(x) = \sum_{t \le x} P(t)$, which is a step function. If X is, however, a continuous random variable with probability density function f(x), then
$F(x) = \int_{-\infty}^{x} f(t)\,dt$, where −∞ ≤ x ≤ ∞,
and P(x1 ≤ x ≤ x2) = F(x2) − F(x1)
Properties of F(x)
• F(a) ≤ F(b), whenever a ≤ b
• $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
• 0 ≤ F(x) ≤ 1
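For a discrete random variable, F(x) is simply a running total of the probabilities. The following minimal sketch uses the two-coin distribution from earlier in these notes; the function names `cdf_table` and `cdf` are illustrative assumptions.

```python
from itertools import accumulate

def cdf_table(pmf):
    """Return [(x, F(x))] for a discrete pmf given as {x: P(x)}."""
    xs = sorted(pmf)
    return list(zip(xs, accumulate(pmf[x] for x in xs)))

def cdf(pmf, x):
    """F(x) = P(X <= x): the sum of P(t) over all values t <= x."""
    return sum(p for t, p in pmf.items() if t <= x)

two_coins = {0: 0.25, 1: 0.50, 2: 0.25}
print(cdf_table(two_coins))
print(cdf(two_coins, 1.5))  # constant between jumps: same value as F(1)
```

Evaluating `cdf` between two possible values shows the step-function behaviour: F stays constant until the next value of x is reached.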
Exercise
1. A random variable X has the following distribution:
X       -5     0      3      6
P(x)    0.2    0.1    0.4    0.3
Find the cumulative distribution function F(x) .
2. The CDF of a discrete random variable X is given in the following table:
X       -1     0      2      5      6
F(x)    0.1    0.15   0.4    0.8    1
a. Find P(X = 2)
b. Find P(X > 0).
3. Let the function:
$f(x) = \begin{cases} Cx^2, & 0 < x < 3 \\ 0, & \text{elsewhere} \end{cases}$
a. Find the value of C so that f(x) is a density function.
b. Compute P(2 < X < 3).
c. Find the distribution function F(x).
4. The random variable X has a cumulative distribution function:
Find the probability density function of X.
Parameters of a Probability Distribution
Remember that with a probability distribution, we have a description of a population instead
of a sample, so the values of the mean, standard deviation, and variance are parameters, not
statistics. The mean, variance, and standard deviation of a discrete probability distribution
can be found with the following formulas
Mean µ for a probability distribution
$\mu = \sum x \cdot P(x)$, if x is discrete
$\mu = \int_{-\infty}^{\infty} x\, f(x)\,dx$, if x is continuous and −∞ ≤ x ≤ ∞
Variance σ² for a probability distribution
The variance of the random variable x with probability distribution P(x) or f(x) is defined by
$Var(x) = \sigma^2 = \sum (x - \mu)^2 \cdot P(x)$
or
$Var(x) = \sigma^2 = \sum x^2 \cdot P(x) - \mu^2$, if x is discrete
$Var(x) = \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$, if x is continuous.
Standard deviation σ is the positive square root of the variance.
Median of a distribution
The median of a distribution of the random variable x is that value x = m such that P(x ≤ m) = P(x ≥ m) = 0.5, or close to it. The median is obtained from the equation
I. $\sum_{x \le m} P(x) = 0.5$, or close to it, if x is discrete.
II. $\int_{a}^{m} f(x)\,dx = 0.5$, if x is continuous and is such that a ≤ x ≤ b.
Mode
The mode of a distribution of random variable x is that value of x = m0 that maximizes the
probability distribution function, p(x) or f(x).
Finding the Mean, Variance, and Standard Deviation
Using the table below, find the mean, variance and standard deviation.
x: Number of Heads When Two Coins Are Tossed    P(x)    x·P(x)    x²·P(x)
0                                               0.25    0         0
1                                               0.5     0.5       0.5
2                                               0.25    0.5       1
Total                                           1       1         1.5
Therefore the mean is
$\mu = \sum x \cdot P(x) = 1$
and the variance is
$\sigma^2 = \sum x^2 \cdot P(x) - \mu^2 = 1.5 - 1^2 = 0.5$
The standard deviation is √0.5 = 0.707
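The same computation can be written as a short routine. This is a sketch under the assumption that the distribution is supplied as a dictionary {x: P(x)}; the function name `summarize` is hypothetical.

```python
from math import sqrt

def summarize(pmf):
    """Return (mean, variance, standard deviation) of a discrete pmf {x: P(x)}."""
    mean = sum(x * p for x, p in pmf.items())
    # Shortcut formula: sum of x^2 * P(x) minus the squared mean.
    variance = sum(x * x * p for x, p in pmf.items()) - mean ** 2
    return mean, variance, sqrt(variance)

mean, variance, sd = summarize({0: 0.25, 1: 0.50, 2: 0.25})
print(mean, variance, round(sd, 3))
```

The shortcut form of the variance is used because it matches the x²·P(x) column of the table.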
Example
1. Let X take the value 1 with probability 1/2 and the value 0 with probability 1/2.
Then E(X) = 1(1/2) + 0(1/2) = 1/2
2. Let X be a discrete random variable whose probability mass function is given in the following table:
X
-1
0
1
2
3
P(x)
Find E(X) and the standard deviation of the random variable x
3. Let Y be a random variable with pdf
a. Find the expected value and variance of Y.
b. Let X = 300Y + 50. Find E(X) and Var(X), and
c. Find P(X > 750)
d. Determine the median
4
5
SPECIAL PROBABILITY DISTRIBUTIONS
Discrete Probability Distributions
1. Bernoulli Process
2. Binomial Distribution
3. Poisson Distribution
Continuous Probability Distributions
1. The Uniform Distribution
2. The Exponential Distribution
3. The Normal Distribution
Discrete Probability Distribution
Bernoulli Process
A random variable x is said to have a Bernoulli distribution if it assumes the values 0 and 1 for two outcomes. The probability distribution for the success in the trial, x, is defined by
$P(x) = p^x (1 - p)^{1 - x}$, x = 0, 1 and 0 < p < 1   (22)
where the mean and variance of the distribution are as follows
µ = E(x) = p, and σ² = Var(x) = p(1 − p)
Binomial Probability Distributions
Binomial probability distributions allow us to deal with circumstances in which the outcomes
belong to two categories, such as heads/tails or acceptable/defective or survived/died.
A binomial probability distribution results from a procedure that meets these four requirements:
1. The procedure has a fixed number of trials. (A trial is a single observation.)
2. The trials must be independent, meaning that the outcome of any individual trial doesn't affect the probabilities in the other trials.
3. Each trial must have all outcomes classified into exactly two categories, commonly referred to as success and failure.
4. The probability of a success remains the same in all trials.
Notation for Binomial Probability Distributions
S and F (success and failure) denote the two possible categories of all outcomes.
P(S) = p   (p = probability of a success)
P(F) = 1 − p = q   (q = probability of a failure)
n = the fixed number of trials
x = a specific number of successes in n trials, so x can be any whole number between 0 and n, inclusive
p = probability of success in one of the n trials
q = probability of failure in one of the n trials
P(x) = probability of getting exactly x successes among the n trials
In a binomial probability distribution, probabilities can be calculated by using the formula
$P(x) = \dbinom{n}{x} \cdot p^x \cdot q^{n-x}$, for x = 0, 1, 2, 3, …, n
where
n = number of trials
x = number of successes among n trials
p = probability of success in any one trial
q = probability of failure in any one trial (q = 1 − p)
$\dbinom{n}{x} = \dfrac{n!}{(n - x)!\, x!}$
Example
Given that there is a 0.85 probability that a randomly selected adult knows what Twitter is, use
the binomial probability formula to find the probability that when five adults are randomly
selected, exactly three of them know what Twitter is.
We are to find P(3) given that n = 5, x = 3, p = 0.85, and q = 0.15.
Using the formula
$P(x) = \dbinom{n}{x} \cdot p^x \cdot q^{n-x} = \dfrac{n!}{(n - x)!\, x!} \cdot p^x \cdot q^{n-x}$
$P(3) = \dfrac{5!}{(5 - 3)!\, 3!} \cdot (0.85)^3 \cdot (0.15)^{5-3}$
= (10)(0.614125)(0.0225) = 0.138178
= 0.138 (rounded to three significant digits)
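The binomial formula translates directly into code. The following is a minimal sketch; the function name `binomial_pmf` is an illustrative assumption, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x) = C(n, x) * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Twitter example: n = 5, x = 3, p = 0.85.
p3 = binomial_pmf(3, 5, 0.85)
print(round(p3, 3))

# The probabilities over x = 0, 1, ..., n should add up to 1.
total = sum(binomial_pmf(x, 5, 0.85) for x in range(6))
print(round(total, 6))
```

Summing over all values of x is a quick check that the second requirement of a probability distribution is met.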
Poisson Probability Distributions
The following definition states that Poisson distributions are used with occurrences of an event over a specified interval. Here are some applications:
• Number of Internet users logging onto a website in one day
• Number of patients arriving at an emergency room in one hour
• Number of Atlantic hurricanes in one year
A Poisson probability distribution is a discrete probability distribution that applies to occurrences of some event over a specified interval. The random variable x is the number of occurrences of the event in an interval. The interval can be time, distance, area, volume, or some similar unit. The probability of the event occurring x times over an interval is given by
$P(x) = \dfrac{\mu^x \cdot e^{-\mu}}{x!}$, x = 0, 1, 2, 3, … and µ > 0
where
e ≈ 2.71828
µ = mean number of occurrences of the event in the interval
The mean and variance are the same: E(x) = µ = Var(x). The distribution of x may simply be denoted as x ∼ P(µ).
Example
1. If X is a Poisson random variable with parameter λ = 2 (that is, mean µ = 2), find P(X = 0).
Solution
Using the facts that 2⁰ = 1 and 0! = 1,
we obtain P(X = 0) = e⁻² = 0.1353
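A short sketch of the Poisson formula follows, assuming the mean is passed as `mu`; the function name `poisson_pmf` is hypothetical. The second part checks numerically that the mean and variance coincide, truncating the infinite sum at x = 99, which is more than enough for µ = 2.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(x) = mu^x * e^(-mu) / x! for x = 0, 1, 2, ..."""
    return mu ** x * exp(-mu) / factorial(x)

p0 = poisson_pmf(0, 2)
print(round(p0, 4))

# Mean and variance from the (truncated) distribution.
mean = sum(x * poisson_pmf(x, 2) for x in range(100))
variance = sum(x * x * poisson_pmf(x, 2) for x in range(100)) - mean ** 2
print(round(mean, 6), round(variance, 6))
```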
Continuous Probability Distributions
Uniform Distribution
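A random variable X is said to have a uniform probability distribution on (a, b), denoted by U(a, b), if the density function of X is given by
$f(x) = \begin{cases} \dfrac{1}{b - a}, & a < x < b \\ 0, & \text{elsewhere} \end{cases}$
where the mean and variance are
$E(X) = \dfrac{a + b}{2}$ and $Var(X) = \dfrac{(b - a)^2}{12}$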
Figure 4: Uniform probability density
Example
1. If X is a uniformly distributed random variable over (0, 10), calculate the probability that
a. X < 3
b. X > 6
c. 3 < X < 8.
Solution
a. P(X < 3) = (3 − 0)/10 = 0.3
b. P(X > 6) = (10 − 6)/10 = 0.4
c. P(3 < X < 8) = (8 − 3)/10 = 0.5
2. You are to meet a friend at 2 p.m. However, while you are always exactly on time, your
friend is always late and indeed will arrive at the meeting place at a time uniformly
distributed between 2 and 3 p.m. Find the probability that you will have to wait
a. At least 30 minutes
b. Less than 15 minutes
c. Between 10 and 35 minutes
d. Less than 45 minutes
3. Buses arrive at a specified stop at 15-minute intervals starting at 7 A.M. That is, they arrive
at 7, 7:15, 7:30, 7:45, and so on. If a passenger arrives at the stop at a time that is uniformly
distributed between 7 and 7:30, find the probability that he waits
a. less than 5 minutes for a bus;
b. more than 10 minutes for a bus
4. The melting point, X, of a certain solid may be assumed to be a continuous random variable that is uniformly distributed between the temperatures 100 °C and 120 °C. Find the probability that such a solid will melt between 112 °C and 115 °C.
Exponential Distribution
A continuous random variable whose probability density function is given, for some λ > 0, by
,
elsewhere
is said to be an exponential random variable (or, more simply, is said to be exponentially
distributed) with parameter λ > 0 as the mean. The random variable x represents length of time
or space.
and
Example
1. The time, in hours, during which an electrical generator is operational is a random variable that follows the exponential distribution with λ = 160. What is the probability that a generator of this type will be operational for
a. Less than 40 hours?
b. Between 60 and 160 hours?
c. More than 200 hours?
2. The time (in hours) required to repair a machine is an exponentially distributed random
variable with parameter λ = 12. What is
a. the probability that a repair time exceeds 2 hours?
b. the conditional probability that a repair takes at least 10 hours, given that its duration
exceeds 9 hours?
3. The number of years a radio functions is exponentially distributed with parameter λ = 18. If
Jones buys a used radio, what is the probability that it will be working after an additional 8
years?
Normal Distribution
The probability density function for the normal random variable x, which is simply called the normal distribution, is defined by
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, −∞ < x < ∞
where σ > 0, E(x) = µ and Var(x) = σ².
A random variable modelled by the normal distribution with mean µ and variance σ² is denoted as x ∼ N(µ, σ²). The normal probability density function is a bell-shaped density curve that is symmetric about the value µ. Its variability is measured by σ. The larger σ is, the more variability there is in this curve. The figure below presents three different normal probability density functions. Note how the curves flatten out as σ increases.
Figure 5: Three normal probability density functions
Computations of Probabilities of Normal Random Variable
To compute the probability that x lies within the interval [a, b], P(a ≤ x ≤ b), the normal random variable x is standardized using the transformation
$z = \dfrac{x - \mu}{\sigma}$, called the Z-score
Example
1. Find the following probabilities using the normal table
i. P(z ≤ 1.95)
ii. P(−1.18 ≤ z ≤ 0.48)
iii. P(0 ≤ z ≤ 2.58)
iv. P(z > 2.63)
v. P(−2.35 ≤ z ≤ 2.35)
2. Suppose that y ∼ N(6, 4). What percentage of the values of y will fall between 5 and 10?
3. Let X ∼ N(12, 5). Find the value of x0 such that
a. P(X > x0) = 0.05
b. P(X < x0) = 0.98
c. P(X < x0) = 0.20
d. P(X > x0) = 0.90.
4. The scores, X, of an examination may be assumed to be normally distributed with µ = 70 and
σ2 = 49. What is the probability that:
a. A score chosen at random will be between 80 and 85?
b. A score will be greater than 75?
c. A score will be less than 90?
d. Interpret the meaning of (a), (b), and ( c ).
The Standard Normal Distribution
The Standard Score
The standard score, or z-score, represents the number of standard deviations a random variable x falls from the mean.
$z = \dfrac{x - \mu}{\sigma}$
Example
The test scores for a civil service exam are normally distributed with a mean of 152 and a
standard deviation of 7. Find the standard z-score for a person with a score of:
161, 148 and 152
Solution
$z = \dfrac{161 - 152}{7} = 1.29$
$z = \dfrac{148 - 152}{7} = -0.57$
$z = \dfrac{152 - 152}{7} = 0$
Finding Probabilities
To find the probability that z is less than a given value, read the cumulative area in the table corresponding to that z-score.
E.g. P(z < −1.45) = 0.0735
To find the probability that z is greater than a given value, subtract the cumulative area in the table from 1.
E.g. P(z > −1.24) = 1 − 0.1075 = 0.8925
To find the probability that z is between two given values, find the cumulative area for each and subtract the smaller area from the larger.
E.g. P(−1.25 < z < 1.17) = 0.8790 − 0.1056 = 0.7734
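The table look-ups above can be reproduced with the cumulative distribution function of the standard normal. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the helper `z_score` is an illustrative name. Small differences from the table values are due to rounding in the table.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

p_less = Z.cdf(-1.45)                    # P(z < -1.45)
p_greater = 1 - Z.cdf(-1.24)             # P(z > -1.24)
p_between = Z.cdf(1.17) - Z.cdf(-1.25)   # P(-1.25 < z < 1.17)

print(round(p_less, 4), round(p_greater, 4), round(p_between, 4))
print(round(z_score(161, 152, 7), 2), round(z_score(148, 152, 7), 2))
```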
Exercise
1. (a) Find the following probabilities using the normal table
i. P(z ≤ −1.95)
ii. P(−1.18 ≤ z ≤ 0.48)
iii. P(0 ≤ z ≤ 2.58)
iv. P(z > 2.63)
v. P(−2.35 ≤ z ≤ 2.35)
(b) Suppose that y ∼ N(6, 4). What percentage of the values of y will fall between 5 and 10?
2. (a) The nicotine content of a brand of cigarettes is normally distributed with a mean of 2.0 mg and a standard deviation of 0.25 mg. What is the probability that a cigarette will have nicotine content
i. of 1.65 mg or less?
ii. between 1.50 mg and 2.25 mg?
iii. of 2.18 mg or more?
(b) The weekly amount spent for maintenance and repairs in a certain company was observed, over a long period of time, to be approximately normally distributed with a mean of $400 and a standard deviation of $20.
i. If $450 is budgeted for the week, what is the probability that the actual costs will exceed the budgeted amount?
ii. How much should be budgeted for weekly repairs and maintenance so that the budgeted amount is exceeded with a probability of 0.1?
Central Limit Theorem
The Central Limit Theorem states that, under rather general conditions, sums and means of
random samples of measurements drawn from a population tend to have an approximately
normal distribution.
Let a random sample of n observations be selected from a population with mean µ and variance σ². The sampling distribution of the sample mean x̄ will be approximately normally distributed with mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$, provided n is sufficiently large.
Example
The mean height of African men (ages 20-29) is µ = 69.2 inches, with σ = 2.9 inches. Random samples of 60 such men are selected.
Find the
i. mean and standard deviation of the sampling distribution.
ii. probability that the sample mean height is greater than 70 inches.
Solution
µ = 69.2
σ = 2.9
The distribution of the means of samples of size 60 will be approximately normal, with $\mu_{\bar{x}} = \mu = 69.2$ and
$\sigma_{\bar{x}} = \dfrac{2.9}{\sqrt{60}} = 0.3744$
Find the z-score for a sample mean of 70:
$z = \dfrac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \dfrac{70 - 69.2}{0.3744} = 2.14$
P(x̄ > 70) = P(z > 2.14) = 1 − 0.9838 = 0.0162
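The steps of this solution can be sketched in a few lines, assuming the sample mean is treated as normal with mean µ and standard deviation σ/√n; variable names are illustrative. Because the code does not round z to two decimals before the look-up, the last digit may differ slightly from the table-based answer.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 69.2, 2.9, 60

# Standard deviation of the sampling distribution of the mean.
se = sigma / sqrt(n)

# z-score of a sample mean of 70 and the upper-tail probability.
z = (70 - mu) / se
p_greater = 1 - NormalDist(mu, se).cdf(70)

print(round(se, 4), round(z, 2), round(p_greater, 4))
```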
Example
a. If x and y are independent normal random variables with
E(x) = 1, Var(x) = 4, E(y) = 10 and Var(y) = 9,
determine the following:
i. E(2x + 3y) and Var(2x + 3y)
ii. P(2x + 3y < 40)
Solution
i. Given that x ∼ N(1, 4) and y ∼ N(10, 9), we let T = 2x + 3y.
Then
$\mu_T = E(T) = E(2x + 3y) = 2E(x) + 3E(y) = 2(1) + 3(10) = 32$
$\sigma_T^2 = Var(T) = Var(2x + 3y) = 2^2\,Var(x) + 3^2\,Var(y) = 4(4) + 9(9) = 97 = 9.85^2$
ii. From (i), T ∼ N(32, 97), so
$P(T < 40) = \Phi\left(\dfrac{40 - 32}{9.85}\right) = \Phi(0.81) = 0.7910$
Exercise
The mass of a biscuit is a normal random variable (x) with mean 50 grams and standard deviation 4 grams. A packet contains 20 biscuits, and the mass of the packaging material is also a normal random variable, with mean 100 grams and standard deviation 3 grams. Find the probability that the total mass of the packet
i. will exceed 1,047 grams
ii. lies between 1,050 and 1,200 grams
The Normal Approximation to Binomial
The normal distribution provides a good approximation to the binomial distribution when the number of trials n is large, the probability of a success in a trial p is not close to 0 or 1, and both np and np(1 − p) are greater than 5. The binomial random variable x then becomes approximately a normal random variable with mean µ = np and variance σ² = np(1 − p).
To improve upon the approximation, a continuity correction may be applied by adding 0.5 to, or subtracting 0.5 from, x to account for the fact that a discrete distribution is being approximated by a continuous distribution. In this case the standardized random variable becomes
$z = \dfrac{x \pm 0.5 - np}{\sqrt{np(1 - p)}}$
Example
Suppose that x has a binomial distribution with n = 200 and p = 0.4. Using the continuity correction, use the normal approximation to the binomial to find each of the following probabilities:
• P(x = 90)
• P(x ≤ 95)
• P(x > 65)
• P(x < 60)
• P(70 < x < 100)
Solution
µ = 200(0.4) = 80 and σ = √(200(0.4)(0.6)) = 6.9282
• P(x = 90) = P(89.5 ≤ x ≤ 90.5)
$= \Phi\left(\dfrac{90.5 - 80}{6.9282}\right) - \Phi\left(\dfrac{89.5 - 80}{6.9282}\right) = \Phi(1.52) - \Phi(1.37) = 0.9357 - 0.9147 = 0.0210$
• P(x ≤ 95) = P(x ≤ 95.5)
$= \Phi\left(\dfrac{95.5 - 80}{6.9282}\right) = \Phi(2.24) = 0.9875$
• P(x > 65) = 1 − P(x ≤ 65.5)
$= 1 - \Phi\left(\dfrac{65.5 - 80}{6.9282}\right) = 1 - \Phi(-2.09) = 1 - 0.0183 = 0.9817$
• P(x < 60) = P(x ≤ 59.5)
$= \Phi\left(\dfrac{59.5 - 80}{6.9282}\right) = \Phi(-2.96) = 0.0015$
• P(70 < x < 100) = P(70.5 ≤ x ≤ 99.5)
$= \Phi\left(\dfrac{99.5 - 80}{6.9282}\right) - \Phi\left(\dfrac{70.5 - 80}{6.9282}\right) = \Phi(2.81) - \Phi(-1.37) = 0.9975 - 0.0853 = 0.9122$
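To see how well the continuity-corrected approximation performs, it can be compared with the exact binomial probabilities. The following is a sketch for n = 200 and p = 0.4, assuming Python 3.8 or later; the helper names `binom_pmf` and `binom_cdf` are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 200, 0.4
approx = NormalDist(n * p, sqrt(n * p * (1 - p)))  # N(np, np(1 - p))

def binom_pmf(k):
    """Exact binomial probability P(x = k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(k):
    """Exact binomial probability P(x <= k)."""
    return sum(binom_pmf(i) for i in range(k + 1))

# P(x = 90): area between 89.5 and 90.5 under the normal curve.
approx_eq_90 = approx.cdf(90.5) - approx.cdf(89.5)
exact_eq_90 = binom_pmf(90)

# P(x <= 95): area to the left of 95.5.
approx_le_95 = approx.cdf(95.5)
exact_le_95 = binom_cdf(95)

print(round(approx_eq_90, 4), round(exact_eq_90, 4))
print(round(approx_le_95, 4), round(exact_le_95, 4))
```

The normal values printed here use unrounded z-scores, so they can differ in the last digit or two from answers read off a table.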
Exercise
A manufacturer of components for electric motors has found that about 10% of the production will not meet customer specifications. If 500 components are examined,
i. find the expected number of components which do not meet customer specifications.
ii. find the probability that 52 or more components do not meet customer specifications.
iii. find the probability that between 36 and 58 (inclusive) components do not meet customer specifications.
Confidence Intervals
Point Estimate
A point estimate is a single value estimate for a population parameter. The best point estimate of
the population mean πœ‡ is the sample mean π‘₯Μ… .
Interval Estimate
An interval estimate is an interval or range of values used to estimate a population parameter.
The level of confidence, c, is the probability that the interval estimate contains the population parameter.
Confidence Level
The confidence level of an interval estimate of a parameter is the probability that the interval
estimate will contain the parameter.
Maximum Error of Estimate
The maximum error of estimate E is the maximum likely difference between the point estimate of a parameter and the actual value of the parameter.
$E = z_c\, \sigma_{\bar{x}} = z_c \dfrac{\sigma}{\sqrt{n}}$
When n ≥ 30, the sample standard deviation, s, can be used for σ.
Confidence Intervals
A confidence interval is a specific interval estimate of a parameter determined by using data
obtained from a sample and by using the specific confidence level of the estimate.
A confidence interval for the population mean is
$\bar{x} - E < \mu < \bar{x} + E$
Example
The president of a large university wishes to estimate the average age of the students presently
enrolled. From past studies, the standard deviation is known to be 2 years. A sample of 50
students is selected, and the mean is found to be 23.2 years. Find the 95% confidence interval of
the population mean.
Solution
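Since the 95% confidence interval is desired, $z_{\alpha/2} = 1.96$. Hence, substituting in the formula
$\bar{x} - z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$
$23.2 - 1.96\left(\dfrac{2}{\sqrt{50}}\right) < \mu < 23.2 + 1.96\left(\dfrac{2}{\sqrt{50}}\right)$
22.6 < µ < 23.8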
Hence, the president can say, with 95% confidence, that the average age of the students is
between 22.6 and 23.8 years, based on 50 students.
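The interval can also be computed programmatically. The sketch below assumes σ is known and obtains the critical value from `statistics.NormalDist.inv_cdf` (Python 3.8 or later) instead of a table; the function name `z_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known."""
    # Critical value: leaves (1 - confidence)/2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    error = z * sigma / sqrt(n)  # maximum error of estimate E
    return xbar - error, xbar + error

low, high = z_interval(23.2, 2, 50, 0.95)
print(round(low, 1), round(high, 1))
```

Changing `confidence` to 0.99 or 0.90 gives the critical values 2.58 and 1.645 used elsewhere in these notes, up to rounding.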
Exercises
• A survey of 30 adults found that the mean age of a person's primary vehicle is 5.6 years. Assuming the standard deviation of the population is 0.8 year, find the 99% confidence interval of the population mean.
• The following data represent a sample of the assets (in millions of dollars) of 30 credit unions in southwestern Pennsylvania. Find the 90% confidence interval of the mean.
12.23 16.56 4.39 2.89 1.24 2.17 13.19 9.16 1.42 73.25 1.91
14.64 11.59 6.69 1.06 8.74 3.17 18.13 7.92 4.78 16.85 40.22
2.42 21.58 5.01 1.47 12.24 2.27 12.77 2.76
Formula for the Minimum Sample Size Needed for an Interval Estimate of the Population Mean
$n = \left(\dfrac{z_c\, \sigma}{E}\right)^2$
where E is the maximum error of estimate.
Example
The college president asks the statistics teacher to estimate the average age of the students at their college. How large a sample is necessary? The statistics teacher would like to be 99% confident that the estimate is accurate within 1 year. The standard deviation of the ages is known to be 3 years.
Solution
$n = \left(\dfrac{z_c\, \sigma}{E}\right)^2$
E = 1, $z_c$ = 2.58, σ = 3
$n = \left(\dfrac{2.58(3)}{1}\right)^2 = 59.9 \approx 60$
Therefore, to be 99% confident that the estimate is within 1 year of the true mean age, the
teacher needs a sample size of at least 60 students.
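A minimal sketch of the sample-size formula follows; the function name `min_sample_size` is an assumption, and the critical value is passed in as read from the table. The result is always rounded up, because rounding down would give a maximum error slightly larger than E.

```python
from math import ceil

def min_sample_size(z_c, sigma, error):
    """Smallest n with z_c * sigma / sqrt(n) <= error."""
    return ceil((z_c * sigma / error) ** 2)

n = min_sample_size(2.58, 3, 1)
print(n)
```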
Exercise
a. You want to estimate the mean one-way fare. How many fares must be included in your sample if you want to be 95% confident that the sample mean is within $2 of the population mean?
b. The growing seasons for a random sample of 35 U.S. cities were recorded, yielding a sample mean of 190.7 days and a sample standard deviation of 54.2 days. Estimate the true population mean of the growing season with 95% confidence.
c. How many cities' growing seasons would have to be sampled in order to estimate the true mean growing season with 95% confidence within 2 days? (Use a standard deviation of 54.2 days.)
d. A restaurant owner wishes to find the 99% confidence interval of the true mean cost of a dry martini. How large should the sample be if she wishes to be accurate within $0.10? A previous study showed that the standard deviation of the price was $0.12.
Confidence Intervals for the mean (Small samples)
If the distribution of a random variable x is normal, σ is unknown and n < 30, then the standardized sample mean $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$ follows a t-distribution with n − 1 degrees of freedom.
Degrees of freedom
They are the number of values that are free to vary after a sample statistic has been computed.
For example if the mean of 5 values is 10, then 4 of the 5 values are free to vary. But once 4
values are selected, the fifth value must be a specific number to get a sum of 50, since 50/5 = 10.
Hence, the degrees of freedom are 5 -1 = 4, and this value tells the researcher which t curve to
use.
Confidence interval for small samples
Maximum error of estimate: $E = t_c \dfrac{s}{\sqrt{n}}$
Formula for a Specific Confidence Interval for the Mean When σ Is Unknown and n < 30
$\bar{x} - t_c \dfrac{s}{\sqrt{n}} < \mu < \bar{x} + t_c \dfrac{s}{\sqrt{n}}$
The degrees of freedom are n − 1.
Example
Find the $t_{\alpha/2}$ value for a 95% confidence interval when the sample size is 22.
Solution
The d.f. = 22 − 1, or 21. Find 21 in the left column and 95% in the row labeled "Confidence intervals." The intersection where the two meet gives the value for $t_{\alpha/2}$, which is 2.080. See the figure below.
Figure: t table
Example
Ten randomly selected automobiles were stopped, and the tread depth of the right front tire was
measured. The mean was 0.32 inch, and the standard deviation was 0.08 inch. Find the 95%
confidence interval of the mean depth. Assume that the variable is approximately normally
distributed.
Solution
Since σ is unknown and s must replace it, the t distribution must be used for the 95% confidence interval. Hence, with 9 degrees of freedom, $t_c$ = 2.262:
$\bar{x} - t_c \dfrac{s}{\sqrt{n}} < \mu < \bar{x} + t_c \dfrac{s}{\sqrt{n}}$
$0.32 - (2.262)\left(\dfrac{0.08}{\sqrt{10}}\right) < \mu < 0.32 + (2.262)\left(\dfrac{0.08}{\sqrt{10}}\right)$
0.26 < µ < 0.38
Therefore, one can be 95% confident that the population mean tread depth of all right front tires
is between 0.26 and 0.38 inch based on a sample of 10 tires.
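The small-sample interval has the same shape as the large-sample one, with the t critical value in place of z. Python's standard library has no t-distribution, so this sketch assumes the critical value is read from the t table and passed in; the function name `t_interval` is hypothetical.

```python
from math import sqrt

def t_interval(xbar, s, n, t_c):
    """Confidence interval for the mean when sigma is unknown.

    t_c is the table value for n - 1 degrees of freedom.
    """
    error = t_c * s / sqrt(n)  # maximum error of estimate E
    return xbar - error, xbar + error

# Tread-depth example: n = 10, so 9 degrees of freedom and t_c = 2.262.
low, high = t_interval(0.32, 0.08, 10, 2.262)
print(round(low, 2), round(high, 2))
```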
Exercises
a. The data represent a sample of the number of home fires started by candles for the past several years. Find the 99% confidence interval for the mean number of home fires started by candles each year.
5460 5900 6090 6310 7160 8440 9930
b. The average hemoglobin reading for a sample of 20 teachers was 16 grams per 100 milliliters, with a sample standard deviation of 2 grams. Find the 99% confidence interval of the true mean.
c. A sample of 17 states had these cigarette taxes (in cents):
112 120 98 55 71 35 99 124 64 150 150 55 100 132 20 70 93
Find a 98% confidence interval for the mean cigarette tax in all 50 states.
d. The number of grams of carbohydrates in a 12-ounce serving of a regular soft drink is listed here for a random sample of sodas. Estimate the mean number of grams of carbohydrates in all brands of soda with 95% confidence.
48 37 52 40 43 46 41 38 41 45 45 33 35 52 45 41 30 34 46 40
Figure: Summary on when to use z or t distribution
Population Proportions
A proportion represents a part of a whole. The proportion of successes in a sample is given by
$\hat{p} = \dfrac{x}{n}$
where x is the number of sample units that possess the characteristic of interest and n is the sample size.
$\hat{q}$ is the point estimate for the proportion of failures, where
$\hat{q} = 1 - \hat{p}$
If np ≥ 5 and nq ≥ 5, the sampling distribution of $\hat{p}$ is approximately normal.
Confidence interval for population proportions
The maximum error of estimate, E, for the confidence interval is:
$E = z_c \sqrt{\dfrac{\hat{p}\hat{q}}{n}}$
The confidence interval for the population proportion, p, is
$\hat{p} - E < p < \hat{p} + E$
Example
A sample of 500 nursing applications included 60 from men. Find the 90% confidence interval of
the true proportion of men who applied to the nursing program.
Solution
$\hat{p} = \dfrac{60}{500} = 0.12$, $\hat{q} = 1 - 0.12 = 0.88$ and $z_c = 1.65$.
But
$\hat{p} - z_c \sqrt{\dfrac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_c \sqrt{\dfrac{\hat{p}\hat{q}}{n}}$
$0.12 - 1.65\sqrt{\dfrac{0.12(0.88)}{500}} < p < 0.12 + 1.65\sqrt{\dfrac{0.12(0.88)}{500}}$
0.096 < p < 0.144
Hence, one can be 90% confident that the percentage of applicants who are men is between 9.6%
and 14.4%.
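The proportion interval can be sketched the same way. The function name `proportion_interval` is an illustrative assumption, and the critical value 1.65 is the table value used in the example.

```python
from math import sqrt

def proportion_interval(x, n, z_c):
    """Confidence interval for a population proportion."""
    p_hat = x / n
    q_hat = 1 - p_hat
    error = z_c * sqrt(p_hat * q_hat / n)  # maximum error of estimate E
    return p_hat - error, p_hat + error

# Nursing applications: 60 men among 500 applicants, 90% confidence.
low, high = proportion_interval(60, 500, 1.65)
print(round(low, 3), round(high, 3))
```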
Note
If no approximation of $\hat{p}$ is known, one should use $\hat{p}$ = 0.5.
Exercises
• In a study of 1907 fatal traffic accidents, 449 were alcohol related. Construct a 99% confidence interval for the proportion of fatal traffic accidents that are alcohol related.
• A survey of 200,000 boat owners found that 12% of the pleasure boats were named Serenity. Find the 95% confidence interval of the true proportion of boats named Serenity.
• A survey found that out of 200 workers, 168 said they were interrupted three or more times an hour by phone messages, faxes, etc. Find the 90% confidence interval of the population proportion of workers who are interrupted three or more times an hour.
Minimum Sample Size
If you have a preliminary estimate for p and q, the minimum sample size needed to estimate p, given a confidence level and a maximum error of estimate, is
$n = \hat{p}\hat{q}\left(\dfrac{z_c}{E}\right)^2$
Example
You wish to estimate the proportion of fatal accidents that are alcohol related at a 99% level of confidence. Find the minimum sample size needed to be accurate to within 2% of the population proportion. Use an estimate of p = 0.235.
Solution
$n = \hat{p}\hat{q}\left(\dfrac{z_c}{E}\right)^2$
$n = (0.235)(0.765)\left(\dfrac{2.575}{0.02}\right)^2 = 2980.05$
Rounding up, with this preliminary estimate you need a sample of at least n = 2981.
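The sample-size calculation for a proportion can be sketched as follows; the function name `min_sample_size_proportion` is hypothetical. The second call illustrates the note above: with no preliminary estimate, p̂ = 0.5 gives the largest (most conservative) sample size.

```python
from math import ceil

def min_sample_size_proportion(p_hat, z_c, error):
    """Smallest n with z_c * sqrt(p_hat * q_hat / n) <= error."""
    return ceil(p_hat * (1 - p_hat) * (z_c / error) ** 2)

n_with_estimate = min_sample_size_proportion(0.235, 2.575, 0.02)
n_conservative = min_sample_size_proportion(0.5, 2.575, 0.02)
print(n_with_estimate, n_conservative)
```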
Exercise
• A researcher wishes to estimate, with 95% confidence, the proportion of people who own a home computer. A previous study shows that 40% of those interviewed had a computer at home. The researcher wishes to be accurate within 2% of the true proportion. Find the minimum sample size necessary.
• The Gallup Poll found that 27% of adults surveyed nationwide said they had personally been in a tornado. How many adults should be surveyed to estimate the true proportion of adults who have been in a tornado with a 95% confidence interval 5% wide?
• A researcher wishes to estimate the proportion of executives who own a car phone. She wants to be 90% confident and be accurate within 5% of the true proportion. Find the minimum sample size necessary.
Confidence Intervals for Variance and Standard Deviation
To calculate confidence intervals for the variance and standard deviation, a new statistical
distribution is needed. It is called the chi-square distribution. The chi-square variable is similar to
the t variable in that its distribution is a family of curves based on the number of degrees of
freedom. The symbol for chi-square is $\chi^2$.
Confidence Intervals for Variance
$$\frac{(n-1)s^2}{\chi^2_R} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_L}$$
Degrees of freedom = n − 1
Example
Find the values of $\chi^2_{right}$ and $\chi^2_{left}$ for a 90% confidence interval when n = 25.
Solution
When the sample size is 25, there are 24 degrees of freedom.
For $\chi^2_{right}$, the area to the right is $\frac{1 - 0.9}{2} = 0.05$, so $\chi^2_{right} = 36.415$.
For $\chi^2_{left}$, the area to the right is $\frac{1 + 0.9}{2} = 0.95$, so $\chi^2_{left} = 13.848$.
Figure: $\chi^2$ distribution for the example above
Example
Find the 95% confidence interval for the variance of the nicotine content of cigarettes
manufactured if a sample of 20 cigarettes has a standard deviation of 1.6 milligrams.
Solution
Since $\alpha = 0.05$, the two critical values for the 0.025 and 0.975 levels with 19
degrees of freedom are 32.852 and 8.907, respectively.
$$\frac{(n-1)s^2}{\chi^2_R} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_L}$$
$$\frac{(20-1)(1.6)^2}{32.852} < \sigma^2 < \frac{(20-1)(1.6)^2}{8.907}$$
$$1.5 < \sigma^2 < 5.5$$
Hence, one can be 95% confident that the true variance of the nicotine content is between 1.5
and 5.5.
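The interval above can be reproduced with a short Python sketch. The function name `variance_ci` is illustrative, and the two chi-square critical values are assumed to be read from a table, as in the example; taking square roots of the endpoints gives the interval for the standard deviation.

```python
import math

def variance_ci(n, s, chi2_right, chi2_left):
    """Confidence interval for the variance: (n-1)s^2/chi2_R < sigma^2 < (n-1)s^2/chi2_L."""
    sum_sq = (n - 1) * s ** 2
    return sum_sq / chi2_right, sum_sq / chi2_left

# Nicotine example: n = 20, s = 1.6, table values for 19 d.f. at 0.025 and 0.975
var_low, var_high = variance_ci(n=20, s=1.6, chi2_right=32.852, chi2_left=8.907)
# The interval for the standard deviation is the square root of each endpoint.
sd_low, sd_high = math.sqrt(var_low), math.sqrt(var_high)
print(f"{var_low:.1f} < variance < {var_high:.1f}")
print(f"{sd_low:.1f} < std dev < {sd_high:.1f}")
```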
Exercises
• Find the 90% confidence interval for the variance and standard deviation for the price in
dollars of an adult single-day beach ticket. The data represent a selected sample of
nationwide beach resorts. Assume the variable is normally distributed.
59 54 53 52 51 39 49 46 49 48
• Find the 99% confidence interval for the variance and standard deviation of the weights
of 25 one-gallon containers of motor oil if a sample of 14 containers has a variance of
3.2. The weights are given in ounces. Assume the variable is normally distributed.
• A random sample of stock prices per share (in dollars) is shown. Find the 90%
confidence interval for the variance and standard deviation for the prices. Assume the
variable is normally distributed.
26.69 13.88 28.37 75.37 7.50 47.50 3.81 53.81 13.62 6.94
28.25 28.00 40.25 10.87 46.12 12.00 43.00 45.12 60.50 14.75
Hypothesis Testing
A statistical hypothesis is a claim about a population.
Null Hypothesis
It is denoted by $H_0$. It contains a statement of equality such as ≥, =, or ≤. The null hypothesis is
assumed to be true unless there is strong evidence to the contrary, similar to how a person is
assumed to be innocent until proven guilty.
Alternative Hypothesis
It is denoted by $H_a$. It contains a statement of strict inequality such as <, ≠, or >.
Example
• A hospital claims its ambulance response time is less than 10 minutes.
Solution
$H_0: \mu \ge 10$ min
$H_a: \mu < 10$ min
• A consumer magazine claims the proportion of cell phone calls made during evenings
and weekends is at most 60%.
Solution
$H_0: p \le 0.60$
$H_a: p > 0.60$
Errors and Level of Significance
Type I error
We reject the null hypothesis when it is true. The probability of a Type I error is $\alpha$.
Type II error
We fail to reject the null hypothesis when it is false. The probability of a Type II error is $\beta$.
Level of Significance ($\alpha$)
The maximum allowable probability of committing a Type I error.
One-tailed and two-tailed tests
One-tailed test
Indicates that the null hypothesis should be rejected when the test value is in the critical region
on one side.
Left-tailed test
When the critical region is on the left side of the distribution of the test value.
The alternative hypothesis is $H_a: \mu < \text{value}$
Figure: Left-tailed test
Right-tailed test
When the critical region is on the right side of the distribution of the test value.
The alternative hypothesis is $H_a: \mu > \text{value}$
Figure: Right-tailed test
Two-tailed test
The null hypothesis should be rejected when the test value is in either of two critical regions,
one on each side of the distribution of the test value.
The alternative hypothesis for a two-tailed test is $H_a: \mu \ne \text{value}$
Figure: Two-tailed test
P-Value
The probability of observing a test statistic at least as extreme as the one computed
from the sample, given that the null hypothesis is true.
Finding P-values: one-tailed test
The test statistic for a right-tailed test is z = 1.56. Find the P-value.
The area to the right of z = 1.56 is 1 − 0.9406 = 0.0594.
The P-value is 0.0594.
Finding P-values: two-tailed test
The test statistic for a two-tailed test is z = −2.63. Find the corresponding P-value.
The area to the left of z = −2.63 is 0.0043.
The P-value is 2(0.0043) = 0.0086.
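The table look-ups in the two P-value examples can be reproduced in Python. This is a sketch that assumes Python 3.8 or later (for `statistics.NormalDist`); the function name `p_value` and the tail labels are illustrative. Because it uses the exact normal distribution rather than a four-decimal table, the two-tailed result differs slightly from 0.0086 in the last digit.

```python
from statistics import NormalDist

def p_value(z, tail):
    """P-value of a standardized test statistic z for a 'left', 'right' or 'two' tailed test."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        return std_normal.cdf(z)
    if tail == "right":
        return 1.0 - std_normal.cdf(z)
    if tail == "two":
        # Double the area in the tail beyond |z|.
        return 2.0 * std_normal.cdf(-abs(z))
    raise ValueError("tail must be 'left', 'right' or 'two'")

p_right = p_value(1.56, "right")   # right-tailed example
p_two = p_value(-2.63, "two")      # two-tailed example
print(round(p_right, 4), round(p_two, 4))
```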
Test Decisions with P-values
The decision about whether there is enough evidence to reject the null hypothesis can be made
by comparing the P-value to $\alpha$, the level of significance of the test.
• If $P \le \alpha$, reject the null hypothesis.
• If $P > \alpha$, fail to reject the null hypothesis.
Example
• If the P-value of a hypothesis test is 0.0749, at a 0.05 level of significance, we fail to
reject $H_0$ since $P > \alpha$.
• If the P-value of a hypothesis test is 0.0245, at a 0.05 level of significance, we reject $H_0$ since
$P \le \alpha$.
Using the P-value to make test decisions
• Write the null and alternative hypotheses
• State the level of significance
• Identify the sampling distribution
• Find the test statistic and standardize it
• Calculate the P-value for the test statistic
• Make your decision
• Interpret your decision
Hypothesis Testing for the Mean (n ≥ 30)
The z-Test for a Mean
The z-test is a statistical test for a population mean. The z-test can be used
• if the population is normal and $\sigma$ is known, or
• when the sample size, n, is at least 30.
The test statistic is the sample mean $\bar{x}$ and the standardized test statistic is z:
$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \quad \text{where} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
Example (1)
A cereal company claims the mean sodium content in one serving of its cereal is no more than
230 mg. You work for a national health service and are asked to test this claim. You find that a
random sample of 52 servings has a mean sodium content of 232 mg and a standard deviation of
10 mg. At $\alpha = 0.05$, do you have enough evidence to reject the company's claim?
Solution
• Write the null and alternative hypotheses.
$H_0: \mu \le 230$ mg
$H_a: \mu > 230$ mg
• State the level of significance. $\alpha = 0.05$
• Determine the sampling distribution.
Since the sample size is at least 30, the sampling distribution is normal.
• Find the test statistic and standardize it.
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{52}} = 1.387$$
$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{232 - 230}{1.387} = 1.44$$
• Calculate the P-value for the test statistic.
Since this is a right-tailed test, the P-value is the area to the right of z = 1.44 in the
normal distribution.
From the table, P = 1 − 0.9251 = 0.0749
• Make your decision.
Compare the P-value to $\alpha$.
Since 0.0749 > 0.05, fail to reject $H_0$.
• Interpret your decision.
There is not enough evidence to reject the claim that the mean sodium content of one
serving of its cereal is no more than 230 mg.
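The steps of Example (1) can be sketched in Python as follows, assuming Python 3.8 or later for `statistics.NormalDist`. The function name `z_test_right_tailed` is illustrative. The sample standard deviation is used in place of $\sigma$ because n ≥ 30, as in the example, and the P-value comes out near 0.0746 rather than 0.0749 because z is not rounded to 1.44 first.

```python
import math
from statistics import NormalDist

def z_test_right_tailed(x_bar, mu0, sigma, n):
    """Return (z, P-value) for a right-tailed z-test of H0: mu <= mu0."""
    std_error = sigma / math.sqrt(n)        # sigma_xbar = sigma / sqrt(n)
    z = (x_bar - mu0) / std_error           # standardized test statistic
    p = 1.0 - NormalDist().cdf(z)           # area to the right of z
    return z, p

alpha = 0.05
z, p = z_test_right_tailed(x_bar=232, mu0=230, sigma=10, n=52)
decision = "reject H0" if p <= alpha else "fail to reject H0"
print(round(z, 2), round(p, 4), decision)
```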
Rejection Regions
The set of values for the test statistic that leads to rejection of H0
Critical Values
The values of the test statistic that separate the rejection and non-rejection regions.
Using the critical value to make test decisions
• Write the null and alternative hypotheses
• State the level of significance
• Identify the sampling distribution
• Find the critical value
• Find the rejection region
• Find the test statistic and standardize it
• Make your decision
• Interpret your decision
Example
From Example (1), the critical value at $\alpha = 0.05$ is 1.645, so the rejection region is z > 1.645.
The standardized test statistic is
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{52}} = 1.387$$
$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{232 - 230}{1.387} = 1.44$$
Since z = 1.44 does not fall in the rejection region, we fail to reject $H_0$.
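The critical-value decision can also be checked numerically. The sketch below assumes Python 3.8 or later and uses `NormalDist.inv_cdf` to obtain the right-tailed critical value instead of reading it from a table; the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
# For a right-tailed test the critical value cuts off an area alpha in the right tail.
z_critical = NormalDist().inv_cdf(1 - alpha)

z = 1.44  # standardized test statistic from Example (1)
in_rejection_region = z > z_critical
print(round(z_critical, 3), in_rejection_region)
```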
Hypothesis Testing for the Mean (n < 30)
The t Sampling Distribution
Find the critical value $t_0$ for a left-tailed test given $\alpha = 0.01$ and n = 18.
d.f. = 18 − 1 = 17
$t_0 = -2.567$
Find the critical values $-t_0$ and $t_0$ for a two-tailed test given $\alpha = 0.05$ and n = 11.
d.f. = 11 − 1 = 10
$-t_0 = -2.228$ and $t_0 = 2.228$
Example
A university says the mean number of classroom hours per week for full-time faculty is 11.0. A
random sample of the number of classroom hours for full-time faculty for one week is listed
below. You work for a student organization and are asked to test the claim. At $\alpha = 0.01$, do you
have enough evidence to reject the university's claim?
11.8 8.6 12.6 7.9 6.4 10.4 13.6 9.1
Solution
• Write the null and alternative hypotheses.
$H_0: \mu = 11.0$
$H_a: \mu \ne 11.0$
• State the level of significance. $\alpha = 0.01$
• Determine the sampling distribution.
Since the sample size is 8, the sampling distribution is a t-distribution with 8 − 1 = 7
degrees of freedom.
Since $H_a$ contains the ≠ symbol, this is a two-tailed test.
• Find the critical values.
Since it is a two-tailed test, use $\alpha/2 = 0.005$ in each tail: $-t_0 = -3.499$ and $t_0 = 3.499$.
• Find the rejection region.
The rejection region is t < −3.499 or t > 3.499.
• Find the test statistic and standardize it.
n = 8, $\bar{x} = 10.05$, s = 2.485, $\mu = 11.0$
$$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{2.485}{\sqrt{8}} = 0.87858$$
$$t = \frac{\bar{x} - \mu}{s_{\bar{x}}} = \frac{10.05 - 11.0}{0.87858} = -1.08$$
• Make your decision.
t = −1.08 does not fall in the rejection region, so fail to reject $H_0$ at $\alpha = 0.01$.
• Interpret your decision.
There is not enough evidence to reject the university's claim that full-time faculty spend a
mean of 11 classroom hours per week.
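The sample statistics and the t value in this example can be verified with the Python standard library. In this sketch the critical value 3.499 is taken from the t table, as in the example (the standard library has no t-distribution), and the variable names are illustrative.

```python
import math
from statistics import mean, stdev

hours = [11.8, 8.6, 12.6, 7.9, 6.4, 10.4, 13.6, 9.1]  # sample from the example
mu0 = 11.0          # value claimed under H0
t_critical = 3.499  # table value: two-tailed, alpha = 0.01, 7 degrees of freedom

n = len(hours)
x_bar = mean(hours)
s = stdev(hours)                          # sample standard deviation (divides by n - 1)
t = (x_bar - mu0) / (s / math.sqrt(n))    # standardized test statistic

reject = abs(t) > t_critical              # two-tailed rejection region
print(round(x_bar, 2), round(s, 3), round(t, 2), reject)
```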
Hypothesis Testing for Proportions
p is the population proportion of successes. The test statistic is
$\hat{p} = \frac{x}{n}$, the proportion of sample successes.
If $np \ge 5$ and $nq \ge 5$, the sampling distribution of $\hat{p}$ is approximately normal.
Test Statistic
The standardized test statistic is
$$z = \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}}$$
Example
A communications industry spokesperson claims that over 40% of Africans either own a cellular
phone or have a family member who does. In a random survey of 1036 Africans, 456 said they or
a family member owned a cellular phone. Test the spokesperson's claim at $\alpha = 0.05$.
Solution
• Write the null and alternative hypotheses.
$H_0: p \le 0.40$
$H_a: p > 0.40$
• State the level of significance. $\alpha = 0.05$
• Determine the sampling distribution.
1036(0.40) > 5 and 1036(0.60) > 5, so the sampling distribution is normal.
• Find the critical value.
Critical value = 1.645, so the rejection region is z > 1.645.
• Find the test statistic and standardize it.
n = 1036, x = 456
$$\hat{p} = \frac{x}{n} = \frac{456}{1036} = 0.44$$
$$z = \frac{0.44 - 0.40}{\sqrt{\dfrac{(0.40)(0.60)}{1036}}} = \frac{0.04}{0.01522} = 2.63$$
• Make your decision.
z = 2.63 falls in the rejection region, so reject $H_0$.
• Interpret your decision.
There is enough evidence to support the claim that over 40% of Africans own a cellular phone
or have a family member who does.
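A short Python sketch of the proportion test follows. The function name `proportion_z` is illustrative and the critical value 1.645 is the table value from the example. The code keeps $\hat{p}$ unrounded (about 0.4402), so z comes out near 2.64 instead of the 2.63 obtained with $\hat{p}$ rounded to 0.44; the decision is the same.

```python
import math

def proportion_z(x, n, p0):
    """Return (p_hat, z) for a test of a proportion against the hypothesized value p0."""
    p_hat = x / n
    q0 = 1.0 - p0
    # The normal approximation requires n*p0 >= 5 and n*q0 >= 5.
    assert n * p0 >= 5 and n * q0 >= 5, "normal approximation not appropriate"
    std_error = math.sqrt(p0 * q0 / n)
    return p_hat, (p_hat - p0) / std_error

p_hat, z = proportion_z(x=456, n=1036, p0=0.40)
z_critical = 1.645            # right-tailed, alpha = 0.05 (table value)
reject = z > z_critical
print(round(p_hat, 4), round(z, 2), reject)
```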
Hypothesis Testing for Variance and Standard Deviation
$s^2$ is the test statistic for the population variance. Its sampling distribution is a $\chi^2$ distribution with
n − 1 degrees of freedom.
Test Statistic
The standardized test statistic is
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$
Example
A state administrator says that the standard deviation of test scores for 8th grade students who
took a life-science assessment test is less than 30. You work for the administrator and are asked
to test this claim. You find that a random sample of 10 tests has a standard deviation of 28.8. At
$\alpha = 0.01$, do you have enough evidence to support the administrator's claim? Assume test
scores are normally distributed.
Solution
• Write the null and alternative hypotheses.
$H_0: \sigma \ge 30$
$H_a: \sigma < 30$
• State the level of significance. $\alpha = 0.01$
• Determine the sampling distribution.
The sampling distribution is $\chi^2$ with 10 − 1 = 9 degrees of freedom.
• Find the critical value.
Critical value = 2.088. This is a left-tailed test, so the rejection region is $\chi^2 < 2.088$.
• Find the test statistic.
n = 10, s = 28.8
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{(10-1)(28.8)^2}{30^2} = 8.2944$$
• Make your decision.
$\chi^2 = 8.2944$ does not fall in the rejection region, so fail to reject $H_0$.
• Interpret your decision.
There is not enough evidence to support the administrator's claim that the standard
deviation is less than 30.
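The test statistic and decision in this example can be checked with a few lines of Python. The function name `chi2_statistic` is illustrative and the critical value 2.088 is the table value used in the example.

```python
def chi2_statistic(n, s, sigma0):
    """Standardized test statistic (n - 1) s^2 / sigma0^2 for a test about a variance."""
    return (n - 1) * s ** 2 / sigma0 ** 2

chi2 = chi2_statistic(n=10, s=28.8, sigma0=30)
chi2_critical = 2.088          # left-tailed, alpha = 0.01, 9 degrees of freedom (table value)
reject = chi2 < chi2_critical  # left-tailed rejection region
print(round(chi2, 4), reject)
```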
Correlation and Regression
Correlation
A relationship between two variables.
Types of correlation
• Negative correlation
• Positive correlation
• No linear correlation
Correlation Coefficient
A measure of the strength and direction of a linear relationship between two variables.
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$
The range of r is from −1 to 1.
• If r is close to −1, there is a strong negative linear correlation.
• If r is close to 0, there is no linear correlation.
• If r is close to 1, there is a strong positive linear correlation.
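As an illustration of the formula for r, the sketch below computes the correlation coefficient from raw data. The function name `pearson_r` is illustrative; the data are the absences and final grades used elsewhere in these notes, for which the notes quote r = −0.975.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from the sums in the formula for r."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]
r = pearson_r(absences, grades)
print(round(r, 3))
```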
Exercise
Fit a regression line and find the correlation between absences and final grade in the data below.
Absences (x)   Final Grade (y)
8              78
2              92
5              90
12             58
15             43
9              74
6              81
Hypothesis Test for Significance
r is the correlation coefficient for the sample. The correlation coefficient for the population is $\rho$
(rho).
For a two-tailed test for significance:
$H_0: \rho = 0$ (The correlation is not significant)
$H_a: \rho \ne 0$ (The correlation is significant)
The sampling distribution for r is a t-distribution with n − 2 d.f.
The standardized test statistic is
$$t = \frac{r - 0}{\sigma_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
Example
The correlation between the number of times absent and a final grade is r = −0.975. There were
seven pairs of data. Test the significance of this correlation. Use $\alpha = 0.01$.
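One way to carry out this test is sketched below in Python. The function name `correlation_t` is illustrative, and the critical value 4.032 is an assumption taken from a standard t table (two-tailed, $\alpha = 0.01$, 5 degrees of freedom); it does not appear in the notes.

```python
import math

def correlation_t(r, n):
    """Standardized test statistic t = r / sqrt((1 - r^2) / (n - 2))."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

r, n = -0.975, 7
t = correlation_t(r, n)
t_critical = 4.032             # assumed table value: two-tailed, alpha = 0.01, d.f. = n - 2 = 5
reject = abs(t) > t_critical   # two-tailed rejection region
print(round(t, 2), reject)
```

Under these assumptions |t| is far larger than the critical value, so $H_0: \rho = 0$ would be rejected and the correlation judged significant.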
Exercises
The height and weight of 10 boys were measured and the results are given in the table below.
Find the correlation coefficient between the height (cm) and the weight (kg). Is the correlation
significant at 5%?
Wt (kg)   38   39   43   44   35   32   31   42   49   41
Ht (cm)   150  152  146  158  142  144  135  145  155  150
In each of the following, find the correlation coefficient between y and x and test its
significance at 5% from the information given:
        n    $\sum x$   $\sum y$   $\sum x^2$   $\sum y^2$   $\sum xy$
i)      12   129        63         1500         800          700
ii)     9    56         76         500          830          620
iii)    10   69         436        585          31218        1858
Linear Regression
Once you know there is a significant linear correlation, you can write an equation describing the
relationship between the x and y variables. This equation is called the line of regression or least
squares line.
The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the
y-intercept.
The line of regression is
$$\hat{y} = mx + b$$
The slope m is
$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$
The intercept is
$$b = \bar{y} - m\bar{x}$$
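The slope and intercept formulas translate directly into code. The following sketch uses a function name (`least_squares_line`) and a small data set that are both hypothetical, chosen so that the points lie exactly on the line y = 2x + 1 and the result is easy to check by hand.

```python
def least_squares_line(xs, ys):
    """Return slope m and intercept b of the regression line y_hat = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n      # b = y_bar - m * x_bar
    return m, b

# Hypothetical data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b = least_squares_line(xs, ys)
y_hat = m * 6 + b                      # predicted y when x = 6
print(m, b, y_hat)
```

When only the sums are given, as in summary-statistic exercises, the same two formulas can be applied to the sums directly without the raw data.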
Exercises
1. Find the line of regression of y on x from the following values:
• n = 5, $\sum x = 10$, $\sum x^2 = 30$, $\sum y = 13.1$, $\sum y^2 = 54.41$, $\sum xy = 40.3$
• n = 6, $\sum x = 15$, $\sum x^2 = 55$, $\sum y = 6.4$, $\sum y^2 = 8.06$, $\sum xy = 20.4$
• n = 5, $\sum x = 20$, $\sum x^2 = 90$, $\sum y = 18.7$, $\sum y^2 = 75.77$, $\sum xy = 82.3$
• n = 5, $\sum x = 15$, $\sum x^2 = 55$, $\sum y = 77$, $\sum y^2 = 1503$, $\sum xy = 177$
2. For each of the situations in question 1, estimate the value of y when x = 10.
Measures of Regression and Correlation
The Coefficient of Determination
The coefficient of determination, $r^2$, is the ratio of the explained variation in y to the total variation
in y.
$$r^2 = \frac{\text{Explained variation}}{\text{Total variation}}$$
The correlation coefficient of number of times absent and final grade is r = −0.975. The
coefficient of determination is
$$r^2 = (-0.975)^2 = 0.9506$$
Interpretation
About 95% of the variation in final grades can be explained by the number of times a student is
absent. The other 5% is unexplained and can be due to sampling error or other variables such as
intelligence, amount of time studied, etc.
The Standard Error of Estimate
The standard error of estimate, $s_e$, is the standard deviation of the observed $y_i$ values about the
predicted values $\hat{y}_i$.
$$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$
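To make the two measures concrete, the sketch below computes $r^2$ and $s_e$ for a small hypothetical data set (the data and the function name `regression_measures` are illustrative, not from the notes). It fits the least squares line first, then compares explained and total variation.

```python
import math

def regression_measures(xs, ys):
    """Return (r_squared, s_e) for the least squares line fitted to xs, ys."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Least squares slope and intercept.
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - m * x_bar
    predicted = [m * x + b for x in xs]
    explained = sum((p - y_bar) ** 2 for p in predicted)         # explained variation
    total = sum((y - y_bar) ** 2 for y in ys)                    # total variation
    unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    r_squared = explained / total
    s_e = math.sqrt(unexplained / (n - 2))                       # standard error of estimate
    return r_squared, s_e

# Hypothetical data for illustration only
r_squared, s_e = regression_measures([1, 2, 3, 4], [2, 4, 5, 8])
print(round(r_squared, 4), round(s_e, 4))
```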
Chi-Square Test of Goodness of Fit
A chi-square distribution is skewed to the right and is not symmetric.
The value of $\chi^2$ is always greater than or equal to 0.
A chi-square goodness-of-fit test is used to test whether a frequency distribution fits an
expected (claimed) distribution.
Chi-Square Test
If the observed frequencies are obtained from a random sample and each expected frequency is
at least 5, the sampling distribution for the goodness-of-fit test is a chi-square distribution with k − 1
degrees of freedom (where k = the number of categories).
The test statistic is
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
O = observed frequency in each category
E = expected frequency in each category
Example
A social service organization claims 50% of all marriages are the first marriage for both bride
and groom, 12% are first for the bride only, 14% are first for the groom only, and 24% are a
remarriage for both. The results of a study of 103 randomly selected married couples are listed
in the table. Test the distribution claimed by the agency. Use $\alpha = 0.01$.
First Marriage     f
Bride and Groom    55
Bride only         12
Groom only         12
Neither            24
Solution
• Write the null and alternative hypotheses.
$H_0$: The distribution of first-time marriages is 50% for both bride and groom, 12% for
bride only, 14% for groom only, and 24% remarriages for both.
$H_a$: The distribution of first-time marriages differs from the claimed distribution.
• State the level of significance. $\alpha = 0.01$
• Determine the sampling distribution.
A chi-square distribution with 4 − 1 = 3 d.f.
• Find the critical value.
Critical value = 11.34, so the rejection region is $\chi^2 > 11.34$.
• Find the test statistic.
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
First Marriage     %     O     E       $(O-E)^2$   $(O-E)^2/E$
Bride and Groom    50    55    51.5    12.25       0.2379
Bride only         12    12    12.36   0.1296      0.0105
Groom only         14    12    14.42   5.8564      0.4061
Neither            24    24    24.72   0.5184      0.0210
Total              100   103   103                 0.6755
$\chi^2 = 0.6755$
• Make your decision.
The test statistic 0.6755 does not fall in the rejection region, so we fail to reject $H_0$.
• Interpret your decision.
There is not enough evidence to reject the claimed distribution; the sample is consistent with
the distribution specified for first-time marriages.
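The table of expected frequencies and the test statistic can be reproduced with a short Python sketch. The function name `chi_square_gof` is illustrative and the critical value 11.34 is the table value used in the example.

```python
def chi_square_gof(observed, claimed_proportions):
    """Return (expected frequencies, chi-square statistic) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in claimed_proportions]   # E = n * claimed proportion
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, chi2

observed = [55, 12, 12, 24]            # bride and groom, bride only, groom only, neither
claimed = [0.50, 0.12, 0.14, 0.24]     # claimed distribution
expected, chi2 = chi_square_gof(observed, claimed)

chi2_critical = 11.34                  # alpha = 0.01, 3 degrees of freedom (table value)
reject = chi2 > chi2_critical
print([round(e, 2) for e in expected], round(chi2, 4), reject)
```

The same function applies to any goodness-of-fit problem with known claimed proportions, provided every expected frequency is at least 5.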
Exercise
A die is rolled 120 times. The results are given in the table below.
Check whether or not it is biased.
Number 1 2 3 4 5 6
Frequency 14 13 21 16 27 29
Comparing Two Variances
Two Sample Test for Variances
To compare population variances $\sigma_1^2$ and $\sigma_2^2$, use the F-distribution.
Let $s_1^2$ and $s_2^2$ represent the sample variances of samples from two different populations. If both
populations are normal and the population variances $\sigma_1^2$ and $\sigma_2^2$ are equal, then the sampling
distribution of
$$F = \frac{s_1^2}{s_2^2}$$
is called an F-distribution. $s_1^2$ always represents the larger of the two sample variances.
Analysis of Variance
One Way Analysis of Variance (ANOVA)
This is a hypothesis testing technique that is used to compare the means of three or more
populations.
$H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$ (All population means are equal.)
$H_a$: At least one of the means is different from the others.
The variance is calculated in two different ways and the ratio of the two values is formed.
$$F = \frac{MSB}{MSW}$$
MSB, Mean Square Between, the variance between samples, measures the differences related to
the treatment given to each sample.
MSW, Mean Square Within, the variance within samples, measures the differences related to
entries within the same sample. The variance within samples is due to sampling error.
Mean Square Between
Each group is given a different "treatment". The variation from the grand mean (the mean of all
values in all groups) is measured. The treatment (or factor) is the variable that
distinguishes members of one sample from another.
First calculate SSB and divide by k − 1, the degrees of freedom
(k = the number of treatments or factors).
$$SSB = \sum n_i(\bar{x}_i - \bar{x})^2$$
$$MSB = \frac{SSB}{k-1} = \frac{\sum n_i(\bar{x}_i - \bar{x})^2}{k-1}$$
Mean Square Within
Calculate SSW and divide by N − k, the degrees of freedom (N = the total number of observations).
$$SSW = \sum (n_i - 1)s_i^2$$
$$MSW = \frac{SSW}{N-k} = \frac{\sum (n_i - 1)s_i^2}{N-k}$$
If MSB is close in value to MSW, the variation is not attributed to different effects of the
treatments on the variable, and the ratio of the two measures (the F-ratio) is close to 1.
If MSB is significantly greater than MSW, the variation is probably due to differences in the
treatments or factors, and the F-ratio will be significantly greater than 1.
Example
The table below shows the annual amount spent on reading (in $) for a random sample of
consumers from four regions. At $\alpha = 0.10$, can you conclude that the mean annual amounts
spent are different?
Northeast   Midwest   South   West
308         246       103     223
58          169       143     184
141         246       164     221
109         158       119     269
220         167       99      199
144         76        214     171
316                   108     204
Solution
• Write the null and alternative hypotheses.
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
$H_a$: At least one of the means is different from the others.
• State the level of significance. $\alpha = 0.10$
• Determine the sampling distribution.
An F-distribution with $d.f._{numerator} = k - 1 = 3$ and $d.f._{denominator} = N - k = 23$
• Find the critical value.
The critical value is 2.34, so the rejection region is F > 2.34.
• Find the test statistic.
$$F = \frac{MSB}{MSW}$$
The sample size, mean, and variance for each region are:
Region      n    $\bar{x}_i$   $s_i^2$
Northeast   7    185.14        9838.66
Midwest     6    177.00        4050.05
South       7    135.71        1741.39
West        7    210.14        1020.80
The grand mean is
$$\bar{x} = \frac{4779}{27} = 177$$
Mean Square Between
$$SSB = \sum n_i(\bar{x}_i - \bar{x})^2$$
Mean      $n_i$   $(\bar{x}_i - \bar{x})^2$   $n_i(\bar{x}_i - \bar{x})^2$
185.14    7       66.26                       463.8
177.00    6       0.00                        0.0
135.71    7       1704.86                     11934.0
210.14    7       1098.26                     7687.6
$$MSB = \frac{SSB}{k-1} = \frac{20086}{3} = 6695.33$$
Mean Square Within
$$SSW = \sum (n_i - 1)s_i^2$$
$n_i$   $s_i^2$     $(n_i - 1)s_i^2$
7       9838.66     59031.9
6       4050.05     20250.2
7       1741.39     10448.4
7       1020.80     6124.8
$$MSW = \frac{SSW}{N-k} = \frac{95855}{23} = 4167.61$$
$$F = \frac{MSB}{MSW} = \frac{6695.33}{4167.61} = 1.61$$
• Make your decision.
Since F = 1.61 is less than the critical value 2.34, it does not fall in the rejection region, so
fail to reject the null hypothesis.
• Interpret your decision.
There is not enough evidence to support the claim that the means are not equal. The data do not
show that mean reading expenses differ among the regions.
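The whole ANOVA computation can be checked from the raw data with the Python standard library. In this sketch the function name `one_way_anova` is illustrative, the critical value 2.34 is the table value from the example, and the assignment of values to regions is the one that reproduces the group means quoted in the solution (185.14, 177.00, 135.71, 210.14). Because nothing is rounded along the way, MSB and MSW differ slightly from the hand calculation, but F still comes out near 1.61.

```python
from statistics import mean, variance

def one_way_anova(groups):
    """Return (MSB, MSW, F) for a list of samples, one list per treatment."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-sample variation: weighted squared distance of each group mean from the grand mean.
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-sample variation: (n_i - 1) times each sample variance.
    ssw = sum((len(g) - 1) * variance(g) for g in groups)
    msb = ssb / (k - 1)
    msw = ssw / (n_total - k)
    return msb, msw, msb / msw

northeast = [308, 58, 141, 109, 220, 144, 316]
midwest = [246, 169, 246, 158, 167, 76]
south = [103, 143, 164, 119, 99, 214, 108]
west = [223, 184, 221, 269, 199, 171, 204]

msb, msw, f_ratio = one_way_anova([northeast, midwest, south, west])
f_critical = 2.34                 # alpha = 0.10, d.f. 3 and 23 (table value)
reject = f_ratio > f_critical
print(round(msb, 2), round(msw, 2), round(f_ratio, 2), reject)
```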