Uploaded by Destinee Everett- Scott

Basic Statistics: Introduction & Measures of Center

Section A
Basic Introduction to Statistics
What do you think of when you see/hear the word statistics? The majority of people immediately think of
numerical facts, data, graphs and tables. But not only do statisticians collect, classify and tabulate data,
they also analyze data in order to make generalizations and decisions.
Why study statistics?
1) Everyone comes in contact with statistics in everyday life.
2) People should understand reports in newspapers, magazines and journals.
3) People should be able to question the statistics they read, and not blindly accept these as proven fact.
4) Many areas of study use statistics, such as psychology, sociology, business, biology, government,
engineering, science education and even areas such as history, language and the arts.
Statistics is the science of collecting, organizing, summarizing, and analyzing data to draw conclusions or
answer questions. It also provides a measure of confidence in any conclusions.
Two types of statistics:
1) Descriptive statistics: the use of numbers to summarize information which is known about some
population. [collecting, organizing and summarizing the data]
2) Inferential statistics: the use of numbers related to a random sample from a population to give
numerical information about the population itself. [analyzing the data to draw
conclusions or answer questions about the population]
Example:
Determine which of the following is an example of descriptive statistics and which is an example of inferential
statistics.
a) The average weight of all football players on the NY Giants football team is 235 pounds.
b) The average yearly salary of a random sample of 150 minor league baseball players is $102,000.
Therefore, the average yearly salary of all minor league baseball players is $102,000.
Probability is the measure of the likelihood that something happens/occurs and is very important in
inferential statistics; it’s related to the risk of making an error.
A variable is a characteristic that describes a person, place or thing being studied.
Example: height, gender, weight, color
A raw score is an unaltered measurement obtained in a particular situation. It is the raw information from
which statistics are created.
A distribution is a collection of raw scores.
Examples of distributions:
Population – All people or things being considered in a particular situation
EX:
A parameter is a numerical value that summarizes or describes the whole population
EX:
Sample – any portion (subset) of a population under consideration
EX:
A statistic is a numerical value that summarizes or describes a sample
EX:
Note: A parameter goes with a population and a statistic goes with a sample.
Examples:
1) To determine the average GPA of 500 students who just finished their first year in college, a group of 60
students is randomly selected. It is determined that the average GPA is 2.85.
a) What is the population for this study?
b) What constitutes the sample?
c) Based on the sample, what is the statistic for the average GPA of the population?
2) Determine whether the number described is a parameter or statistic:
a) In a recent survey of college graduates, 68% of those who responded said they had more than $50,000
in student loans.
b) The average age of all the employees working at XYZ Company is 37 years.
c) The average GPA of 250 randomly selected students at ABC University is 2.73.
d) Of all the students attending Mercer County Community College in 2018, 66% were part time
students.
Random Sample: A sample selected in such a way that every member of the population has the same
probability of being selected for the sample. A sample chosen at random is meant to be an unbiased
representation of the total population. Note: the word “random” describes the process by which the
sample is chosen and does not guarantee that the sample will be representative, but it allows us to
determine the probability the sample is representative.
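One way to draw a simple random sample in software is sketched below. This is only an illustration: the roster of 500 ID numbers is hypothetical, and Python's `random.sample` is just one tool that gives every member of the population the same chance of being selected.

```python
import random

# Hypothetical population: ID numbers 1..500 standing in for 500 students.
population = list(range(1, 501))

# Fixing the seed makes this illustration repeatable; omit it in real use.
rng = random.Random(2018)

# random.sample draws without replacement, so no student is picked twice,
# and every student has the same probability of being selected.
sample = rng.sample(population, 60)

print(len(sample))         # number of students selected
print(sorted(sample)[:5])  # a few of the selected ID numbers
```

A different seed gives a different sample. The process is random, but any single sample may still differ from the population.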
Consider the following: Population: All students attending Mercer County Community College
Variable: Some measure of mathematical ability
Sample: Students leaving a section of calculus at MCCC.
This is not a random sample from the population of all students at MCCC. From this sample we should
not attempt to infer anything about the mathematical ability of all students at MCCC.
Note: A bias in obtaining a sample will destroy the value of the statistical information obtained since
statistical inferences made from this information would be invalid. That is why it is important to
use random samples when doing statistical analysis.
Example:
Determine whether or not the sample given represents the given population accurately.
a) Population: All students attending Mercer County Community College.
Sample: 100 students selected at random entering the student center at noon on Monday.
b) Population: All businesses in Mercer County.
Sample: 75 businesses selected at random from a list of all businesses in Mercer County.
Why use a sample instead of a population?
a)
b)
c)
d)
Two types of variables
1) Qualitative or categorical variable – classification based on some attribute or characteristic of the
individual (non-numerical)
EX:
2) Quantitative variable – provides numerical measures of individuals
Two types of quantitative variables
1) Discrete – has either a finite number of possible values or a countable number of possible values
(something that can be counted)
EX:
2) Continuous – has an infinite number of possible values that are not countable
(something that can be measured)
EX:
Example:
The following data set provides information about five college professors.
Name      Specialty     Gender   Age   Height (in)   # of years of teaching   Rank
Allen     Nursing       F        40    65            16                       Associate Professor
Backer    Accounting    M        59    71            34                       Full Professor
Hughes    Psychology    F        52    69            13                       Associate Professor
Ramirez   Mathematics   F        30    68            5                        Assistant Professor
Turner    Sociology     M        38    70            9                        Assistant Professor
Which variables are qualitative and which are quantitative variables?
Qualitative variables:
Quantitative variables:
Section A: Homework
1) Determine which of the following is an example of descriptive statistics and which is an example of
inferential statistics.
a) The average height of all faculty at MCCC is 5 feet 11 inches.
b) The average IQ of a random sample of 100 students at MCCC is 105. Therefore, the average IQ
of all MCCC students is approximately 105.
2) Determine whether the number described is a parameter or a statistic.
a) The average height of all football players on the Eagles Football team in 2016 was 73.72 inches.
b) In a survey of 1000 college students, 37% believe they will have difficulty finding a job in their
major field after graduation.
c) In a recent poll, the average age of the respondents was 43 years.
d) The average number of hours all students at Ivy University study per day is 2.36 hours.
3) To determine the average typing speed of 700 students who just finished Typing 101, a group of 20
students is randomly selected. It is determined that the average typing speed is 47 words per minute.
a. What is the population for this study?
b. What constitutes the sample?
c. Based on the sample, what is the statistic for the average typing speed of the population?
4) Determine whether or not the sample given represents the given population accurately.
a) Population: All students in MAT125 this semester.
Sample: Every 5th name from a list of all students in MAT125 this semester.
b) Population: All residents of Mercer County.
Sample: 50 people are selected at random of those who live in Hamilton Square.
c) Population: All residents of New Jersey.
Sample: Selection of names at random from all New Jersey residents.
5) Which are qualitative and which are quantitative? (If quantitative, state discrete or continuous.)
a) The number of people attending a Trenton Thunder Baseball game.
b) Your cell phone number.
c) The seating capacity of a football stadium.
d) The amount of electricity used by a household during a given month.
e) The name of your favorite movie.
f) The amount of time you wait to see a doctor.
g) The number of red cars in the MCCC west student parking lot.
Section B
Measures of Center
Descriptive statistics consists of methods to organize and summarize data clearly and effectively.
Organizing and summarizing the data is useful, since it helps the researcher see the important aspects of
the data collected. One way to describe a data set is to find numerical summaries of the data. The first
type of numerical summary consists of the measures of center; these values describe the center of the data.
Three Measures of Center:
1) Mean: balance, center of gravity, equal weight on each side of the mean
2) Median: cuts the data set in half, 50% on one side of the median 50% on the other
3) Mode: Most frequent value(s) in a distribution; used with both qualitative and quantitative variables
1) Mean – the mean is the point at which the data set would balance;
it is affected by extreme values in the data set (nonresistant)
Notation: Population mean: $\mu = \frac{\sum x}{N}$     Sample mean: $\bar{x} = \frac{\sum x}{n}$
where N is the population size and n is the sample size
Note: $\sum x$ means add up the data values
Example: 14 18 34 26 31 56 45 48 23
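As a quick check of the formula, the mean of the example values can be computed as follows. This is a minimal sketch that treats the nine values as a sample; the variable names are illustrative only.

```python
# Data values from the example above.
data = [14, 18, 34, 26, 31, 56, 45, 48, 23]

n = len(data)      # number of data values
total = sum(data)  # sum of the data values (the "sum of x" in the formula)
mean = total / n   # mean = (sum of x) / n

print(n, total, round(mean, 2))  # 9 295 32.78
```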
2) Median: Need to put data in ascending order.
Not affected by extreme values in the data set (resistant).
If the number of data values is odd, the median is the middle data value.
If the number of data values is even, the median is the sum of the middle two values divided by 2.
Example: Odd number of data values:
45 47 52 54 58 63 65 67 73 75 79
Even number of data values: 56 59 62 64 65 67 68 69 70 74
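The median rule can be written out step by step. The function below is a sketch of that rule applied to the two example data sets; the function name is an illustrative choice.

```python
def median(values):
    """Return the median: middle value (odd count) or mean of the middle two (even count)."""
    ordered = sorted(values)  # put the data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:            # odd count: the middle value
        return ordered[mid]
    # even count: sum of the two middle values divided by 2
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [45, 47, 52, 54, 58, 63, 65, 67, 73, 75, 79]
even_data = [56, 59, 62, 64, 65, 67, 68, 69, 70, 74]

print(median(odd_data))   # 63
print(median(even_data))  # 66.0
```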
Comparing the mean and median to determine the shape of the distribution:
Right Skewed: mean > median
Left Skewed: mean < median
Symmetric: mean = median
Graphical representation (figures not reproduced): Left Skewed, Right Skewed, Symmetric (bell-shaped)
Examples:
1) 3 4 5 6 7 8 9
2) 11 12 15 16 18 24 32
3) 1 2 8 9 10 11 12 13 14
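The comparison rule can be applied mechanically, as sketched below for the three example data sets. The function name `shape` is an illustrative choice. Real data rarely give a mean exactly equal to the median, so treat the rule as a rough guide.

```python
from statistics import mean, median

def shape(values):
    """Classify a data set by comparing its mean and median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right skewed"
    if m < med:
        return "left skewed"
    return "symmetric"

examples = [
    [3, 4, 5, 6, 7, 8, 9],
    [11, 12, 15, 16, 18, 24, 32],
    [1, 2, 8, 9, 10, 11, 12, 13, 14],
]
for values in examples:
    # Print the mean (rounded), the median, and the resulting label.
    print(round(mean(values), 2), median(values), shape(values))
```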
3) Mode: The mode of a data set is the value that appears most frequently.
If two or more values are tied for the most frequent, they are all considered to be modes.
If the values all have the same frequency, we say that the data set has no mode.
Examples:
1) red red blue blue yellow yellow green red blue green blue blue
2) 45 46 34 36 53 55 54 54 32 36 64 49 50
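The mode rules, including ties and the "no mode" case, can be sketched as follows. The function name `modes` and the choice to return an empty list for "no mode" are illustrative assumptions.

```python
from collections import Counter

def modes(values):
    """Return the list of modes, or an empty list when there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # If every value occurs equally often, the data set has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    # Otherwise every value tied for the highest frequency is a mode.
    return [value for value, c in counts.items() if c == highest]

colors = ["red", "red", "blue", "blue", "yellow", "yellow",
          "green", "red", "blue", "green", "blue", "blue"]
numbers = [45, 46, 34, 36, 53, 55, 54, 54, 32, 36, 64, 49, 50]

print(modes(colors))     # ['blue']
print(modes(numbers))    # [36, 54]
print(modes([1, 2, 3]))  # []
```

Because the function only counts occurrences, it works for both qualitative data (colors) and quantitative data (numbers).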
More Examples:
1) Given the following set of numbers:
209 214 220 224 224 229 239 241 245 246 247 249 255
Find:
a) n =
b) ∑ x = ______ c) Mean = ______ d) Median = ________ e) Mode = _________
f) Is the distribution right skewed, left skewed or symmetric? _______________________
2) Given the following set of numbers : 10 12 14 15 17 21 22 23
a) n = _________ b) ∑ x = ________ c) x̅ = _______ d) ∑(x − 16.75)= ____________
e) If 60 is added to each data value of the above distribution, what will the mean of the resulting
distribution equal? __________
f) If each data value of the above distribution is multiplied by 5, what will the mean of the resulting
distribution equal? ___________
3) Given the following table:
Ice Cream Flavor   Frequency
Chocolate          10
Strawberry         7
Vanilla            16
What is the mode?___________
Section B: Homework
1) Given the following set of numbers:
22 28 36 42 42 42 52 53 54 55 56 57 58 59
Find: a) n = ______ b) ∑ x = _______ c) Mean = ________ d) Median = _______ e) Mode = _______
f) Is the distribution right skewed, left skewed or symmetric? _________________
2) Given the following set of numbers:
124 126 127 128 130 132 132 139 144 148 149 155 170
Find: a) n = ______ b) ∑ x = _______ c) Mean = ________ d) Median = _______ e) Mode = _______
f) Is the distribution right skewed, left skewed or symmetric? _________________
3) Observations of cars with frequencies in a parking lot:
Make of Car   Frequency
Ford          23
Chevrolet     30
Kia           19
Pontiac       2
Buick         2
Cadillac      3
Mercury       4
Lincoln       2
Volkswagen    21
What is the mode?_________________
4) Given the following set of numbers : 2 3 4 5 5 6 10
a) n = _________ b) ∑ x = ___________ c) x̅ = ___________ d) ∑(x − 5)= __________
e) If 350 is added to each term of the above distribution, what will the mean of the resulting
distribution equal? _________________
5) Given the following distribution of test scores:
85 72 9 83 81 85 93 0 82 85
a) Calculate the: mean = ___________, median = ____________, mode = ____________
b) Which of the three averages seems most meaningful in this situation? ___________ Why?
6) Professors were observed wearing the following colored shirts on a particular day. Find the “average”
of this distribution and state which measure of central tendency it is:
Color    Frequency
White    1
Blue     9
Yellow   4
Red      2
Pink     1
Section C
Measures of Spread
Another type of descriptive summary is called measures of spread or measures of variation. Measures of
spread summarize the data in a way that shows how scattered the values are from each other and how
much they differ from the mean value. Just as there are different measures of center there are different
measures of spread such as range, variance, and standard deviation. Note, two data sets can have the
same mean, median or mode, but be very different in their measure of spread; therefore it is important to
summarize the data using both a measure of center and a measure of spread.
1) The range of a data set is the difference between its largest value and its smallest value.
Range = largest value – smallest value
In using the range, a great deal of information is ignored since only the largest and smallest values are
used to calculate the range, so the range is not a measure of spread used often in summarizing data.
2) Variance – is a measure of how far the values in a data set are from the mean, on average
Deviation – the difference between a value and the mean, μ. deviation = x – μ
If the deviation is positive the value lies above the mean.
If the deviation is negative the value lies below the mean.
The sum of the deviations equals zero: ∑(x − μ) = 0. Recall that the mean is the value where
the data would balance, so it makes sense that the deviations on either side of the mean cancel each
other out.
Formula for population variance:
$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2}{N} - \mu^2$
where µ is the population mean and N is the population size
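The two forms of the population variance formula are equal. A short derivation is sketched below; it uses only the definition of the population mean, which gives $\sum x = N\mu$.

```latex
\begin{align*}
\sigma^2 = \frac{\sum (x - \mu)^2}{N}
  &= \frac{\sum x^2 - 2\mu \sum x + N\mu^2}{N} \\
  &= \frac{\sum x^2}{N} - 2\mu \cdot \frac{\sum x}{N} + \mu^2 \\
  &= \frac{\sum x^2}{N} - 2\mu^2 + \mu^2 \\
  &= \frac{\sum x^2}{N} - \mu^2
\end{align*}
```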
When the data values come from a sample rather than a population, the variance is called the sample
variance. The procedure for computing the sample variance is a bit different from the one used to
compute a population variance. In the formula, the population mean μ is replaced by the sample mean
x̅ and the denominator is n − 1 (n is the sample size) instead of N (the population size). The sample
variance is denoted by $s^2$.
Formula for sample variance:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}$
where x̅ is the sample mean and n is the sample size
When computing the sample variance, $s^2$, we use the sample mean, x̅, to compute the deviations. For
the population variance, $\sigma^2$, we use the population mean, µ, for the deviations. It turns out that
the deviations calculated using the sample mean tend to be a bit smaller than the deviations calculated
using the population mean. If we were to divide by n when computing a sample variance, the value would
tend to be a bit smaller than the population variance.
It can be shown mathematically that the appropriate correction is to divide the sum of the squared
deviations by n − 1 rather than n.
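The effect of the correction can be checked exactly on a tiny case. The sketch below makes two assumptions that are not in the text: a hypothetical population {1, 2, 3, 4}, and samples of size 2 drawn with replacement, so every ordered sample is equally likely. It averages the two candidate variance formulas over all possible samples.

```python
from fractions import Fraction
from itertools import product

# Hypothetical small population, chosen only to keep the arithmetic exact.
population = [1, 2, 3, 4]
N = len(population)
mu = Fraction(sum(population), N)
sigma_sq = sum((x - mu) ** 2 for x in population) / N  # population variance

n = 2  # sample size
# Every equally likely ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

# Average of each candidate formula over all possible samples.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma_sq)           # 5/4
print(avg_div_n)          # 5/8
print(avg_div_n_minus_1)  # 5/4
```

In this small case, dividing by n gives values that average below the population variance, while dividing by n − 1 averages exactly to it.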
Because the variance is computed using squared deviations, the units of the variance are the squared
units of the data. In most situations, it is better to use a measure of spread that has the same units as the
data.
We do this simply by taking the square root of the variance. This quantity is called the standard deviation.
The standard deviation of a population is denoted σ, and the standard deviation of a sample is denoted s.
population standard deviation: $\sigma = \sqrt{\sigma^2}$
sample standard deviation: $s = \sqrt{s^2}$
In other words, the computational formula for the sample standard deviation is as follows:
$s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}}$
Note: $\sum x^2 \neq (\sum x)^2$
For example: Given: 1, 2, 3, 4
$\sum x = 1 + 2 + 3 + 4 = 10$, so $(\sum x)^2 = 10^2 = 100$
$\sum x^2 = 1^2 + 2^2 + 3^2 + 4^2 = 30$, and $30 \neq 100$
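The same four values can be used to check that the definition and the computational formula for the sample variance agree. This is a minimal sketch that treats 1, 2, 3, 4 as a sample; the variable names are illustrative.

```python
from math import sqrt

data = [1, 2, 3, 4]
n = len(data)

sum_x = sum(data)                     # sum of x = 10
sum_x_sq = sum(x ** 2 for x in data)  # sum of x squared = 30 (square first, then add)
square_of_sum = sum_x ** 2            # (sum of x) squared = 100 (add first, then square)

# Sample variance from the definition: squared deviations divided by n - 1.
xbar = sum_x / n
s_sq_definition = sum((x - xbar) ** 2 for x in data) / (n - 1)

# Sample variance from the computational formula.
s_sq_shortcut = (n * sum_x_sq - square_of_sum) / (n * (n - 1))

s = sqrt(s_sq_shortcut)               # sample standard deviation

print(sum_x_sq, square_of_sum)                             # 30 100
print(round(s_sq_definition, 4), round(s_sq_shortcut, 4))  # 1.6667 1.6667
print(round(s, 4))                                         # 1.291
```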
1) For the following set of numbers:
10 12 14 17 18 20
Find:
a) n = _______
b) ∑ x =____________
c) $\sum x^2$ = __________
d) $s^2$ = _____________
e) s = _____________
2) For the following set of numbers: 45 34 29 31 54 42 37 32
Find:
a) n = _______
b) ∑ x =____________
c) $\sum x^2$ = __________
d) $s^2$ = _____________
e) s = _____________
3) For the following set of numbers: 15 16 25 29 32 39 41 48 46 47
Find:
a) n = _______
b) ∑ x =____________
c) $\sum x^2$ = __________
d) $s^2$ = _____________
e) s = _____________
f) Add 20 to each of the data values above, what is the new variance, $s^2$ = __________
what is the new standard deviation, s = __________
g) Multiply each data value by 12, what is the new variance, $s^2$ = _____________
what is the new standard deviation, s = ______________
If a data set is approximately bell-shaped, the mean and standard deviation together can provide an
approximate description of the data using the following rule:
The Empirical Rule: When a population has a histogram that is approximately bell-shaped, then
• Approximately 68% of the data will be within one standard deviation of the mean.
  (μ − σ, μ + σ) or (x̅ − s, x̅ + s)
• Approximately 95% of the data will be within two standard deviations of the mean.
  (μ − 2σ, μ + 2σ) or (x̅ − 2s, x̅ + 2s)
• Approximately 99.7% of the data will be within three standard deviations of the mean.
  (μ − 3σ, μ + 3σ) or (x̅ − 3s, x̅ + 3s)
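The three intervals can be generated from any mean and standard deviation. The sketch below uses a hypothetical bell-shaped data set with mean 50 and standard deviation 10; the function name is illustrative.

```python
def empirical_rule(mean, sd):
    """Return the 1-, 2-, and 3-standard-deviation intervals around the mean."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {pct: (mean - k * sd, mean + k * sd) for k, pct in coverage.items()}

# Hypothetical bell-shaped data set with mean 50 and standard deviation 10.
for pct, (low, high) in empirical_rule(50, 10).items():
    print(f"about {pct} of the data lie between {low} and {high}")
```

The percentages are approximations that apply only when the data are approximately bell-shaped.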
Examples:
1) IQ scores are approximately bell-shaped with a mean of 100 and standard deviation of 15.
a) Between what two values will approximately 95% of the IQ scores fall?__________
b) About what percent of the IQ scores is between 85 and 115? __________
c) About what percent of the IQ scores is between 55 and 145? __________
2) The heights of 2-year-old girls are approximately bell-shaped with a mean of 34 inches and a standard
deviation of 2.5 inches.
a) About what percent of the heights is between 29 and 39 inches? _________
b) About what percent of the heights is between 31.5 and 36.5 inches? _________
c) Between what two values will approximately 99.7% of the heights fall?________
Section C: Homework
1) Given the following set of numbers:
10 12 15 17 19 22
Find:
a) n = _______
b) ∑ x =____________
c) $\sum x^2$ = __________
d) $s^2$ = _____________
e) s = _____________
f) Add 5 to each of the data values above, what is the new variance, $s^2$ = __________
what is the new standard deviation, s = ___________
g) Multiply each data value by 10, what is the new variance, $s^2$ = _____________
what is the new standard deviation, s = ______________
2) For the following set of numbers: 53 62 57 54 63 67 70
Find:
a) n = _______
b) ∑ x =____________
c) $\sum x^2$ = __________
d) $s^2$ = _____________
e) s = _____________
f) Add 50 to each of the data values above, what is the new variance, $s^2$ = __________
what is the new standard deviation, s = __________
g) Multiply each data value by 8, what is the new variance, $s^2$ = _____________
what is the new standard deviation, s = ______________
3) Last year the mean salary for professors in a particular community college was $62,000 with a standard
deviation of $2000. A new two year contract is negotiated.
a) In the first year of the contract, each professor receives a $1500 raise. Find the mean and standard
deviation for the first year of the contract.
b) In the second year of the contract, each professor receives a 3% raise based on their salary during the
first year of the contract. Find the mean and the standard deviation for the second year of the contract.
4) The mean time a cell phone battery will hold its charge during moderate use is 650 minutes with a
standard deviation of 75 minutes. Assume the data is approximately bell-shaped.
a) Between what two values will approximately 95% of data fall? __________
b) About what percent of data is between 575 and 725? __________
c) About what percent of data is between 425 and 875? _________
5) Monthly electric bills in New Jersey follow a bell-shaped distribution with a mean of $109 and a standard
deviation of $37.
a) About what percent of the bills will be between $72 and $146? ___________
b) Between what two values will approximately 99.7% of the bills fall?_________
c) Between what two values will approximately 95% of the bills fall?____________
Section D
Measures of Position
Another way to summarize a data set is to determine where data values lie within the data set. These
measures of position are z-scores, percentiles, and quartiles.
z-scores indicate the location of a value with respect to the mean of a data set. The z-score of a value
expresses how many standard deviations the value lies above or below the mean. In addition, z-scores
provide a way to compare data sets which have different means and standard deviations.
A z-score is calculated using one of the following formulas, depending on whether you are using a
population or a sample:
Population: $z = \frac{x - \mu}{\sigma}$     or     Sample: $z = \frac{x - \bar{x}}{s}$
If the z-score is positive (+), the value is above the mean.
If the z- score is negative (−), the value is below the mean.
The distribution of z-scores has a mean equal to 0 and a standard deviation equal to 1.
The mean for any distribution would have a z-score of 0.
Solving the above equations for x:
x = μ + zσ or
x = x̅ + zs
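Both directions of the z-score formula can be sketched as two small functions. The data set here (mean 60, standard deviation 8) is hypothetical and the function names are illustrative. The same code works for a population or a sample, since only the symbols differ.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Invert the z-score formula: x = mean + z * sd."""
    return mean + z * sd

# Hypothetical data set with mean 60 and standard deviation 8.
print(z_score(72, 60, 8))         # 1.5
print(z_score(50, 60, 8))         # -1.25
print(value_from_z(-0.5, 60, 8))  # 56.0
```

To compare values from two different data sets, compute each z-score with its own mean and standard deviation and compare the results.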
Examples:
1) Suppose you have a data set in which the population mean = µ = 45 and the population standard
deviation = σ = 4
a) Find the z-score for 52.
b) Find the z-score for 32.
c) Find the x-value that corresponds to a z-score of −2.25.
d) Find the x-value that corresponds to a z-score of 1.87.
2) Suppose you have a data set in which the sample mean = x̅ = 75 and the sample standard
deviation = s = 10.6
a) Find the z-score for 57.
b) Find the z-score for 95.
c) Find the x-value that corresponds to a z-score of 2.65.
d) Find the x-value that corresponds to a z-score of –2.38.
3) Before applying to colleges, Jimmy took both the SATs and the ACTs. He scored a 1250 on the SATs
and a 28 on the ACTs. The mean and standard deviation for the SATs are 1059 and 210, respectively. The
mean and standard deviation for the ACTs are 20.6 and 5.8, respectively.
On which exam did he do relatively better? Why?
4) During the year, Jennifer ran in a full marathon as well as a half marathon. She ran the full marathon in
262 minutes and the half marathon in 129 minutes. The mean and standard deviation for the full
marathon are 287 minutes and 45 minutes, respectively. The mean and standard deviation for the half
marathon are 143 minutes and 18 minutes, respectively. In which marathon did she do relatively better?
Why?
Percentiles indicate what percent of the values in the data set are below a particular data value.
The kth percentile, denoted Pk, of a data set is a value such that k percent of the observations are less than
or equal to the value. Note: The median is at the 50th percentile, i.e., median = P50.
Example: Suppose 75 is at the 68th percentile (P68 = 75). This means that 68% of the data values
are less than 75.
Suppose 62 is at the 42nd percentile (P42 = 62). This means that 42% of the data values
are less than 62.
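The idea behind a percentile can be illustrated by counting. The sketch below follows the "less than or equal to" wording of the definition and uses a hypothetical set of 20 test scores. It reports the percent of observations at or below a value and is not a full percentile-finding procedure.

```python
def percent_at_or_below(data, value):
    """Percent of the observations that are less than or equal to value."""
    count = sum(1 for x in data if x <= value)
    return 100 * count / len(data)

# Hypothetical set of 20 test scores.
scores = [52, 55, 58, 60, 61, 63, 65, 66, 68, 70,
          71, 73, 75, 76, 78, 80, 83, 85, 90, 95]

print(percent_at_or_below(scores, 75))  # 65.0
```

Here 13 of the 20 scores are at or below 75, so 75 sits at about the 65th percentile of this hypothetical data set.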
1) In a particular county, records indicated the assessed value of each of the 150,000 houses there. The
following percentiles were obtained.
P15 = $120,000
P50 = $175,000
P65 = $210,000
P90 = $255,000
a) What percent of houses were assessed below $120,000?________________
b) What percent of houses were assessed below $255,000?________________
c) What percent of houses were assessed between $120,000 and $210,000?_______________
d) What percent of houses were assessed above $210,000?___________________
e) What percent of houses were assessed above $120,000?_________________
f) How many houses were assessed above $120,000?__________________
g) What is the median value of the assessed houses?____________________
2) Phil Latelist gave a test on the history of stamp collecting to a class of 400 students. Five of the test
scores with the corresponding z-score and percentile are given in the table below:
Test Score   Z-score   Percentile
57           -2        10
65           -1        25
73           0         42
77           0.5       50
85           1.5       70
Answer the following questions about all 400 test scores.
a) What is the mean of the 400 scores? _________ b) What is the median of the 400 scores?_________
c) What is the standard deviation of the 400 scores?___________
d) What percent of the scores lie between 85 and 65? ___________
e) How many scores lie above 57?__________
Quartiles are a special type of percentile.
Quartiles are the 25th, 50th and 75th percentiles, denoted Q1, median, and Q3, respectively. Quartiles
divide the data set into quarters or in other words four parts. Quartiles are used to determine the shape
of a distribution and are used to determine if the data set has what are called outliers or extreme values;
data values that differ significantly from the other observations in the data set.
Example: Given the data set: 45 47 50 53 56 59 62 65 67 74 76 Find the quartiles.
A way to describe a data set using quartiles is called the five-number summary.
The five-number summary consists of the minimum, Q1, median, Q3, maximum written in this order.
[min, Q1, median, Q3, max]
For the data set above find the five-number summary. ______________________________
Interquartile Range (IQR) is the difference between the third and first quartiles of a data set. Note: The
IQR is actually a measure of spread since it is the range of the middle 50% of the observations.
IQR = Q3 – Q1
For the data set above, IQR = _______________
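Textbooks and software differ on how quartiles are computed, and the handout does not state its rule. The sketch below assumes one common classroom convention: Q1 and Q3 are the medians of the lower and upper halves of the ordered data, with the overall median left out of both halves when n is odd. Under a different convention the results can differ.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving the overall median out of both halves when n is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

data = [45, 47, 50, 53, 56, 59, 62, 65, 67, 74, 76]
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # range of the middle 50% of the observations

print((minimum, q1, med, q3, maximum))  # (45, 50, 59, 67, 76)
print(iqr)                              # 17
```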
Using the quartiles and IQR it can be determined if the data set contains outliers or not. An outlier is a
value that is considerably larger or smaller than most of the values in a data set.
Boundaries (fences) serve as cutoff points for determining outliers:
Possible outlier boundaries: Lower Fence = LF = Q1 – 1.5(IQR)
Upper Fence = UF = Q3 + 1.5(IQR)
Extreme outlier boundaries : Lower Lower Fence = LLF = Q1 – 3(IQR)
Upper Upper Fence = UUF = Q3 + 3(IQR)
Therefore any values that are between the lower lower fence and the lower fence or between the upper
fence and the upper upper fence are considered possible outliers. Any values less than the lower lower
fence or greater than the upper upper fence are considered extreme outliers.
[Number line marked LLF, LF, UF, UUF]
(−∞, LLF): extreme outliers
(LLF, LF): possible outliers
(UF, UUF): possible outliers
(UUF, ∞): extreme outliers
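The fence rules can be sketched as a small function. The quartile values used here are hypothetical, and the treatment of a value that falls exactly on a fence is an assumption, since the text does not address that case.

```python
def classify(value, q1, q3):
    """Label a value using the possible-outlier and extreme-outlier fences.

    Assumption: a value exactly on a fence is not counted as beyond it.
    """
    iqr = q3 - q1
    lf, uf = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # possible-outlier fences
    llf, uuf = q1 - 3 * iqr, q3 + 3 * iqr    # extreme-outlier fences
    if value < llf or value > uuf:
        return "extreme outlier"
    if value < lf or value > uf:
        return "possible outlier"
    return "not an outlier"

# Hypothetical quartiles: Q1 = 20 and Q3 = 30, so IQR = 10,
# LF = 5, UF = 45, LLF = -10, and UUF = 60.
for value in [12, 3, 50, 75, -15]:
    print(value, classify(value, 20, 30))
```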
Example 1a: Given the following data set: 14 34 38 43 45 47 53 54 55 56 58 85
Find the five-number summary, the IQR and determine if there are any outliers.
five – number summary _______________________
IQR = ______________
LF = ____________
UF = _______________
Outliers (if any)____________
Modified boxplots are a graphical display of quantitative data. Boxplots are created using the
five-number summary and outliers, if any. Boxplots are useful for comparing two or more data sets. You can
also use a boxplot to identify the approximate shape of the distribution of a data set, especially for large
data sets; histograms and stem-and-leaf plots are better graphical displays for small data sets.
To draw a box-plot,
Step 1) Draw lines at Q1, the median, and Q3 and draw a box using the lines.
Step 2) Put an asterisk at the outliers, if any.
Step 3) Draw the lower whisker out to either the minimum value, if there are no
outliers, or to the smallest value that is not an outlier
Step 4) Draw the upper whisker out to either the maximum value, if there are no
outliers, or to the largest value that is not an outlier
Example 1b: Draw a Modified Boxplot using the data in example 1a:
Example 2: Given the following data set:
67 68 78 79 80 81 83 85 86 87 88 89 90 91 92 93 95
Find the five-number summary, the IQR, determine if there are any outliers and draw a modified box plot.
five-number summary ____________________
IQR = ___________
LF = ____________
UF = _____________
Outliers (if any) ___________
Modified Box-plot:
Example 3: On a baseball team, the ages of each of the players are as follows:
19 24 24 25 25 25 26 26 26 26 27 27 27 28 28 31
Find the five-number summary, the IQR, determine if there are any outliers and draw a modified box plot.
five-number summary_______________________
IQR = ____________
LF = _____________ UF = ____________
Outliers (if any) ______________
Modified box-plot:
4) The U.S. National Center for Health Statistics compiles data on the length of stay by patients in
short-term hospitals and publishes its findings in Vital and Health Statistics. A random sample of 21 patients
yielded the following data on length of stay, in days.
1 1 3 3 4 4 5 6 6 7 7
9 9 10 12 12 13 15 18 23 55
Find the five-number summary, the IQR, determine if there are any outliers and draw a modified box plot.
five-number summary_______________________
IQR = ____________
LF = _____________ UF = ____________
Outliers (if any)______________
Modified box-plot:
Section D: Homework
1) On a test, assume μ = 75 and σ = 5
a) If a person’s raw score is 68, find his z-score. ________________
b) If a person’s raw score is 92, find his z-score._________________
c) If a person’s z-score is 2.2, find their test score.________________
d) If a person’s z-score is −1.8, find their test score._______________
2) On a history test Patrick received a grade of 43 out of 50. The class mean was 35 with a standard
deviation of 6. His girlfriend Kristin, in another section, received a grade of 87 out of 100. Her class mean
was 75 with a standard deviation of 10. Who received the higher grade relative to their respective
classes? Explain your answer.
3) A representative from an auto company stated that the mean weight of the cars his company produces
is 3900 pounds with a standard deviation of 200 pounds. He went on to say that the lightest car had a
z-score of −2.3 while the heaviest one had a z-score of 3.1. Find the weight of the lightest car and the
heaviest car produced by the company.
4) In a particular county, records indicated the assessed value of each of the 200,000 houses there. The
following percentiles were obtained.
P20 = $110,000
P50 = $160,000
P75 = $220,000
P95 = $260,000
a) What percent of houses were assessed below $110,000?________________
b) What percent of houses were assessed below $260,000?________________
c) What percent of houses were assessed between $110,000 and $220,000?_______________
d) What percent of houses were assessed above $220,000?___________________
e) What percent of houses were assessed above $110,000?_________________
f) How many houses were assessed above $110,000?__________________
5) The following set of numbers represents the grades of 20 students in an Elementary Statistics I class.
52 72 74 74 76 78 79 80 82 83 84 85 85 86 87 87 88 90 91 103
a) Find the five-number summary ___________________________
b) IQR = _______________
c) Lower Fence = ______________ Upper Fence = ___________________
d) Outliers(if any): _______________________
e) Construct a Modified Box Plot.
6) The following data set represents the heights of a random sample of 25 adults in inches.
48 60 60 62 63 64 65 65 65 65 66 66 66 67 68 69 70 72 72 73 73 73 74 75 76
a) Find the five-number summary ___________________________
b) IQR = _______________ c) Lower Fence = ______________ Upper Fence = ________________
d) Outliers(if any): _______________________
e) Construct a Modified Box Plot.
7) Dr. Stanley Thomas of Texas State University has collected information on millionaires. The ages of 36
millionaires are as follows:
31 38 39 39 42 42 45 47 48 48 48 52 52 53 54 55 57 59
60 61 64 64 66 66 67 68 68 69 71 71 74 75 77 79 79 79
a) Find the five-number summary ___________________________
b) IQR = _______________
c) Lower Fence = ______________ Upper Fence = ___________________
d) Outliers(if any): _______________________
e) Construct a Modified Box Plot.
8) The following data is the age of U.S. Presidents on inauguration day. The data has been put in
ascending order. There are 46 values.
42.88 43.65 46.42 46.85 47.46 47.96 48.28 49.29 49.33 50.5 51.02 51.08 51.09
51.47 51.96 52.05 52.3 54.09 54.24 54.41 54.54 54.56 55.24 55.33 55.54 55.96
56.03 56.18 56.29 57.18 57.65 57.89 57.97 58.85 60.93 61.07 61.34 61.97 62.27
64.27 64.61 65.86 68.06 69.96 70.6 78.17
a) Find the five-number summary ___________________________
b) IQR = _______________
c) Lower Fence = ______________ Upper Fence = _______________
d) Outliers (if any): _______________________
e) Construct a Modified Box Plot.
9) A particular statistics test was taken by 200 students. A few of the test scores are given in the
following table along with their corresponding z-scores and percentiles.
Test Score    z-score    Percentile
66            −1.5       20
69            −1         30
75            0          42
81            1          50
90            2.5        85
96            3.5        99
From only this table answer the following questions about the entire distribution of 200 scores:
a) What is the mean of this distribution?_______________
b) What is the median of this distribution?_____________
c) What is the standard deviation of the distribution?____________
d) What is the variance of the distribution?_____________
e) What percent of scores are between the test scores 69 and 90?____________
f) How many scores are above 96?________________
g) Is the distribution right skewed, left skewed or symmetric? _________________
10) The following are summary statistics for the final exam scores for students in psychology 101 during
the spring semester.
Mean = 68
Median = 65
First Quartile = 57
Mode = 72
Third Quartile = 84
Standard deviation = 2
60th Percentile = 73
P46 = 62
a) What final exam score did half the students’ scores surpass? ______________
b) What is the most common final exam score?________________
c) About what percent of the final exam scores are below 62? _____________
d) What percent of the final exam scores are above 73? __________
e) What final exam score is 1.25 standard deviations above the mean? _______________
f) About what percent of final exam scores are above 84? ___________
g) What final exam score is 2.5 standard deviations below the mean? ________
h) Suppose the final exam scores have a bell-shaped distribution. About what percent of the
final exam scores will be between 66 and 70? ____________________
Section E
Summarizing Qualitative Data
In Section A, two types of variables were discussed: qualitative (categorical) and quantitative. In this
section, you will learn how to organize and summarize qualitative data using tables and graphs.
The frequency of a category is the number of times it occurs in the data set.
A frequency distribution lists each category of data and the number of occurrences for each category of
data.
The relative frequency is the ratio (proportion or fraction) of the frequency of each category to the total
frequency and it is found by
relative frequency = frequency / (sum of all frequencies)
A relative frequency distribution lists each category of data together with the relative frequency.
Bar graphs and pie charts are used to represent qualitative data graphically.
Examples
1) The class levels of 25 students in an elementary statistics course are as follows:
Fr So So Jr Jr Jr Sr Fr So Jr So So Jr Sr Sr So Jr So So Jr Sr Jr So So Sr
a) Construct a frequency distribution.
b) Add a relative frequency column to the frequency distribution.
c) What percent of the data are sophomores?
d) What is the mode?
e) How many students are juniors?
f) Construct a bar graph using the frequency of the data.
g) Construct a pie chart using the relative frequency of the data.
a and b)
Class Level    Frequency    Relative Frequency
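As a rough check on parts (a) through (e), the frequency and relative frequency distributions can be tallied in a few lines of Python. This is only a sketch: the variable names (`class_levels`, `frequency`, `relative_frequency`, `mode`) are illustrative choices, and the data are the 25 class levels listed in the example.

```python
from collections import Counter

# The 25 class levels from Example 1.
class_levels = (
    "Fr So So Jr Jr Jr Sr Fr So Jr So So Jr "
    "Sr Sr So Jr So So Jr Sr Jr So So Sr"
).split()

# Frequency: the number of times each category occurs.
frequency = Counter(class_levels)

# Relative frequency: frequency divided by the sum of all frequencies.
total = sum(frequency.values())
relative_frequency = {level: count / total for level, count in frequency.items()}

# The mode is the category with the largest frequency.
mode = max(frequency, key=frequency.get)

print("Class Level  Frequency  Relative Frequency")
for level in ["Fr", "So", "Jr", "Sr"]:
    print(f"{level:<11}  {frequency[level]:>9}  {relative_frequency[level]:>18.2f}")
print("Mode:", mode)
```

Multiplying a relative frequency by 100 gives the corresponding percent, which is what part (c) asks for.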
2) A researcher evaluated the taste of four leading brands of instant coffee by having a sample of 80
individuals taste each coffee and then select their favorite. The results are given:
A D B C D B B D B C A B B B C C D B D B
A A D B D B A B B C B D C A B B D D A B
B A B B B A B C C D C B A C B D D B A B
A B D B B A D D D B C D B C B B B B D B
a) Construct a Frequency table.
Brand    Frequency
b) Add a relative frequency column to the distribution you
constructed in part (a). Round answers to two decimal places.
Relative Frequency
c) What percent of the people chose Brand C as their favorite?
d) How many people chose Brand A as their favorite?
e) What percent of the people chose Brand B or Brand D as their favorite?
f) What is the mode?
g) Construct a bar graph using the frequency of the data.
h) Construct a pie chart using the relative frequency of the data.
Section E: Homework
1) The following is the eye color of a random sample of 35 patients who go to Hamilton Eye Care
Associates.
blue hazel green green brown brown brown blue green hazel green green brown
brown blue hazel brown green brown green green blue brown brown brown blue
green brown hazel green green brown brown blue hazel
a) Construct a Frequency table.
Category
Frequency
b) Add a relative frequency column to the distribution you
constructed in part (a). Round answers to two decimal places.
Relative Frequency
c) What is the mode?
d) How many patients have blue eyes?
e) What percent of the patients have hazel eyes?
f) What percent of the patients have brown or green eyes?
g) Construct a bar graph.
h) Construct a pie chart.
2) The network data for the Top 40 TV shows of all time by IGN Entertainment is as follows:
HBO, HBO, HBO, NBC, HBO, NBC, BBC, NBC
CBS, CBS, NBC, FOX, AMC, FOX, NBC, FOX
NBC, CBS, NBC, COMEDY CENTRAL, COMEDY CENTRAL, HBO, ABC, CBS
AMC, NBC, HBO, FOX, ABC, SCI-FI, NBC, BBC
FOX, NBC, COMEDY CENTRAL, PBS, ABC, THE WB, HBO, PBS
a) Construct a Frequency table.
Category    Frequency
b) Add a relative frequency column to the distribution you
constructed in part (a). Round answers to three decimal places.
Relative Frequency
c) How many of the Top 40 TV shows of all time were on NBC?
d) What percent of the Top 40 TV shows of all time were on HBO?
e) How many of the Top 40 TV shows of all time were on ABC, NBC or CBS?
f) What percent of the Top 40 TV shows of all time were on FOX or COMEDY CENTRAL?
g) What is the mode?
Section F
Summarizing Quantitative Data
Recall, there are two types of quantitative data: discrete (countable) and continuous (measurable). In this
section, you will learn how to organize and summarize the two types of quantitative data using tables and
graphs.
Classes are distinct data values or intervals of equal width that cover all the values in a data set.
Organizing Discrete Data in a Table: use the values of the discrete variable to create classes when the
number of distinct data values is small.
Organizing Discrete/Continuous Data in a Table: when a data set consists of a large number of different
discrete data values or when a data set consists of continuous data, we must create classes by using
intervals of numbers. The width of each class/interval must be the same.
Lower class limit: the smallest value that can go in the class
Upper class limit: the smallest value that can go in the next higher class; the upper class limit of
the class is the same as the lower class limit of the next higher class
Class width: the difference between the upper and lower class limits
To make creating frequency tables easier we will use the symbol ⪪ which means “up to but not
including”.
For example, an interval written 55 ⪪ 65 would contain data values 55 up to but not including 65.
The frequency of a class is the number of observations in the class.
A frequency distribution lists each class together with its frequency.
The relative frequency is the ratio (proportion or fraction) of the frequency of each class to the total
frequency and it is found by
relative frequency = frequency / (sum of all frequencies)
A relative frequency distribution lists each class together with its relative frequency.
A histogram is a graphical display of a quantitative frequency table. It is constructed by drawing a
rectangle for each class of data on the xy-coordinate system. The x-axis shows the class limits and the
y-axis shows the frequency or relative frequency of the class.
• The widths of the rectangles are the same.
• The rectangles touch.
Identifying the shape of a Distribution using a histogram
Uniform
Symmetric (Bell-shaped)
Right Skewed
Left Skewed
Mode – peak or high point of a histogram: unimodal – one mode, bimodal – two modes
Examples
1) A researcher with A.C. Nielsen wanted to determine the number of televisions in households. He
conducted a survey of 40 randomly selected households and obtained the following data.
1 1 4 2 3 3 5 1 1 2 2 4 1 1 0 3 1 2 2 1
3 1 1 3 2 3 2 2 1 2 3 2 1 2 2 2 1 3 1 3
a) Construct a frequency table.
Class
Frequency
b) Add a relative frequency column to the frequency table you
constructed in part (a). Round answers to three decimal places.
Relative Frequency
c) How many households have at least 3 televisions?
d) What percent of households have 1 television?
e) Construct a frequency histogram of the data.
f) Describe the shape of the distribution.
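When the number of distinct values is small, each value is its own class, so the table and a rough histogram can be produced together. The following is a minimal sketch using the 40 household counts from this example. The text-based bars and the names `counts`, `at_least_3` and `percent_one_tv` are illustrative choices, not part of the course material.

```python
from collections import Counter

# Number of televisions in each of the 40 surveyed households (Example 1).
televisions = [
    1, 1, 4, 2, 3, 3, 5, 1, 1, 2, 2, 4, 1, 1, 0, 3, 1, 2, 2, 1,
    3, 1, 1, 3, 2, 3, 2, 2, 1, 2, 3, 2, 1, 2, 2, 2, 1, 3, 1, 3,
]

counts = Counter(televisions)
n = len(televisions)

# Each distinct value is a class; one '*' per household stands in for a histogram bar.
print("Class | Bar            Freq  Rel. freq")
for value in range(min(televisions), max(televisions) + 1):
    freq = counts[value]
    print(f"{value:>5} | {'*' * freq:<14} {freq:>4}  {freq / n:.3f}")

# "At least 3" means 3 or more televisions.
at_least_3 = sum(freq for value, freq in counts.items() if value >= 3)

# Percent of households with exactly 1 television.
percent_one_tv = 100 * counts[1] / n
```

Reading the bars from top to bottom gives the same picture as a frequency histogram turned on its side, which helps when describing the shape in part (f).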
2) The Jefferson National Bank has five tellers available to serve customers. The data in the following
table provide the number of busy tellers observed at 30 spot checks.
5 3 4 5 4 2 1 4 5 3 5 4 1 5 5
0 5 4 5 4 3 4 2 3 0 2 1 4 2 3
a) Construct a frequency table.
Class    Frequency
b) Add a relative frequency column to the frequency table you
constructed in part (a). Round answers to three decimal places.
Relative Frequency
c) How many times are more than 4 tellers busy?
d) What percent of the time are less than 3 tellers busy?
e) Construct a histogram of the data.
f) Describe the shape of the distribution.
3) The exam scores for the 25 students in an introductory statistics class are as follows:
34 39 54 58 60 63 64 67 70 75 77 78 76 81 82 84 85 86 88 89 89 90 96 96 99
a) Construct a frequency table.
(Starting with 30 ⪪ 40)
Interval
30 ⪪ 40
b) Add a relative frequency column to the frequency table you
constructed in part (a). Round answers to two decimal places.
Frequency Relative Frequency
c) How many students had exam scores between 70 and 90, including 70 but not including 90?
d) What percent of the students had exam scores less than 60?
e) Construct a frequency histogram of the data.
f) Describe the shape of the distribution.
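The "up to but not including" rule for class intervals can be sketched in Python as a half-open comparison, `lower <= value < upper`. The helper `class_frequencies` and its parameters are hypothetical names introduced for illustration. The choice of seven classes of width 10 follows from starting at 30 ⪪ 40 and covering the largest score, 99.

```python
# Exam scores of the 25 students in Example 3.
scores = [34, 39, 54, 58, 60, 63, 64, 67, 70, 75, 77, 78, 76,
          81, 82, 84, 85, 86, 88, 89, 89, 90, 96, 96, 99]


def class_frequencies(values, start, width, num_classes):
    """Count values in classes [lower, upper): lower is included, upper is not."""
    table = []
    for k in range(num_classes):
        lower = start + k * width
        upper = lower + width
        freq = sum(1 for v in values if lower <= v < upper)
        table.append((lower, upper, freq))
    return table


table = class_frequencies(scores, start=30, width=10, num_classes=7)
n = len(scores)

print("Interval          Frequency  Relative Frequency")
for lower, upper, freq in table:
    print(f"{lower:>3} up to {upper:<6} {freq:>9}  {freq / n:>18.2f}")
```

A score of exactly 90 is counted in the class that starts at 90, not in the class that ends at 90, which matches the wording of part (c).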
4) The Food and Nutrition Board of the National Academy of Sciences states that the recommended daily
allowance of iron is 18mg for adult females under the age of 51. The amounts of iron intake, in milligrams,
during a 24-hour period for a sample of 45 such females follows.
9.1 9.4 10.7 10.9 11.0 11.5 11.8 12.2 12.3
12.5 12.6 12.8 13.1 13.4 13.6 13.6 13.8 14.2
14.4 14.5 14.6 14.6 14.7 15.0 15.1 15.3 15.6
16.0 16.3 16.3 16.4 16.6 16.6 16.8 17.0 17.3
18.1 18.1 18.2 18.3 18.3 18.6 19.5 19.8 20.7
a) Construct a frequency table, using 6 classes
starting with a value of 9.
Interval
b) Add a relative frequency column to the frequency
table you constructed in part (a). Round answers
to three decimal places.
Frequency Relative Frequency
c) How many females had an iron intake of at least 15 milligrams?
d) What percent of the females had an iron intake of between 9 and 17, including 9 but not including 17?
e) Construct a frequency histogram of the data.
f) Describe the shape of the distribution.
Section F: Homework
1) An anthropologist takes a random sample of 30 households and finds the following number of people
living in each household.
3 4 1 2 5 6 4 4 2 2 4 4 3 5 4 6 5 3 4 4 5 4 3 4 4 3 5 4 3 6
a) Construct a frequency table.
Class
Frequency
b) Add a relative frequency column to the frequency table you
constructed in part (a). (Round to two decimal places.)
Relative Frequency
c) How many households contain 4 people?
d) What percent of the households contain between 2 and 5 people, inclusive?
e) Construct a frequency histogram.
f) Describe the shape of the distribution.
2) The following data set is a random sample of 42 games with the total number of runs scored in the
game over the course of a softball season.
4 5 0 1 3 4 7 2 1 8 4 3 4 6 5 3 3 5 6 4 5
5 6 3 0 2 4 1 3 2 4 3 4 5 3 4 4 3 2 5 4 4
a) Construct a frequency table.
Class
Frequency
b) Add a relative frequency column to the frequency table
you constructed in part (a). (Round to two decimal places.)
Relative Frequency
c) How many games had 3 total runs scored?
d) What percent of the games had a total of 4 runs scored?
e) Construct a frequency histogram.
f) Describe the shape of the distribution.
3) Lisa Hertscar, a civil engineer, needs to determine if a traffic light needs to replace a stop sign at a
particular intersection. She keeps track of the number of cars that enter the intersection at randomly
chosen times of the day between 8am and 10pm over the course of 60 days. The results are as follows:
65 15 23 72 20 56 32 55 52 27 51 35 47 63 36 38 26 56 46 52 48
62 33 70 57 44 47 43 41 46 38 53 51 45 57 62 60 21 47 55 53 46
37 43 56 58 66 46 49 68 32 55 49 42 68 53 46 57 52 57
a) Construct a frequency table.
(Starting with 15 ⪪ 25)
Interval
15 ⪪ 25
b) Add a relative frequency column to the frequency table you
constructed in part (a). (Round to two decimal places.)
Frequency Relative Frequency
c) On how many of the days were there more than 35 cars at the intersection?
d) Construct a frequency histogram.
e) Describe the shape of the distribution.
f) Do you think there needs to be a traffic light at the intersection? Why?
4) The following data shows the number of minutes a random sample of 50 patrons needed to wait to
renew their license at the Bakers Basin location.
20.3 13.5 55.2 18.6 65.7 37.3 38.7 41.3 43.5 32.7 20.7 27.4 25.1 27.3 37.2 35.9 27.4
47.3 55.8 40.3 46.2 38.1 53.2 57.2 63.0 51.5 40.7 36.5 19.5 26.4 32.5 23.2 53.2 47.2
43.9 55.6 65.2 31.7 42.6 44.7 53.8 35.2 25.7 31.3 47.3 42.3 32.5 22.7 42.7 37.2
a) Construct a frequency table.
(Starting with 10 ⪪ 20)
Interval
10 ⪪ 20
b) Add a relative frequency column to the frequency table you
constructed in part (a). (Round to two decimal places.)
Frequency Relative Frequency
c) How many people waited between 40 and 60 minutes, including 40 but not including 60?
d) What percent of the people waited between 10 and 30 minutes, including 10 but not including 30?
e) Construct a histogram.
f) Describe the shape of the distribution.
5) The following histogram was the result of a study done by Justin Time to determine the time it took
students to write a computer program and run it successfully.
a) How many students participated in Justin’s study?
b) How many students in the study took 5.5 or more hours to write and successfully run their program?
c) What percent of students in the study wrote and successfully ran their programs in less than 4.5 hours?
d) What percent of students took from 3.5 hours up to but not including 6.5 hours to write and
successfully run their programs?
e) In which of the five intervals would the median time be?
Section G
Summarizing Quantitative Data (Continued)
Stem-and-leaf plots – a simple way to display small data sets (similar to a histogram). Each number is split
into a stem part and a leaf part. The stem part can be any number of place values, but the leaf part can
only be one place value, usually the smallest place value in the number.
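The splitting rule can be sketched in Python with `divmod`, which separates a whole number into its last digit (the leaf) and everything before it (the stem). The function `stem_and_leaf` is a hypothetical helper that assumes whole-number data with the ones digit as the leaf. The sample input is the list of exam scores from Section G homework problem 1, used here only for illustration.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: [leaves]} for whole numbers, using the ones digit as the leaf."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # stem = all digits but the last, leaf = last digit
        plot[stem].append(leaf)
    # Include empty stems so that gaps in the data stay visible.
    return {s: plot.get(s, []) for s in range(min(plot), max(plot) + 1)}


# Exam scores from Section G homework problem 1, used only as sample input.
scores = [34, 39, 63, 64, 67, 70, 75, 76, 81, 82, 84, 85, 86, 88,
          89, 89, 90, 96, 96, 99, 102]

plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Keeping the empty stems (here 4 and 5) matters: leaving them out would hide the gap between the two lowest scores and the rest of the class.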
1) The following table presents the daily high temperatures for West Windsor Township, NJ, in degrees
Fahrenheit, for the winter months of January and February, 2018.
19 41 37 51 24 55 47 46 30 52
34 52 27 55 38 56 17 61 47 61
15 45 65 18 36 65 32 40 39 44
58 53 42 54 63 56 45 60 65 38
42 63 31 45 26 44 49 32 37 71
42 32 78 35 38 57 32 38 44
a) Construct a stem-and-leaf plot. What is the shape of the distribution?___________________
b) Repeat part (a), but split the stems, using two lines for each stem.
2) A pediatrician who tested the cholesterol levels of several young patients was alarmed to find that
many had levels over 200 mg per 100 mL. The readings of 20 patients with high levels are presented in
the following table. Construct a stem-and-leaf plot of the data and describe the shape of the distribution.
220 217 209 165 212 210 208 223 202 235
218 221 196 213 214 210 188 199 210 208
3) The Food and Nutrition Board of the National Academy of Sciences states that the recommended daily
allowance of iron is 18mg for adult females under the age of 51. The amounts of iron intake, in milligrams,
during a 24-hour period for a sample of 45 such females follows. Construct a stem-and-leaf plot of the
data and describe the shape of the distribution.
6.3 9.4 10.7 10.9 11.0 11.5 11.5 11.6 11.9
12.1 12.4 12.5 12.5 12.5 12.6 12.7 12.8 13.1
14.4 14.5 14.6 14.6 14.7 15.0 15.0 15.3 15.6
16.0 16.3 16.3 16.4 16.6 16.6 16.8 17.0 17.3
18.1 18.1 18.2 18.3 18.3 18.6 19.5 19.8 20.7
Back-to-back stem-and-leaf plot – used to compare two data sets.
Example:
Following are the running times (in minutes) for the 15 top-grossing movies rated G or PG and the 15
top-grossing movies rated R of all time, as of August 2018.
Movies Rated G or PG                          Running Time (min)
Incredibles 2                                 118
Beauty and The Beast (2017)                   129
Finding Dory                                  103
Star Wars: Episode I – The Phantom Menace     133
Star Wars                                     121
Shrek 2                                       93
E.T.: The Extra-Terrestrial                   117
The Lion King                                 89
Toy Story 3                                   103
Frozen                                        108
Finding Nemo                                  104
The Secret Life of Pets                       90
Despicable Me 2                               98
The Jungle Book (2016)                        105
Inside Out                                    94

Movies Rated R                                Running Time (min)
The Passion of Christ                         126
Deadpool                                      106
American Sniper                               132
It                                            135
Deadpool 2                                    119
The Matrix Reloaded                           138
The Hangover                                  96
The Hangover Part II                          102
Beverly Hills Cop                             105
The Exorcist                                  122
Logan                                         135
Ted                                           106
Saving Private Ryan                           170
300                                           117
Wedding Crashers                              113
a) Construct a back-to-back stem-and-leaf plot for these data sets.
b) Do the running times of R-rated movies differ greatly from the running times of movies rated G or PG,
or are they roughly similar?
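One possible way to lay out the back-to-back plot for part (a) is sketched below in Python. It assumes the tens-and-hundreds digits form the stem and the ones digit is the leaf. The helper `leaves_by_stem` and the column widths are illustrative choices. The running times are entered in the order they are listed in the example.

```python
from collections import defaultdict

# Running times in minutes, in the order listed in the example.
g_pg = [118, 129, 103, 133, 121, 93, 117, 89, 103, 108, 104, 90, 98, 105, 94]
rated_r = [126, 106, 132, 135, 119, 138, 96, 102, 105, 122, 135, 106, 170, 117, 113]


def leaves_by_stem(values):
    """Group the ones digits (leaves) under the remaining digits (stems)."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    return groups


left, right = leaves_by_stem(g_pg), leaves_by_stem(rated_r)
stems = range(min(min(left), min(right)), max(max(left), max(right)) + 1)

rows = []
for stem in stems:
    # Left-hand leaves are written outward from the stem, so reverse them.
    left_text = "".join(str(d) for d in reversed(left.get(stem, [])))
    right_text = "".join(str(d) for d in right.get(stem, []))
    rows.append(f"{left_text:>6} | {stem:>2} | {right_text}")

print("  G/PG | st | R")
print("\n".join(rows))
```

Because both groups share one column of stems, the two sides can be compared row by row when answering part (b).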
Section G: Homework
1) The exam scores for the students in an introductory statistics class are as follows:
34 39 63 64 67 70 75 76 81 82 84 85 86 88 89 89 90 96 96 99 102
Construct a stem-and-leaf plot and describe the shape of the distribution.
2) Construct a stem-and-leaf plot for the following data and describe the shape of the distribution.
56 31 42 34 78 16 78 98 19 4
96 25 27 53 31 17 21 50 25 6
37 45 49 92 47 54 103 23 38 48
47 24 18 58 94 77 47
3) A soft-drink bottler sells “one-liter” bottles of soda. A consumer group is concerned that the bottler
may be shortchanging customers. Thirty bottles of soda are randomly selected. The contents, in milliliters,
of the bottles chosen are shown below.
1025 986 1006 977 963 1030 1018 1010 991 975
988 999 977 1028 997 990 989 996 986 1001
1014 1004 984 993 1031 974 995 964 1017 987
Construct a stem-and-leaf plot and describe the shape of the distribution.
Is the bottler shortchanging customers?
4) A sample of 35 liberal-arts graduates yielded the following starting annual salaries. Data are in
thousands of dollars, rounded to the nearest hundred dollars.
49.0 45.8 50.3 49.6 50.0 47.7 51.8 47.3 46.7 47.0 48.1 50.1 43.6 48.0
47.7 49.8 46.4 46.1 48.5 48.9 48.2 48.1 46.2 47.3 51.7 49.0 48.2 49.9
48.1 49.8 49.5 50.4 45.3 45.3 46.5
Construct a stem-and-leaf plot and describe the shape of the distribution.
5) The following back-to-back stem-and-leaf plot represents the results of two random samples obtained
by Millie Gramm. The first random sample consisted of weights of carry-on luggage used by business
travelers at an airport. The second random sample consisted of weights of carry-on luggage used by
non-business travelers at the same airport.
Business
69
799
44468
23699
178
24
0
1
2
3
4
5
Non-Business
89
2358
11347889
022246
a) What is the mode for the business travelers?___________
b) What is the mode for the non-business travelers?________
c) Which group is more symmetrical? ____________
Section H
Misleading Graphs
Statistical graphs, when properly used, are powerful forms of communication. Unfortunately, when
graphs are improperly used, they can misrepresent the data and lead people to draw incorrect
conclusions.
Three of the most common forms of misrepresentation:
1) Incorrect position of the vertical scale
2) Incorrect sizing of graphical images
3) Misleading perspective for three-dimensional diagrams
1) The baseline of a graph or plot is the value at which the horizontal axis intersects with the vertical axis.
With graphs or plots that represent how much or how many of something, it may be misleading if the
baseline is not at zero.
[Figure: two graphs, both titled "Average Cost of a House Per Year", with Year (2016 to 2019) on the horizontal axis and Average Cost on the vertical axis. One graph uses a vertical scale from 0 to 450,000; the other uses a vertical scale from 345,000 to 390,000.]
2) Area Principle: When amounts are compared by constructing an image for each amount, the areas of
the images must be proportional to the amounts. For example, if one amount is twice as much as another,
its image should have twice as much area as the other image.
The average sales price of a house in 1990 was $149,800 and in 2020 the average sales price of a house
had risen to $389,400. Note that the price in 2020 is about 2.6 times the price in 1990.
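The arithmetic behind the Area Principle can be sketched in Python using the two prices above. The scenario in which the 2020 image is enlarged by the price ratio in both height and width is an assumption chosen to illustrate the distortion; the variable names are illustrative.

```python
import math

price_1990 = 149_800
price_2020 = 389_400

# How many times larger the 2020 amount is than the 1990 amount.
ratio = price_2020 / price_1990

# If the 2020 image is drawn 'ratio' times as tall AND 'ratio' times as wide,
# its area grows by ratio squared, which overstates the increase.
misleading_area_ratio = ratio ** 2

# To keep areas proportional to the amounts, scale each dimension by sqrt(ratio).
fair_linear_scale = math.sqrt(ratio)

print(f"price ratio: {ratio:.2f}")
print(f"area ratio when both dimensions are scaled by the price ratio: {misleading_area_ratio:.2f}")
print(f"per-dimension scale that keeps area proportional: {fair_linear_scale:.2f}")
```

Under that assumption the larger image would cover nearly seven times the area of the smaller one, even though the price is only about 2.6 times as large.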
[Figure: "Average Sales Price of a House", with one image for 1990 and one for 2020.]
3) 3-D graphs are often drawn as though the reader is looking down on them. This makes the bars look
shorter than they really are.
[Figure: 3-D bar graph titled "Average Sales Price of a House", with bars for 1990 and 2020 and a vertical scale from 0 to 400,000.]
Section H: Homework
1) The following graphs represent the number of people who purchased a particular item online from
Jennifer’s Jewelry store during the years 2018, 2019 and 2020. Which graph is misleading? Why?
[Figure: Graph 1 and Graph 2, each showing Number Sold for 2018, 2019 and 2020. One graph uses a vertical scale from 0 to 500; the other uses a vertical scale from 430 to 500.]
2) The number of girls who played softball in 2020 has tripled from the number of girls who played
softball in 2010.
a) Does the pictograph below present this information accurately?
b) Why or why not?
[Pictograph: one image labeled 2010 and one image labeled 2020.]
3) Explain why 3-dimensional graphs can be misleading.