
Rahman 101276022 Assignment 1

Intro to Statistics Assignment 1
ECON 2210A
October 7th 2024
Muminur Rahman (Student ID: 101276022)
Question 1-16: Suppose a university conducts a survey to ask about potential fee increases for
using the university’s recreational center. The university decides it will be easier to ask
only graduating seniors about the proposed fee increase. What kind of bias is likely present
in this data collection approach?
Response: The university's data collection approach is likely to suffer from selection bias.
Selection bias occurs when the group that is sampled does not represent the population the
results are meant to describe. In this case, only graduating seniors are surveyed, while the
students who will actually pay the higher fees in later years have no say. Because graduating
seniors will not be affected by the increase, their answers are likely to differ from those of
students who are not graduating in the current year. As a result, the survey may lead the
university to inaccurate conclusions, and may spark conflict with current students who plan on
coming back to the university the following year.
Question 1-36: Give the name of the kind of sampling that was most likely used in each of the
following cases:
a) a Wall Street Journal poll of 2,000 people to determine the president’s approval
rating
ANS: This is most likely an example of simple random sampling. This type of sampling selects
items from a population so that every possible sample of a given size has an equal chance of
being selected. Here, the 2,000 respondents would be drawn at random, so that each person in
the population has an equal chance of being asked about the president’s approval rating.
b) A poll taken of each of the General Motors (GM) dealerships in Ohio in December
to determine an estimate of the average number of Chevrolets not yet sold by GM
dealerships in the United States
ANS: This can be viewed as cluster sampling. In cluster sampling, the population is divided
into groups (clusters), one or more clusters are selected, and the items within the selected
clusters are measured. Here, GM dealerships are grouped by state, Ohio is the selected
cluster, and every dealership in Ohio is polled in order to estimate the average number of
Chevrolets not yet sold by dealerships in the United States.
c) A quality-assurance procedure within a Frito-Lay manufacturing plant that tests
every bag of Fritos Corn Chips produced to make sure the bag is sealed properly
ANS: The Frito-Lay manufacturing plant tests every bag of Fritos Corn Chips it produces. This
is a census, because the entire set of measurements is collected rather than a sample.
Every bag in the population is checked, so the data are a complete enumeration.
d) A sampling technique in which a random sample from each of the tax brackets is
obtained by the Internal Revenue Service to audit tax returns
ANS: Each tax bracket can be considered a subgroup, or “stratum”, of the population of tax
returns. Since the Internal Revenue Service draws a random sample from each tax bracket,
this is stratified random sampling, where a population is divided into strata based on a
characteristic, and a random sample is then taken within each stratum.
Question 1-42: For each of the following, indicate whether the data are cross-sectional or time-
series:
a) Quarterly Unemployment rates
ANS: Because the data for unemployment rate is collected every quarter, it is time series
data. The data being collected is observed through points in time instead of one single
point.
b) Unemployment Rates by State
ANS: This data is observed at a single point in time and compares the same variable across
different states. It is therefore cross-sectional data.
c) Monthly Sales
ANS: Because monthly sales are recorded at regular intervals in order to track sales over time,
this is an example of time-series data.
d) Employment Satisfaction Data for a Company
ANS: Assuming the company collects this data at a single point in time and compares
satisfaction across groups of employees, this is considered to be cross-sectional data. If the
company collected employee satisfaction data repeatedly over time, it would be time-series
data, but with the given information we can only assume that it is cross-sectional.
Question 1-50: As part of an economics research study, an analyst has accessed data compiled
by the U.S. Bureau of Labor Statistics. The data are in a file named BLS County Data
(source: www.ers.usda.gov). Consider the data in columns A–L, and indicate what level of
data is represented by the variables in each column.
A) FIPS Codes
ANS: FIPS codes are numbers that identify geographic areas in the United States, such as
states and counties. The FIPS codes in column A carry no quantity or order; they are used only
as identifiers for each area. Because FIPS codes don’t have any numerical meaning, column
A is nominal-level data.
B/C) State, Area Name
ANS: Since the state abbreviations and area names hold no numerical value and are not ranked,
columns B and C are nominal-level data.
D/F) Rural Urban Continuum Code (2003/2013)
ANS: A rural-urban continuum code is used to classify counties based on the degree of
urbanization in their area. The numbers displayed in these columns label the category each
county falls into, from urban to rural, rather than measuring a quantity. Because the numbers
do not give any meaningful numerical measurement, and the values do not have a true zero,
columns D and F are considered to be nominal-level data.
E/G) Urban Influence codes
ANS: Like the rural-urban continuum codes, the urban influence codes (UIC) are considered to be
nominal data, because they are used to categorize counties by the influence of nearby urban
areas rather than to record numerical measurements.
H/L) Civilian Labour Force
ANS: The civilian labor force is the number of people within the area who are employed, plus
those who are unemployed but are actively looking for a job. It is best described as ratio
data, because the values are counts and therefore quantitative, and the civilian labor force
has a true zero point. A value of zero would mean that nobody in the population is employed
or looking for work.
I) Employed
ANS: The employed column represents the number of people within the area who are currently
employed. The values are meaningful quantities because they count the employed people
within a given area. The column also has a true zero point, where nobody is employed. The
data is best described as ratio data.
J) Unemployed
ANS: Like the employed column, the unemployed column is best described as ratio data. The
column represents the number of people who are unemployed but still actively looking for
work. The counts have a true zero point, where nobody is considered unemployed.
K) Unemployment Rate
ANS: The unemployment rate shows the rate of unemployment within the area and is found by
dividing the number of unemployed by the civilian labour force and expressing the result as a
percentage. This column is also ratio data, because the rate has a true zero: it is zero when
the number of unemployed is zero.
Question 2-20:
a) Using the 2^k ≥ n guideline, what is the minimum number of classes that should be
used to display the data in the “Total” column in a grouped data frequency
distribution?
ANS: In the “Total” column, there are 41 data values that need to be grouped. Using the 2^k ≥ n
guideline, we must find the smallest k for which 2^k is greater than or equal to 41.
2^5 = 32, and 2^6 = 64
Since 2^5 = 32 is less than 41 and 2^6 = 64 is the smallest power of 2 that is at least 41,
the minimum number of classes k should be 6.
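The search for the smallest k can be sketched in a few lines of Python. This is only an illustration of the 2^k ≥ n guideline; the function name min_classes is our own and is not part of the assignment.

```python
def min_classes(n: int) -> int:
    """Return the smallest k such that 2**k >= n (the 2^k >= n guideline)."""
    k = 0
    while 2 ** k < n:
        k += 1
    return k


# 41 values in the "Total" column: 2^5 = 32 < 41 <= 64 = 2^6
k = min_classes(41)
print(k)  # 6
```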
b) Referring to part a, what should the class width be, assuming you
round the width up to the nearest 1,000 passengers?
The class width is determined by dividing the range by the number of classes. In order to find
the range, we must find the maximum number of passengers in the Total column, and
subtract the minimum number of passengers from it.
Maximum amount : 602,708 by Southwest Airlines Co
Minimum amount : 160 by Caribbean Sun Airline Inc.
Range = (Max – Min) = 602,708 – 160 = 602,548 passengers
Now with this information, we divide the range by the number of classes we have, which is 6.
Width of each class = Range / Number of classes = 602,548 / 6 = 100,424.67
Therefore, the class width should be 101,000 passengers (rounded up to the nearest
1,000).
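As a quick check of the width calculation, here is a minimal Python sketch. It assumes the maximum, minimum, and class count stated above; the helper name class_width is hypothetical. Because the question asks to round up, the raw width of 100,424.67 becomes 101,000.

```python
import math


def class_width(maximum: float, minimum: float, num_classes: int, round_to: int = 1000) -> int:
    """Divide the range by the number of classes, then round UP to a multiple of round_to."""
    raw_width = (maximum - minimum) / num_classes
    return math.ceil(raw_width / round_to) * round_to


width = class_width(602_708, 160, 6)
print(width)  # 101000
```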
c) Construct and interpret a frequency histogram for the data
- The frequency histogram can be read as follows. The range for each class is shown along the
horizontal axis, while the vertical axis shows the number of airlines that fall within each
range. From the frequency histogram, we can conclude that in December of last year, most
airlines at the Orlando airport carried between 160 and 100,584 passengers.
Question 2-22:
A) Using Excel’s Insert Statistical Chart feature, construct a histogram of the coffee
consumption data. Change the bin width to  and include data labels on the
histogram. Add all appropriate titles. Briefly comment on what the histogram
reveals concerning the data.
- In order to create a histogram with the given information, we need to know
how many classes there should be. To find that, we can use the 2^k ≥ n
guideline.
- There are 100 observations taken from Finnish coffee drinkers. Since 2^6 = 64 is less than
100 and 2^7 = 128 is the smallest power of 2 that is at least 100, there are 7 classes.
- The histogram above shows how frequently each range of annual coffee consumption occurs
among the sampled drinkers. From the histogram, we can see that the most frequent amount of
coffee consumed by Finnish coffee drinkers is around 11.1 - 13.0 kilograms.
B) Develop a relative frequency distribution and a cumulative relative
frequency distribution of the coffee data using the same classes as
the histogram. What percentage of the coffee drinkers sampled
consume 13.1 kg or more annually?
In order to find the percentage for the frequency distribution, we must know the class
amount, count the frequency in each class, then divide the frequency of the class by
the total number of data points
- Classes = 7
- Frequency for each class: [7.0 - 9.0] = 7, [9.1 - 11.0] = 19, [11.1 - 13.0] = 44,
[13.1 - 15.0] = 25, [15.1 - 17.0] = 4, [17.1 - 19.0] = 1, [19.1 - 21.0] = 0
- Total number of data points: 100
- Relative frequency = (frequency of class) / (total number of data points)
o Class 1 [7.0 - 9.0] = 7 / 100 = 0.07 = 7%
o Class 2 [9.1 - 11.0] = 19 / 100 = 0.19 = 19%
o Class 3 [11.1 - 13.0] = 44 / 100 = 0.44 = 44%
o Class 4 [13.1 - 15.0] = 25 / 100 = 0.25 = 25%
o Class 5 [15.1 - 17.0] = 4 / 100 = 0.04 = 4%
o Class 6 [17.1 - 19.0] = 1 / 100 = 0.01 = 1%
o Class 7 [19.1 - 21.0] = 0 / 100 = 0 = 0%
- In order to create the cumulative frequency distribution, we add the frequency of each
class to the frequencies of all the classes before it. Because there are 100 data points,
each cumulative frequency is also the cumulative relative frequency in percent.
o Since class 1 has no class before it, the cumulative frequency is 7.
o Class 2 has a cumulative frequency of 7 + 19, which is 26.
o Class 3 has a cumulative frequency of 26 + 44, which is 70.
o Class 4 has a cumulative frequency of 70 + 25, which is 95.
o Class 5 has a cumulative frequency of 95 + 4, which is 99.
o Class 6 has a cumulative frequency of 99 + 1, which is 100.
o Since class 7 has a frequency of 0, the cumulative frequency of class 7 is
also 100.
To answer the question of what percentage of coffee drinkers consumed 13.1 kg or more
annually, the answer is 30%. This is found by adding the percentages of each class
at 13.1 kg and above, which is class 4 and up: 25% + 4% + 1% + 0% = 30%.
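The relative and cumulative relative frequencies above can be reproduced with a short Python sketch. It uses only the class frequencies listed in this answer; the variable names are our own.

```python
from itertools import accumulate

# Class frequencies for the seven classes, in order from [7.0 - 9.0] to [19.1 - 21.0]
freq = [7, 19, 44, 25, 4, 1, 0]
n = sum(freq)  # total number of data points (100)

rel_freq = [f / n for f in freq]      # relative frequency of each class
cum_freq = list(accumulate(freq))     # running total of the frequencies
cum_rel = [c / n for c in cum_freq]   # cumulative relative frequency

print(cum_freq)  # [7, 26, 70, 95, 99, 100, 100]

# Share of drinkers in class 4 and up (13.1 kg or more)
share_13_1_plus = sum(freq[3:]) / n
print(f"{share_13_1_plus:.0%}")  # 30%
```

Summing the integer frequencies before dividing avoids small floating-point rounding differences that can appear when the decimal percentages are added directly.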
Question 2-42: A health insurance company selected a random sample of hospitals from each
of four categories of hospitals: university related, religious related, community owned,
and privately owned. At issue is the hospital charges associated with outpatient
gallbladder surgery. The data are in the file called Hospitals.
1) Compute the average charge for each hospital category
When computing the average charge for each of the hospital categories, we take the mean
of the charges in that category, which is the sum of the charges divided by the number of
hospitals.
Average charge = (sum of all charges for X hospitals) / (total number of X hospitals)
For university related hospitals, this can be computed in Excel with =SUM(A2:A11)/10
Average Charge of university related hospitals = 63980 / 10
= 6390
Therefore, the average charge for university related hospitals is $6390.
- The average charge for religious affiliated hospitals can be found by using
=SUM(B2:B10)/9.
Average charge = (sum of all charges for X hospitals) / (total number of X hospitals)
Average charge of religious related hospitals = 32320 / 9
= 3591
Therefore, the average charge for religious hospitals comes to $3591.
- The average charge for municipally owned hospitals can be found by using
=SUM(C2:C9)/8
Average charge = (sum of all charges for X hospitals) / (total number of X hospitals)
The average charge for municipally owned hospitals = 36905 / 8
= 4613
Therefore, the average charge for municipally owned hospitals is $4613.
- The average charge for private hospitals can be found by using =SUM(D2:D10)/9
Average charge for privately owned hospitals = 46715 / 9
= 5191
Therefore, the average charge for privately owned hospitals is $5191.
2) Construct a bar chart showing the averages by hospital category.
- The graph above compares the average charge for the four hospital categories. The
vertical axis shows the dollar amount of each average, and the horizontal axis labels
each of the categories.
- By looking at the graph, we can see that university hospitals charge the highest amount
on average, whereas religious hospitals charge the lowest amount on average.
3) Discuss why a pie chart would not in this case be an appropriate
graphical tool.
A pie chart shows how a whole is divided into parts, with each slice expressed as a
percentage of the total. This is useful when a single total is split among several categories.
A pie chart is not appropriate in this situation because we are comparing the average charges
of 4 different categories, and these averages are not parts of any meaningful total. A bar
chart displays each average on its own and helps us compare the values to one another.
The average charges cannot sensibly be converted into percentages of a whole, because they
do not come from one shared pool of money.
Question 2-52: Johnson Oil and Gas owns a series of gasoline stations in northern
Ohio. Below are data for July 1 retail gasoline prices at one of the stations
and the price per gallon of propane.
1) Construct a line chart of regular grade gasoline for the years shown
- The graph above shows a line chart of the regular grade price per gallon. The vertical
axis shows the price per gallon, and the horizontal axis shows the year. In the graph we
can see an upward trend, with a downward trend forming after the year 2015. We also see
a large decrease in price from 2020 to 2021. This decrease may have been caused by the
Covid-19 pandemic, during which the demand for gasoline fell sharply.
2) Construct a line chart of propane for the years shown.
- The graph above shows the line chart for propane prices by year. In this graph we can see
a steady upward trend from 2002 to 2021. The graph shows no noticeable disruptions to
propane prices over the period, as they rise from year to year.
3) Construct the appropriate chart for determining whether there is a
relationship between gasoline and propane prices. Briefly comment
on the nature of any relationship you believe your chart reveals.
- To compare the two series, we can construct a multi-line chart, which plots gasoline
and propane prices together on a single graph. In this graph we can see how both lines
slope upwards, with gasoline consistently priced higher than propane up until 2020. From
this, changes in gasoline prices do not appear to have an effect on propane prices. We can
also see that the Covid-19 pandemic appears to have affected only the gasoline prices, and
had no visible effect on the propane prices.
Question 2-52: As part of a study on its restaurant wait times, the manager of a
Phoenix restaurant recently sampled 18 customers and recorded the time,
in minutes, each was required to wait before being seated. The following
sampled times were measured:
a) Compute the mean wait time for this sample of customers.
To find the mean of the sample data, we add all the values together, then divide the
total by the number of values we have. In Excel, this can be done by computing
=AVERAGE(A2:A19)
Σx = (19 + 20 + 24 + 24 + 27 + 33 + 34 + 34 + 34 + 36 + 39 + 43 + 43 + 54 + 54 + 55 +
56 + 74) = 703
n = 18
x̄ = Σx / n = 703 / 18 = 39.06
Therefore, the mean wait time is about 39.06 minutes.
b) Compute the median wait time for this sample of customers.
To find the median, we need to arrange the set of values from smallest to largest.
(19, 20, 24, 24, 27, 33, 34, 34, 34, 36, 39, 43, 43, 54, 54, 55, 56, 74)
The median is the middle value of the ordered list. Because we have an even number of
values (18), the median is the average of the two middle values, which are the 9th and
10th values.
Median = (34 + 36) / 2 = 70 / 2 = 35
Therefore, the median wait time is 35 minutes. In Excel, this is computed as
=MEDIAN(A2:A19)
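As a cross-check on the Excel formulas, the mean and median can be sketched in Python with the standard statistics module. The list below is the ordered set of 18 wait times from part b; the variable names are our own.

```python
import statistics

# The 18 sampled wait times (minutes), in ascending order
waits = [19, 20, 24, 24, 27, 33, 34, 34, 34, 36, 39, 43, 43, 54, 54, 55, 56, 74]

mean_wait = statistics.mean(waits)      # sum of the values / number of values = 703 / 18
median_wait = statistics.median(waits)  # average of the 9th and 10th ordered values

print(round(mean_wait, 2), median_wait)  # 39.06 35.0
```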
c) Compute the variance and standard deviation of wait times for this
sample of customers.
To find the variance, we first need the sum of all the values, raised to the power of 2
(squared):
Σx = 703
(Σx)^2 = 703^2 = 494209
Next we need the sum of the squared values, Σx^2. This means we square every value in the
data set and add the results.
(19^2 + 20^2 + 24^2 + 24^2 + 27^2 + 33^2 + 34^2 + 34^2 + 34^2 + 36^2 + 39^2 + 43^2 + 43^2 +
54^2 + 54^2 + 55^2 + 56^2 + 74^2) = 31183
We can now plug our information into our equation:
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{31183 - \frac{494209}{18}}{18 - 1} = 219.23$$
Therefore, the variance of the wait times is 219.23 (in minutes squared).
To find the standard deviation, all we need to do is take the square root of the variance.
√219.23 = 14.8
Therefore, the standard deviation is 14.8 minutes.
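The shortcut formula used above can be checked with a small Python sketch. It assumes the 18 wait times listed in part b and compares the hand calculation against statistics.variance, which also computes the sample variance (dividing by n - 1).

```python
import math
import statistics

waits = [19, 20, 24, 24, 27, 33, 34, 34, 34, 36, 39, 43, 43, 54, 54, 55, 56, 74]

n = len(waits)                     # 18
sum_x = sum(waits)                 # sum of the values: 703
sum_x2 = sum(x * x for x in waits) # sum of the squared values: 31183

# Shortcut formula for the sample variance
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

print(round(variance, 2), round(std_dev, 1))  # 219.23 14.8

# The library routine should agree with the shortcut formula
assert abs(variance - statistics.variance(waits)) < 1e-9
```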
d) Develop a frequency distribution using six classes, each with a class
width of 10. Make the lower limit of the first class 15.
To build the frequency distribution, we count the number of wait times that fall within
each class. Dividing each count by the total number of data points (18) gives the relative
frequency.
[15-24] = 4, [25-34] = 5, [35-44] = 4, [45-54] = 2, [55-64] = 2, [65-74] = 1
Class 1 = 4/18 = 0.22 = 22%
Class 2 = 5/18 = 0.28 = 28%
Class 3 = 4/18 = 0.22 = 22%
Class 4 = 2/18 = 0.11 = 11%
Class 5 = 2/18 = 0.11 = 11%
Class 6 = 1/18 = 0.06 = 6%
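The class counts can be reproduced with a minimal Python sketch. It assumes the six classes described in the question (lower limit 15, width 10) and the 18 wait times from part b; the variable names are our own.

```python
waits = [19, 20, 24, 24, 27, 33, 34, 34, 34, 36, 39, 43, 43, 54, 54, 55, 56, 74]

lower_limit = 15   # lower limit of the first class
width = 10         # class width
num_classes = 6

# Count how many wait times fall in each class: 15-24, 25-34, ..., 65-74
counts = [0] * num_classes
for w in waits:
    counts[(w - lower_limit) // width] += 1

print(counts)  # [4, 5, 4, 2, 2, 1]

# Show each class with its frequency and relative frequency
for i, count in enumerate(counts):
    low = lower_limit + i * width
    high = low + width - 1
    print(f"[{low}-{high}] frequency = {count}, relative frequency = {count / len(waits):.2f}")
```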
e) Develop a frequency histogram for the frequency distribution.
The chart above shows the frequency histogram. The horizontal axis shows the classes and the
vertical axis shows the number of wait times within each class. Looking at the graph, we can
see that the most common wait time for being seated is about 25-34 minutes.
f) Construct a box and whisker plot of these data.
- The graph above shows the waiting times in a box and whisker plot. The line in the
middle of the box shows the median, and the ends of the whiskers show the maximum and
minimum. Q1 is 26.25, Q2 is 35, and Q3 is 54. The lower limit is 19, and the upper limit
is 74.
g) The manager is considering giving a complimentary drink to
customers whose wait time is longer than the third quartile.
Determine the minimum number of minutes a customer would have
to wait to receive a complimentary drink
To find the minimum amount of time a customer would have to wait to receive a complimentary
drink, we use the third quartile, Q3, which is 54 minutes. Because the drink goes to
customers whose wait time is longer than the third quartile, a customer would have to wait
more than 54 minutes to receive a complimentary drink.
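The quartiles reported above can be reproduced with a short Python sketch. It assumes the "exclusive" quartile method (the one that matches Q1 = 26.25 for this sample; other quartile conventions give slightly different values) and the 18 wait times from part b.

```python
import statistics

waits = [19, 20, 24, 24, 27, 33, 34, 34, 34, 36, 39, 43, 43, 54, 54, 55, 56, 74]

# Quartiles using the exclusive method (assumed here to match the box plot values)
q1, q2, q3 = statistics.quantiles(waits, n=4, method="exclusive")
print(q1, q2, q3)  # 26.25 35.0 54.0

# Customers who waited longer than the third quartile would receive the drink
eligible = [w for w in waits if w > q3]
print(eligible)  # [55, 56, 74]
```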