Intro to Statistics Assignment 1
ECON 2210A
October 7th 2024
Muminur Rahman (Student ID: 101276022)

Question 1-16: Suppose a university conducts a survey to ask about potential fee increases for using the university’s recreational center. The university decides it will be easier to ask only graduating seniors about the proposed fee increase. What kind of bias is likely present in this data collection approach?

Response: The university’s data collection approach is an example of selection bias. Selection bias occurs when the group chosen for a sample differs systematically from the population it is meant to represent, so that part of the population has no voice in the data being collected. In this case, only graduating seniors are surveyed, while the students who will actually pay the fees in later years have no say. This is a problem for the university because the graduating seniors’ answers are likely to differ from those of students who are not graduating in the current year. The survey may therefore be less accurate than the university hopes, may lead to inaccurate conclusions, and may spark conflict with current students who plan to return to the university the following year.

Question 1-36: Give the name of the kind of sampling that was most likely used in each of the following cases:

a) A Wall Street Journal poll of 2,000 people to determine the president’s approval rating
ANS: This is an example of simple random sampling. This type of sampling selects items from a population so that every possible sample of a given size has an equal chance of being selected. Here, every person in the population has an equal chance of being one of the 2,000 people polled about the president’s approval rating.

b) A poll taken of each of the General Motors (GM) dealerships in Ohio in December to determine an estimate of the average number of Chevrolets not yet sold by GM dealerships in the United States
ANS: GM dealerships across the United States are divided into groups by state, and the dealerships in Ohio are polled in order to estimate the average number of Chevrolets not yet sold.
This can be viewed as cluster sampling, where a population is divided into groups (clusters), one or more clusters are selected (in this case, Ohio), and the items within each selected cluster (in this case, the dealerships and their counts of unsold Chevrolets) are surveyed in order to estimate the average for the whole population.

c) A quality-assurance procedure within a Frito-Lay manufacturing plant that tests every bag of Fritos Corn Chips produced to make sure the bag is sealed properly
ANS: The Frito-Lay manufacturing plant tests every bag of Fritos Corn Chips it produces. This is a census, because the entire set of measurements is collected: every bag in the population is checked rather than a sample of bags.

d) A sampling technique in which a random sample from each of the tax brackets is obtained by the Internal Revenue Service to audit tax returns
ANS: Each tax bracket is randomly sampled, group by group. Each tax bracket can be considered a subdivision of the population, or “stratum”. Since the Internal Revenue Service takes a random sample within each tax bracket, this is stratified random sampling, where a population is divided into strata based on a characteristic and a random sample is then drawn from within each stratum.

Question 1-42: For each of the following, indicate whether the data are cross-sectional or time-series:

a) Quarterly Unemployment Rates
ANS: Because the unemployment rate is collected every quarter, it is time-series data. The data are observed at successive points in time rather than at one single point.

b) Unemployment Rates by State
ANS: These data are observed at a single point in time and compare different groups (states) on the same variable. They are cross-sectional data.

c) Monthly Sales
ANS: Because monthly sales are observed at regular intervals in order to track sales over time, this is an example of time-series data.
d) Employment Satisfaction Data for a Company
ANS: Because the company is collecting data from its employees at a single point in time, and the groups are compared to one another on satisfaction, this is considered to be cross-sectional data. If the company collected employee satisfaction data repeatedly over time, it could be time-series data, but with the given information we can only assume that it is cross-sectional.

Question 1-50: As part of an economics research study, an analyst has accessed data compiled by the U.S. Bureau of Labor Statistics. The data are in a file named BLS County Data (source: www.ers.usda.gov). Consider the data in columns A–L, and indicate what level of data is represented by the variables in each column.

A) FIPS Codes
ANS: FIPS codes are numbers that identify geographic areas (states and counties) in the United States. The FIPS codes in column A do not measure anything and have no meaningful order; they are used only as identifiers for each area. Because FIPS codes don’t have any numerical meaning, column A is nominal-level data.

B/C) State, Area Name
ANS: Since both the state initials and the area name hold no numerically significant value and are not ranked, columns B and C are nominal-level data.

D/F) Rural Urban Continuum Code (2003/2013)
ANS: A rural urban continuum code is used to classify counties based on the amount of urbanization in their area. The numbers displayed in the column represent the level of urbanization and classify the county as either rural or urban.
Because the numbers are category labels rather than measurements, and arithmetic on them is not meaningful, columns D and F are treated here as nominal-level data. (If the codes are read as a ranking from most urban to most rural, they could also be argued to be ordinal.)

E/G) Urban Influence Codes
ANS: Like the rural urban continuum codes, the urban influence codes (UIC) are considered to be nominal data, because they are used to categorize a county based on urbanized areas rather than to measure a quantity.

H/L) Civilian Labour Force
ANS: The civilian labour force is the number of people within the area who are employed, or who are unemployed but actively looking for a job. It is best described as ratio-level data, because it is a quantitative count with a true zero point. A value of zero would mean that nobody in the population is employed or looking for work.

I) Employed
ANS: The employed column represents the number of people within the area who are currently employed. The numbers are meaningful quantities because they count employed people within a given area. The column also has a true zero point, where nobody is employed. The data are best described as ratio-level data.

J) Unemployed
ANS: Like the employed column, the unemployed column is best described as ratio-level data. The column represents the number of unemployed people who are still actively looking for work. The data have a true zero point, where nobody is considered unemployed.

K) Unemployment Rate
ANS: The unemployment rate shows the rate of unemployment within the area and is found by dividing the number of unemployed by the civilian labour force (expressed as a percentage). This column is also ratio-level data, because the rate has a true zero: it is zero when the number of unemployed is zero.

Question 2-20:

a) Using the 2^k ≥ n guideline, what is the minimum number of classes that should be used to display the data in the “Total” column in a grouped data frequency distribution?
ANS: In the “Total” column, there are 41 data values that need to be grouped. Using the 2^k ≥ n guideline, we must find the smallest k for which 2^k is greater than or equal to 41.
2^5 = 32, and 2^6 = 64
Since 2^5 = 32 is less than 41 and 2^6 = 64 is the smallest power of 2 that is at least 41, the minimum number of classes k should be 6.

b) Referring to part a, what should the class width be, assuming you round the width up to the nearest 1,000 passengers?
The class width is determined by dividing the range by the number of classes. To find the range, we find the maximum number of passengers in the “Total” column and subtract the minimum number of passengers from it.
Maximum amount: 602,708 by Southwest Airlines Co
Minimum amount: 160 by Caribbean Sun Airline Inc.
Range = (Max – Min) = 602,708 – 160 = 602,548 passengers
With this information, we divide the range by the number of classes, which is 6.
Width of each class = Range / Number of classes = 602,548 / 6 = 100,424.67
Therefore, rounding up to the nearest 1,000, the class width should be 101,000 passengers. Rounding up rather than down matters here: six classes of width 100,000 starting at 160 would end at 600,160 and would not cover the maximum of 602,708.

c) Construct and interpret a frequency histogram for the data
- In the frequency histogram, the range for each class is shown along the horizontal axis, while the vertical axis shows the number of airlines (the frequency) in each range. From the histogram, we can conclude that in December of last year, most airlines at the Orlando airport carried between 160 and 100,584 passengers.

Question 2-22:

A) Using Excel’s Insert Statistical Chart feature, construct a histogram of the coffee consumption data. Change the bin width to  and include data labels on the histogram. Add all appropriate titles. Briefly comment on what the histogram reveals concerning the data.
- In order to create a histogram with the given information, we need to know how many classes there should be. To find that, we can use the 2^k ≥ n guideline.
- There are 100 observations taken from Finnish coffee drinkers. 2^6 = 64 is less than 100, and 2^7 = 128 is the smallest power of 2 that is at least 100; therefore, there are 7 classes.
- The histogram above shows how frequently each level of coffee consumption occurs among the coffee drinkers. From the histogram, we can see that the most frequent amount of coffee consumed by Finnish coffee drinkers is around 11.1 - 13 kilograms.

B) Develop a relative frequency distribution and a cumulative relative frequency distribution of the coffee data using the same classes as the histogram. What percentage of the coffee drinkers sampled consume 13.1 kg or more annually?
To find the relative frequency for each class, we must know the number of classes, count the frequency in each class, then divide the frequency of the class by the total number of data points.
- Classes = 7
- Frequency for each class: [7.0 - 9.0] = 7, [9.1 - 11.0] = 19, [11.1 - 13.0] = 44, [13.1 - 15.0] = 25, [15.1 - 17.0] = 4, [17.1 - 19.0] = 1, [19.1 - 21.0] = 0
- Total number of data points: 100
- Relative frequency = (frequency of class) / (total number of data points)
o Class 1 [7.0 - 9.0] = 7 / 100 = 0.07 = 7%
o Class 2 [9.1 - 11.0] = 19 / 100 = 0.19 = 19%
o Class 3 [11.1 - 13.0] = 44 / 100 = 0.44 = 44%
o Class 4 [13.1 - 15.0] = 25 / 100 = 0.25 = 25%
o Class 5 [15.1 - 17.0] = 4 / 100 = 0.04 = 4%
o Class 6 [17.1 - 19.0] = 1 / 100 = 0.01 = 1%
o Class 7 [19.1 - 21.0] = 0 / 100 = 0 = 0%
- To create the cumulative frequency distribution, we add the frequency of each class to the sum of the frequencies of the classes before it.
o Since Class 1 has no class before it, the cumulative frequency is 7.
o Class 2 has a cumulative frequency of 7 + 19, which is 26.
o Class 3 has a cumulative frequency of 26 + 44, which is 70.
o Class 4 has a cumulative frequency of 70 + 25, which is 95.
o Class 5 has a cumulative frequency of 95 + 4, which is 99.
o Class 6 has a cumulative frequency of 99 + 1, which is 100.
o Since Class 7 has a frequency of 0, the cumulative frequency of Class 7 is also 100.
Because there are 100 data points in total, each cumulative frequency is also the cumulative relative frequency in percent (7%, 26%, 70%, 95%, 99%, 100%, 100%).
To answer the question of what percentage of coffee drinkers consumed 13.1 kg or more annually, the answer is 30%. This is found by adding the percentages of each class at 13.1 kg and above, which is Class 4 and up: 25% + 4% + 1% + 0% = 30%.

Question 2-42: A health insurance company selected a random sample of hospitals from each of four categories of hospitals: university related, religious related, community owned, and privately owned. At issue is the hospital charges associated with outpatient gallbladder surgery. The data are in the file called Hospitals.

1) Compute the average charge for each hospital category
To compute the average charge for each hospital category, we take the mean of the charges in that category: the sum of the charges divided by the number of hospitals.
Average charge = (sum of all charges for X hospitals) / (total number of X hospitals)
- The average charge for university related hospitals can be found by using =SUM(A2:A11)/10
Average charge of university related hospitals = 63980 / 10 = 6390
Therefore, the average charge for university related hospitals is $6390.
- The average charge for religious affiliated hospitals can be found by using =SUM(B2:B10)/9
Average charge of religious related hospitals = 32320 / 9 = 3591
Therefore, the average charge for religious hospitals comes to $3591.
- The average charge for municipally owned hospitals can be found by using =SUM(C2:C9)/8
Average charge of municipally owned hospitals = 36905 / 8 = 4613
Therefore, the average charge for municipally owned hospitals is $4613.
- The average charge for private hospitals can be found by using =SUM(D2:D10)/9
Average charge of privately owned hospitals = 46715 / 9 = 5191
Therefore, the average charge for privately owned hospitals is $5191.

2) Construct a bar chart showing the averages by hospital category.
- The graph above shows the average charge for each of the four hospital categories. The vertical axis shows the dollar amount of each average, and the horizontal axis labels each of the categories. From the graph, we can see that university hospitals charge the highest amount on average, whereas religious hospitals charge the lowest amount on average.

3) Discuss why a pie chart would not in this case be an appropriate graphical tool.
A pie chart displays values as percentages of a whole, which is useful when one total is divided among several categories. A pie chart is not useful in this situation because we are comparing the average values of 4 separate categories, and there is no single total amount of money being divided among them. The average charges cannot meaningfully be turned into percentages, because they do not come from one pool of money. A bar graph displays each average on its own scale and makes it easy to compare the values to one another.

Question 2-52: Johnson Oil and Gas owns a series of gasoline stations in northern Ohio. Below are data for July 1 retail gasoline prices at one of the stations and the price per gallon of propane.

1) Construct a line chart of regular grade gasoline for the years shown
- The graph above shows a line graph of the regular grade price per gallon. The vertical axis shows the price per gallon, and the horizontal axis shows the year. In the graph we can see an increasing slope, with a decreasing slope forming after the year 2015. We see a large decrease in price from 2020 to 2021. This decrease may have been caused by the Covid-19 pandemic, during which the demand for gasoline fell sharply.

2) Construct a line chart of propane for the years shown.
- The graph above shows the line graph of propane prices for each year. In this graph we can see a steady upward slope from 2002 to 2021.
The graph shows no visible disruptions to propane prices over the years, as they slope upward throughout.

3) Construct the appropriate chart for determining whether there is a relationship between gasoline and propane prices. Briefly comment on the nature of any relationship you believe your chart reveals.
- When comparing the two series, the best way to show the similarities and differences between them is a multi-line graph, which allows us to put the two lines into one single graph. In this graph we can see that both lines slope upward, with gasoline consistently having a higher price than propane up until 2020. From this we can see that changes in gasoline prices have no apparent effect on propane prices. We can also see that the Covid-19 pandemic affected only the gasoline prices and had no effect on the propane prices.

Question 2-52: As part of a study on its restaurant wait times, the manager of a Phoenix restaurant recently sampled 18 customers and recorded the time, in minutes, each was required to wait before being seated. The following sampled times were measured:

a) Compute the mean wait time for this sample of customers.
To find the mean of the sample data, we add all the values together, then divide the total by the number of values. This can be done by computing =AVERAGE(A2:A19)
- Σx = (39 + 54 + 24 + 36 + 34 + 54 + 43 + 55 + 33 + 19 + 20 + 74 + 56 + 43 + 24 + 34 + 27 + 34) = 703
n = 18
x̄ = Σx / n = 703 / 18 = 39.06
Therefore, the mean wait time is about 39.06 minutes.

b) Compute the median wait time for this sample of customers.
To find the median, we need to reorganize the set of values from smallest to largest.
(19, 20, 24, 24, 27, 33, 34, 34, 34, 36, 39, 43, 43, 54, 54, 55, 56, 74)
The median is the middle number in the ordered list. Because we have an even number of values (18), the median is found by adding the two middle values (the 9th and 10th) and taking the average of the two.
Median = (34 + 36) / 2 = 70 / 2 = 35
Therefore, the median wait time is 35 minutes. This can also be computed as =MEDIAN(A2:A19)

c) Compute the variance and standard deviation of wait times for this sample of customers.
To find the variance, we first need the total of all values, squared:
Σx = 703
(Σx)^2 = 703^2 = 494209
Next we need the sum of the squared values, Σx^2. This means we square every value in the list and add the results.
(19^2 + 20^2 + 24^2 + 24^2 + 27^2 + 33^2 + 34^2 + 34^2 + 34^2 + 36^2 + 39^2 + 43^2 + 43^2 + 54^2 + 54^2 + 55^2 + 56^2 + 74^2) = 31183
We can now plug our information into the equation for the sample variance:
s^2 = (Σx^2 − (Σx)^2 / n) / (n − 1) = (31183 − 494209 / 18) / (18 − 1) = 219.23
Therefore, the variance of the wait times is 219.23.
To find the standard deviation, we take the square root of the variance.
s = √219.23 = 14.8
Therefore, the standard deviation is 14.8 minutes.

d) Develop a frequency distribution using six classes, each with a class width of 10. Make the lower limit of the first class 15.
To build the frequency distribution, we count the number of observations in each class; dividing each count by the total number of data points (18) gives the relative frequency.
[15-24] = 4, [25-34] = 5, [35-44] = 4, [45-54] = 2, [55-64] = 2, [65-74] = 1
Class 1 = 4/18 = 0.22 = 22%
Class 2 = 5/18 = 0.28 = 28%
Class 3 = 4/18 = 0.22 = 22%
Class 4 = 2/18 = 0.11 = 11%
Class 5 = 2/18 = 0.11 = 11%
Class 6 = 1/18 = 0.06 = 6%

e) Develop a frequency histogram for the frequency distribution.
The frequency histogram is shown above. The horizontal axis shows the classes, and the vertical axis shows the number of observations within each class. Looking at the graph, we can see that the most common wait time for being seated is about 25-34 minutes.

f) Construct a box and whisker plot of these data.
- The graph above shows the waiting times within a box and whisker plot.
The vertical line in the middle of the box shows the median, and the ends of the whiskers show the minimum and maximum. Q1 is 26.25, Q2 (the median) is 35, and Q3 is 54. The lower limit (minimum) is 19, and the upper limit (maximum) is 74.

g) The manager is considering giving a complimentary drink to customers whose wait time is longer than the third quartile. Determine the minimum number of minutes a customer would have to wait to receive a complimentary drink
The third quartile, Q3, is the value below which about 75% of the wait times fall, so customers who wait longer than Q3 are among the longest-waiting quarter of customers. From the box and whisker plot, Q3 is 54 minutes. Therefore, a customer would have to wait more than 54 minutes to receive a complimentary drink.
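
As a cross-check on the wait-time calculations in parts a) to g), the short Python sketch below recomputes the mean, median, variance, standard deviation, class frequencies, and quartiles from the 18 sorted values listed in part b). It is not part of the Excel work above. The variable names are my own, and the quartiles assume the "exclusive" method of Python's statistics module, which appears to match the Q1 = 26.25 and Q3 = 54 read from the box and whisker plot.

```python
import statistics

# Wait times in minutes, copied from the sorted list in part b).
wait_times = [19, 20, 24, 24, 27, 33, 34, 34, 34, 36,
              39, 43, 43, 54, 54, 55, 56, 74]

n = len(wait_times)

# Parts a) and b): mean and median.
mean = statistics.mean(wait_times)
median = statistics.median(wait_times)

# Part c): sample variance using the shortcut formula
# s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1).
sum_x = sum(wait_times)
sum_x2 = sum(x * x for x in wait_times)
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
std_dev = variance ** 0.5

# Part d): six classes of width 10 starting at 15, i.e. 15-24, 25-34, ..., 65-74.
class_lows = range(15, 75, 10)
class_counts = [sum(1 for x in wait_times if low <= x <= low + 9)
                for low in class_lows]

# Parts f) and g): quartiles (assumption: exclusive method).
q1, q2, q3 = statistics.quantiles(wait_times, n=4, method="exclusive")

print(f"n = {n}, sum = {sum_x}, sum of squares = {sum_x2}")
print(f"mean = {mean:.2f}, median = {median}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
print(f"class counts = {class_counts}")
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```

If the quartile method is changed to "inclusive", Q1 would come out differently, so the choice of method matters when comparing against a chart or a textbook answer.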