Section A: Basic Introduction to Statistics

What do you think of when you see or hear the word statistics? Most people immediately think of numerical facts, data, graphs and tables. But statisticians do more than collect, classify and tabulate data; they also analyze data in order to make generalizations and decisions.

Why study statistics?
1) Everyone comes in contact with statistics in everyday life.
2) People should understand reports in newspapers, magazines and journals.
3) People should be able to question the statistics they read, and not blindly accept them as proven fact.
4) Many areas of study use statistics, such as psychology, sociology, business, biology, government, engineering, science education and even areas such as history, language and the arts.

Statistics is the science of collecting, organizing, summarizing, and analyzing data to draw conclusions or answer questions. It also provides a measure of confidence in any conclusions.

Two types of statistics:
1) Descriptive statistics: the use of numbers to summarize information which is known about some population. [collecting, organizing and summarizing the data]
2) Inferential statistics: the use of numbers related to a random sample from a population to give numerical information about the population itself. [analyzing the data to draw conclusions or answer questions about the population]

Example: Determine which of the following is an example of descriptive statistics and which is an example of inferential statistics.
a) The average weight of all football players on the NY Giants football team is 235 pounds.
b) The average yearly salary of a random sample of 150 minor league baseball players is $102,000. Therefore, the average yearly salary of all minor league baseball players is $102,000.

Probability is the measure of the likelihood that something occurs. It is very important in inferential statistics, where it is related to the risk of making an error.
A variable is a characteristic that describes a person, place or thing being studied.
Example: height, gender, weight, color

A raw score is an unaltered measurement obtained in a particular situation. It is the raw information from which statistics are created.

A distribution is a collection of raw scores.
Examples of distributions:

Population: all people or things being considered in a particular situation
EX:

A parameter is a numerical value that summarizes or describes the whole population
EX:

Sample: any portion (subset) of a population under consideration
EX:

A statistic is a numerical value that summarizes or describes a sample
EX:

Note: A parameter goes with a population and a statistic goes with a sample.

Examples:
1) To determine the average GPA of 500 students who just finished their first year in college, a group of 60 students is randomly selected. It is determined that the average GPA is 2.85.
   a) What is the population for this study?
   b) What constitutes the sample?
   c) Based on the sample, what is the statistic for the average GPA of the population?

2) Determine whether the number described is a parameter or statistic:
   a) In a recent survey of college graduates, 68% of those who responded said they had more than $50,000 in student loans.
   b) The average age of all the employees working at XYZ Company is 37 years.
   c) The average GPA of 250 randomly selected students at ABC University is 2.73.
   d) Of all the students attending Mercer County Community College in 2018, 66% were part time students.

Random Sample: A sample selected in such a way that every member of the population has the same probability of being selected for the sample. A sample chosen at random is meant to be an unbiased representation of the total population.

Note: the word “random” describes the process by which the sample is chosen. It does not guarantee that the sample will be representative, but it allows us to determine the probability that the sample is representative.
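The difference between a parameter and a statistic, and the idea of a random sample, can be illustrated with a short Python sketch. The GPA values below are simulated purely for illustration (they are not data from these notes), and the names population, sample, parameter and statistic are hypothetical variable names chosen for this example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: simulated GPAs for 500 students (not real data).
population = [round(random.uniform(1.5, 4.0), 2) for _ in range(500)]

# Simple random sample: every student has the same chance of being selected.
sample = random.sample(population, 60)

parameter = statistics.mean(population)  # summarizes the whole population
statistic = statistics.mean(sample)      # summarizes only the sample

print(f"N = {len(population)}, parameter (population mean GPA) = {parameter:.2f}")
print(f"n = {len(sample)}, statistic (sample mean GPA) = {statistic:.2f}")
```

The two means will usually be close but not equal. A different random sample would give a different statistic, while the parameter stays fixed.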
Consider the following:
Population: All students attending Mercer County Community College
Variable: Some measure of mathematical ability
Sample: Students leaving a section of calculus at MCCC.

This is not a random sample from the population of all students at MCCC. From this sample we should not attempt to infer anything about the mathematical ability of all students at MCCC.

Note: A bias in obtaining a sample will destroy the value of the statistical information obtained, since statistical inferences made from this information would be invalid. That is why it is important to use random samples when doing statistical analysis.

Example: Determine whether or not the sample given represents the given population accurately.
a) Population: All students attending Mercer County Community College.
   Sample: 100 students selected at random entering the student center at noon on Monday.
b) Population: All businesses in Mercer County.
   Sample: 75 businesses selected at random from a list of all businesses in Mercer County.

Why use a sample instead of a population?
a)
b)
c)
d)

Two types of variables
1) Qualitative or categorical variable: classification based on some attribute or characteristic of the individual (non-numerical)
   EX:
2) Quantitative variable: provides numerical measures of individuals

Two types of quantitative variables
1) Discrete: has either a finite number of possible values or a countable number of possible values (something that can be counted)
   EX:
2) Continuous: has an infinite number of possible values that are not countable (something that can be measured)
   EX:

Example: The following data set provides information about five college professors.
Name      Specialty     Gender   Age   Height (in)   # of years of teaching   Rank
Allen     Nursing       F        40    65            16                       Associate Professor
Backer    Accounting    M        59    71            34                       Full Professor
Hughes    Psychology    F        52    69            13                       Associate Professor
Ramirez   Mathematics   F        30    68            5                        Assistant Professor
Turner    Sociology     M        38    70            9                        Assistant Professor

Which variables are qualitative and which are quantitative variables?
Qualitative variables:
Quantitative variables:

Section A: Homework

1) Determine which of the following is an example of descriptive statistics and which is an example of inferential statistics.
   a) The average height of all faculty at MCCC is 5 feet 11 inches.
   b) The average IQ of a random sample of 100 students at MCCC is 105. Therefore, the average IQ of all MCCC students is approximately 105.

2) Determine whether the number described is a parameter or a statistic.
   a) The average height of all football players on the Eagles Football team in 2016 was 73.72 inches.
   b) In a survey of 1000 college students, 37% believe they will have difficulty finding a job in their major field after graduation.
   c) In a recent poll, the average age of the respondents was 43 years.
   d) The average number of hours all students at Ivy University study per day is 2.36 hours.

3) To determine the average typing speed of 700 students who just finished Typing 101, a group of 20 students is randomly selected. It is determined that the average typing speed is 47 words per minute.
   a) What is the population for this study?
   b) What constitutes the sample?
   c) Based on the sample, what is the statistic for the average typing speed of the population?

4) Determine whether or not the sample given represents the given population accurately.
   a) Population: All students in MAT125 this semester.
      Sample: Every 5th name from a list of all students in MAT125 this semester.
   b) Population: All residents of Mercer County.
      Sample: 50 people are selected at random of those who live in Hamilton Square.
   c) Population: All residents of New Jersey.
      Sample: Selection of names at random from all New Jersey residents.

5) Which are qualitative and which are quantitative? (If quantitative, state discrete or continuous.)
   a) The number of people attending a Trenton Thunder Baseball game.
   b) Your cell phone number.
   c) The seating capacity of a football stadium.
   d) The amount of electricity used by a household during a given month.
   e) The name of your favorite movie.
   f) The amount of time you wait to see a doctor.
   g) The number of red cars in the MCCC west student parking lot.

Section B: Measures of Center

Descriptive statistics consists of methods to organize and summarize data clearly and effectively. Organizing and summarizing the data is useful, since it helps the researcher see the important aspects of the data collected. One way to describe a data set is to find numerical summaries of the data. The first type of numerical summary is called measures of center; these values describe the center of the data.

Three Measures of Center:
1) Mean: balance, center of gravity, equal weight on each side of the mean
2) Median: cuts the data set in half, 50% on one side of the median and 50% on the other
3) Mode: most frequent value(s) in a distribution; used with both qualitative and quantitative variables

1) Mean: the point at which the data set would balance. It is affected by extreme values in the data set (nonresistant).

Notation:
Population mean: μ = ∑x / N
Sample mean: x̅ = ∑x / n
where N is the population size and n is the sample size
Note: ∑x means add up the data values

Example: 14 18 34 26 31 56 45 48 23

2) Median: The data must first be put in ascending order. The median is not affected by extreme values in the data set (resistant).
If the number of data values is odd, the median is the middle data value.
If the number of data values is even, the median is the sum of the middle two numbers divided by 2.
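The definitions of the mean and the median above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names mean and median are chosen for this example, the first list is the data from the mean example above, and the list with the value 560 is a hypothetical variation added to show what "nonresistant" and "resistant" mean.

```python
def mean(values):
    """Sum of the data values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)  # the data must be in ascending order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [14, 18, 34, 26, 31, 56, 45, 48, 23]  # data from the mean example above
print(f"mean = {mean(data):.2f}, median = {median(data)}")

# Hypothetical variation: replace 56 with an extreme value of 560.
extreme = [14, 18, 34, 26, 31, 560, 45, 48, 23]
print(f"mean = {mean(extreme):.2f}, median = {median(extreme)}")
```

In the second list the mean moves a long way toward the extreme value while the median does not change, which is the sense in which the median is resistant.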
Example:
Odd number of data values: 45 47 52 54 58 63 65 67 73 75 79
Even number of data values: 56 59 62 64 65 67 68 69 70 74

Comparing the mean and median to determine the shape of the distribution:
Right Skewed: mean > median
Left Skewed: mean < median
Symmetric: mean = median

Graphical representation: [figure: sketches of a left-skewed, a right-skewed, and a symmetric (bell-shaped) distribution]

Examples:
1) 3 4 5 6 7 8 9
2) 11 12 15 16 18 24 32
3) 1 2 8 9 10 11 12 13 14

3) Mode: The mode of a data set is the value that appears most frequently. If two or more values are tied for the most frequent, they are all considered to be modes. If the values all have the same frequency, we say that the data set has no mode.

Examples:
1) red red blue blue yellow yellow green red blue green blue blue
2) 45 46 34 36 53 55 54 54 32 36 64 49 50

More Examples:
1) Given the following set of numbers:
   209 214 220 224 224 229 239 241 245 246 247 249 255
   Find:
   a) n = ______
   b) ∑x = ______
   c) Mean = ______
   d) Median = ________
   e) Mode = _________
   f) Is the distribution right skewed, left skewed or symmetric? _______________________

2) Given the following set of numbers:
   10 12 14 15 17 21 22 23
   a) n = _________
   b) ∑x = ________
   c) x̅ = _______
   d) ∑(x − 16.75) = ____________
   e) If 60 is added to each data value of the above distribution, what will the mean of the resulting distribution equal? __________
   f) If each data value of the above distribution is multiplied by 5, what will the mean of the resulting distribution equal? ___________

3) Given the following table:

   Ice Cream Flavor   Frequency
   Chocolate          10
   Strawberry         7
   Vanilla            16

   What is the mode? ___________

Section B: Homework

1) Given the following set of numbers:
   22 28 36 42 42 42 52 53 54 55 56 57 58 59
   Find:
   a) n = ______
   b) ∑x = _______
   c) Mean = ________
   d) Median = _______
   e) Mode = _______
   f) Is the distribution right skewed, left skewed or symmetric?
   _________________

2) Given the following set of numbers:
   124 126 127 128 130 132 132 139 144 148 149 155 170
   Find:
   a) n = ______
   b) ∑x = _______
   c) Mean = ________
   d) Median = _______
   e) Mode = _______
   f) Is the distribution right skewed, left skewed or symmetric? _________________

3) Observations of cars with frequencies in a parking lot:

   Make of Car   Frequency
   Ford          23
   Chevrolet     30
   Kia           19
   Pontiac       2
   Buick         2
   Cadillac      3
   Mercury       4
   Lincoln       2
   Volkswagen    21

   What is the mode? _________________

4) Given the following set of numbers:
   2 3 4 5 5 6 10
   a) n = _________
   b) ∑x = ___________
   c) x̅ = ___________
   d) ∑(x − 5) = __________
   e) If 350 is added to each term of the above distribution, what will the mean of the resulting distribution equal? _________________

5) Given the following distribution of test scores:
   85 72 9 83 81 85 93 0 82 85
   a) Calculate the: mean = ___________, median = ____________, mode = ____________
   b) Which of the three averages seems most meaningful in this situation? ___________ Why?

6) Professors were observed wearing the following colored shirts on a particular day. Find the “average” of this distribution and state which measure of central tendency it is:

   Color    Frequency
   White    1
   Blue     9
   Yellow   4
   Red      2
   Pink     1

Section C: Measures of Spread

Another type of descriptive summary is called measures of spread or measures of variation. Measures of spread summarize the data in a way that shows how scattered the values are from each other and how much they differ from the mean value. Just as there are different measures of center, there are different measures of spread, such as range, variance, and standard deviation.

Note: two data sets can have the same mean, median or mode but be very different in their measure of spread. Therefore it is important to summarize the data using both a measure of center and a measure of spread.

1) The range of a data set is the difference between its largest value and its smallest value.
Range = largest value − smallest value

Because only the largest and smallest values are used to calculate the range, a great deal of information is ignored, so the range is not often used as a measure of spread when summarizing data.

2) Variance: a measure of how far, on average, the values in a data set are from the mean.

Deviation: the difference between a value and the mean, μ.
deviation = x − μ
If the deviation is positive, the value lies above the mean.
If the deviation is negative, the value lies below the mean.
The sum of the deviations equals zero: ∑(x − μ) = 0. Recall that the mean is the value where the data would balance, so it makes sense that the deviations on either side of the mean cancel each other out.

Formula for population variance:
σ² = ∑(x − μ)² / N = (∑x² / N) − μ²
where μ is the population mean and N is the population size

When the data values come from a sample rather than a population, the variance is called the sample variance. The procedure for computing the sample variance is a bit different from the one used to compute a population variance. In the formula, the population mean μ is replaced by the sample mean x̅ and the denominator is n − 1 (n is the sample size) instead of N (the population size). The sample variance is denoted by s².

Formula for sample variance:
s² = ∑(x − x̅)² / (n − 1) = [n∑x² − (∑x)²] / [n(n − 1)]
where x̅ is the sample mean and n is the sample size

When computing the sample variance, s², we use the sample mean, x̅, to compute the deviations. For the population variance, σ², we use the population mean, μ, for the deviations. It turns out that deviations computed using the sample mean tend to be a bit smaller than deviations computed using the population mean. If we were to divide by n when computing a sample variance, the value would tend to be a bit smaller than the population variance.
It can be shown mathematically that the appropriate correction is to divide the sum of the squared deviations by n − 1 rather than n.

Because the variance is computed using squared deviations, the units of the variance are the squared units of the data. In most situations, it is better to use a measure of spread that has the same units as the data. We do this simply by taking the square root of the variance. This quantity is called the standard deviation. The standard deviation of a population is denoted σ, and the standard deviation of a sample is denoted s.

population standard deviation: σ = √(σ²)
sample standard deviation: s = √(s²)

In other words, the computational formula for the sample standard deviation is as follows:
s = √{ [n∑x² − (∑x)²] / [n(n − 1)] }

Note: ∑x² ≠ (∑x)²
For example: Given: 1, 2, 3, 4
∑x = 1 + 2 + 3 + 4 = 10, so (∑x)² = 10² = 100
∑x² = 1² + 2² + 3² + 4² = 30, and 30 ≠ 100

1) For the following set of numbers:
   10 12 14 17 18 20
   Find:
   a) n = _______
   b) ∑x = ____________
   c) ∑x² = __________
   d) s² = _____________
   e) s = _____________

2) For the following set of numbers:
   45 34 29 31 54 42 37 32
   Find:
   a) n = _______
   b) ∑x = ____________
   c) ∑x² = __________
   d) s² = _____________
   e) s = _____________

3) For the following set of numbers:
   15 16 25 29 32 39 41 48 46 47
   Find:
   a) n = _______
   b) ∑x = ____________
   c) ∑x² = __________
   d) s² = _____________
   e) s = _____________
   f) Add 20 to each of the data values above. What is the new variance, s² = __________ What is the new standard deviation, s = __________
   g) Multiply each data value by 12. What is the new variance, s² = _____________ What is the new standard deviation, s = ______________

If a data set is approximately bell-shaped, the mean and standard deviation together can provide an approximate description of the data using the following rule:

The Empirical Rule: When a population has a histogram that is approximately bell-shaped, then
Approximately 68% of the data will be within one standard
deviation of the mean. (μ − σ, μ + σ) or (x̅ − s, x̅ + s)
Approximately 95% of the data will be within two standard deviations of the mean. (μ − 2σ, μ + 2σ) or (x̅ − 2s, x̅ + 2s)
Approximately 99.7% of the data will be within three standard deviations of the mean. (μ − 3σ, μ + 3σ) or (x̅ − 3s, x̅ + 3s)

Examples:
1) IQ scores are approximately bell-shaped with a mean of 100 and standard deviation of 15.
   a) Between what two values will approximately 95% of the IQ scores fall? __________
   b) About what percent of the IQ scores are between 85 and 115? __________
   c) About what percent of the IQ scores are between 55 and 145? __________

2) The heights of 2-year-old girls are approximately bell-shaped with a mean of 34 inches and a standard deviation of 2.5 inches.
   a) About what percent of the heights are between 29 and 39 inches? _________
   b) About what percent of the heights are between 31.5 and 36.5 inches? _________
   c) Between what two values will approximately 99.7% of the heights fall? ________

Section C: Homework

1) Given the following set of numbers:
   10 12 15 17 19 22
   Find:
   a) n = _______
   b) ∑x = ____________
   c) ∑x² = __________
   d) s² = _____________
   e) s = _____________
   f) Add 5 to each of the data values above. What is the new variance, s² = __________ What is the new standard deviation, s = ___________
   g) Multiply each data value by 10. What is the new variance, s² = _____________ What is the new standard deviation, s = ______________

2) For the following set of numbers:
   53 62 57 54 63 67 70
   Find:
   a) n = _______
   b) ∑x = ____________
   c) ∑x² = __________
   d) s² = _____________
   e) s = _____________
   f) Add 50 to each of the data values above. What is the new variance, s² = __________ What is the new standard deviation, s = __________
   g) Multiply each data value by 8. What is the new variance, s² = _____________ What is the new standard deviation, s = ______________

3) Last year the mean salary for professors in a particular community college was
$62,000 with a standard deviation of $2000. A new two-year contract is negotiated.
   a) In the first year of the contract, each professor receives a $1500 raise. Find the mean and standard deviation for the first year of the contract.
   b) In the second year of the contract, each professor receives a 3% raise based on their salary during the first year of the contract. Find the mean and the standard deviation for the second year of the contract.

4) The mean time a cell phone battery will hold its charge during moderate use is 650 minutes with a standard deviation of 75 minutes. Assume the data is approximately bell-shaped.
   a) Between what two values will approximately 95% of the data fall? __________
   b) About what percent of the data is between 575 and 725? __________
   c) About what percent of the data is between 425 and 875? _________

5) Monthly electric bills in New Jersey form a bell-shaped distribution with a mean of $109 and a standard deviation of $37.
   a) About what percent of the bills will be between $72 and $146? ___________
   b) Between what two values will approximately 99.7% of the bills fall? _________
   c) Between what two values will approximately 95% of the bills fall? ____________

Section D: Measures of Position

Another way to summarize a data set is to determine where data values lie within the data set. These measures of position are z-scores, percentiles, and quartiles.

z-scores indicate the location of a value with respect to the mean of a data set. The z-score of a value expresses how many standard deviations the value lies above or below the mean. In addition, z-scores provide a way to compare data sets which have different means and standard deviations.

A z-score is calculated using one of the following formulas, depending on whether you are using a population or a sample:
Population: z = (x − μ) / σ
Sample: z = (x − x̅) / s

If the z-score is positive (+), the value is above the mean.
If the z-score is negative (−), the value is below the mean.
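The z-score formula above translates directly into code. The sketch below is a minimal illustration: the function names z_score and value_from_z are hypothetical, and the numbers come from the IQ example in Section C (mean 100, standard deviation 15). The second function simply rearranges the formula to recover x from z.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def value_from_z(z, mean, sd):
    """Rearranged formula: the data value that corresponds to a given z-score."""
    return mean + z * sd


iq_mean, iq_sd = 100, 15  # values from the IQ example in Section C

print(z_score(130, iq_mean, iq_sd))        # positive, so 130 is above the mean
print(z_score(85, iq_mean, iq_sd))         # negative, so 85 is below the mean
print(value_from_z(-1.0, iq_mean, iq_sd))  # the score one standard deviation below the mean
```

The same two functions work for a sample if the sample mean and sample standard deviation are passed in place of the population values.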
The mean of a distribution of z-scores is equal to 0 and the standard deviation is equal to 1.
The mean of any distribution has a z-score of 0.

Solving the above equations for x:
x = μ + zσ or x = x̅ + zs

Examples:
1) Suppose you have a data set in which the population mean = μ = 45 and the population standard deviation = σ = 4.
   a) Find the z-score for 52.
   b) Find the z-score for 32.
   c) Find the x-value that corresponds to a z-score of −2.25.
   d) Find the x-value that corresponds to a z-score of 1.87.

2) Suppose you have a data set in which the sample mean = x̅ = 75 and the sample standard deviation = s = 10.6.
   a) Find the z-score for 57.
   b) Find the z-score for 95.
   c) Find the x-value that corresponds to a z-score of 2.65.
   d) Find the x-value that corresponds to a z-score of −2.38.

3) Before applying to colleges, Jimmy took both the SATs and the ACTs. He scored a 1250 on the SATs and a 28 on the ACTs. The mean and standard deviation for the SATs are 1059 and 210, respectively. The mean and standard deviation for the ACTs are 20.6 and 5.8, respectively. On which exam did he do relatively better? Why?

4) During the year, Jennifer ran in a full marathon as well as a half marathon. She ran the full marathon in 262 minutes and the half marathon in 129 minutes. The mean and standard deviation for the full marathon are 287 minutes and 45 minutes, respectively. The mean and standard deviation for the half marathon are 143 minutes and 18 minutes, respectively. In which marathon did she do relatively better? Why?

Percentiles indicate what percent of the values in the data set are below a particular data value. The kth percentile, denoted Pk, of a data set is a value such that k percent of the observations are less than or equal to the value.

Note: The median is at the 50th percentile, i.e. median = P50.

Example: Suppose 75 is at the 68th percentile (P68 = 75); this means that 68% of the data values are less than 75.
Suppose 62 is at the 42nd percentile (P42 = 62); this means that 42% of the data values are less than 62.

1) In a particular county, records indicated the assessed value of each of the 150,000 houses there. The following percentiles were obtained.
   P15 = $120,000   P50 = $175,000   P65 = $210,000   P90 = $255,000
   a) What percent of houses were assessed below $120,000? ________________
   b) What percent of houses were assessed below $255,000? ________________
   c) What percent of houses were assessed between $120,000 and $210,000? _______________
   d) What percent of houses were assessed above $210,000? ___________________
   e) What percent of houses were assessed above $120,000? _________________
   f) How many houses were assessed above $120,000? __________________
   g) What is the median value of the assessed houses? ____________________

2) Phil Latelist gave a test on the history of stamp collecting to a class of 400 students. Five of the test scores with the corresponding z-score and percentile are given in the table below:

   Test Score   Z-score   Percentile
   57           −2        10
   65           −1        25
   73           0         42
   77           0.5       50
   85           1.5       70

   Answer the following questions about all 400 test scores.
   a) What is the mean of the 400 scores? _________
   b) What is the median of the 400 scores? _________
   c) What is the standard deviation of the 400 scores? ___________
   d) What percent of the scores lie between 85 and 65? ___________
   e) How many scores lie above 57? __________

Quartiles are a special type of percentile. Quartiles are the 25th, 50th and 75th percentiles, denoted Q1, median, and Q3, respectively. Quartiles divide the data set into quarters, or in other words four parts. Quartiles are used to determine the shape of a distribution and to determine if the data set has what are called outliers or extreme values: data values that differ significantly from the other observations in the data set.

Example: Given the data set: 45 47 50 53 56 59 62 65 67 74 76
Find the quartiles.
A way to describe a data set using quartiles is called the five-number summary. The five-number summary consists of the minimum, Q1, median, Q3, and maximum, written in this order.
[min, Q1, median, Q3, max]

For the data set above find the five-number summary. ______________________________

Interquartile Range (IQR) is the difference between the third and first quartiles of a data set.
Note: The IQR is actually a measure of spread, since it is the range of the middle 50% of the observations.
IQR = Q3 − Q1
For the data set above, IQR = _______________

Using the quartiles and IQR, it can be determined whether or not the data set contains outliers. An outlier is a value that is considerably larger or smaller than most of the values in a data set.

Boundaries (fences) serve as cutoff points for determining outliers:

Possible outlier boundaries:
Lower Fence = LF = Q1 − 1.5(IQR)
Upper Fence = UF = Q3 + 1.5(IQR)

Extreme outlier boundaries:
Lower Lower Fence = LLF = Q1 − 3(IQR)
Upper Upper Fence = UUF = Q3 + 3(IQR)

Therefore any values that are between the lower lower fence and the lower fence, or between the upper fence and the upper upper fence, are considered possible outliers. Any values less than the lower lower fence or greater than the upper upper fence are considered extreme outliers.

Extreme outliers:  (−∞, LLF)
Possible outliers: (LLF, LF)
Possible outliers: (UF, UUF)
Extreme outliers:  (UUF, ∞)

Example 1a: Given the following data set:
14 34 38 43 45 47 53 54 55 56 58 85
Find the five-number summary, the IQR and determine if there are any outliers.
five-number summary _______________________
IQR = ______________
LF = ____________ UF = _______________
Outliers (if any) ____________

Modified Boxplots are a graphical display of quantitative data. Boxplots are created using the five-number summary and outliers, if any. Boxplots are useful for comparing two or more data sets.
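Before a modified boxplot can be drawn, the fences defined above are needed to flag outliers. The following Python sketch shows one way the calculation could go. It rests on two assumptions that are not stated in these notes: Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data (excluding the overall median when n is odd), and a value exactly equal to LLF or UUF is counted as a possible outlier. The data list and the function names are hypothetical.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Assumed convention: Q1 and Q3 are the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2:]  # skips the middle value when n is odd
    return median(lower_half), median(ordered), median(upper_half)


def find_outliers(values):
    """Return the fences and the possible and extreme outliers."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    lf, uf = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # possible outlier boundaries
    llf, uuf = q1 - 3 * iqr, q3 + 3 * iqr    # extreme outlier boundaries
    possible = [x for x in values if llf <= x < lf or uf < x <= uuf]
    extreme = [x for x in values if x < llf or x > uuf]
    return {"IQR": iqr, "LF": lf, "UF": uf, "LLF": llf, "UUF": uuf,
            "possible": possible, "extreme": extreme}


# Hypothetical data set, made up for this illustration.
data = [10, 20, 22, 23, 24, 25, 26, 27, 28, 29, 31, 60]
result = find_outliers(data)
print(result)
```

For this made-up data set, Q1 = 22.5 and Q3 = 28.5, so the fences are LF = 13.5, UF = 37.5, LLF = 4.5 and UUF = 46.5. The value 10 is flagged as a possible outlier and 60 as an extreme outlier. Other quartile conventions (for example, those used by some software packages) can give slightly different fences.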
You can also use a boxplot to identify the approximate shape of the distribution of a data set, especially for large data sets; histograms and stem-and-leaf plots are better graphical displays for small data sets.

To draw a boxplot:
Step 1) Draw lines at Q1, the median, and Q3 and draw a box using the lines.
Step 2) Put an asterisk at the outliers, if any.
Step 3) Draw the lower whisker out to either the minimum value, if there are no outliers, or to the smallest value that is not an outlier.
Step 4) Draw the upper whisker out to either the maximum value, if there are no outliers, or to the largest value that is not an outlier.

Example 1b: Draw a Modified Boxplot using the data in example 1a:

Example 2: Given the following data set:
67 68 78 79 80 81 83 85 86 87 88 89 90 91 92 93 95
Find the five-number summary, the IQR, determine if there are any outliers and draw a modified boxplot.
five-number summary ____________________
IQR = ___________
LF = ____________ UF = _____________
Outliers (if any) ___________
Modified boxplot:

Example 3: On a baseball team, the ages of each of the players are as follows:
19 24 24 25 25 25 26 26 26 26 27 27 27 28 28 31
Find the five-number summary, the IQR, determine if there are any outliers and draw a modified boxplot.
five-number summary _______________________
IQR = ____________
LF = _____________ UF = ____________
Outliers (if any) ______________
Modified boxplot:

4) The U.S. National Center for Health Statistics compiles data on the length of stay by patients in short-term hospitals and publishes its findings in Vital and Health Statistics. A random sample of 21 patients yielded the following data on length of stay, in days.
1 9 1 9 3 3 4 4 5 6 6 7 10 12 12 13 15 18 23 55 7
Find the five-number summary, the IQR, determine if there are any outliers and draw a modified boxplot.
five-number summary _______________________
IQR = ____________
LF = _____________ UF = ____________
Outliers (if any) ______________
Modified boxplot:

Section D: Homework

1) On a test, assume μ = 75 and σ = 5.
   a) If a person’s raw score is 68, find their z-score. ________________
   b) If a person’s raw score is 92, find their z-score. _________________
   c) If a person’s z-score is 2.2, find their test score. ________________
   d) If a person’s z-score is −1.8, find their test score. _______________

2) On a history test Patrick received a grade of 43 out of 50. The class mean was 35 with a standard deviation of 6. His girlfriend Kristin, in another section, received a grade of 87 out of 100. Her class mean was 75 with a standard deviation of 10. Who received the higher grade relative to their respective classes? Explain your answer.

3) A representative from an auto company stated that the mean weight of the cars his company produces is 3900 pounds with a standard deviation of 200 pounds. He went on to say that the lightest car had a z-score of −2.3 while the heaviest one had a z-score of 3.1. Find the weight of the lightest car and the heaviest car produced by the company.

4) In a particular county, records indicated the assessed value of each of the 200,000 houses there. The following percentiles were obtained.
   P20 = $110,000   P50 = $160,000   P75 = $220,000   P95 = $260,000
   a) What percent of houses were assessed below $110,000? ________________
   b) What percent of houses were assessed below $260,000? ________________
   c) What percent of houses were assessed between $110,000 and $220,000? _______________
   d) What percent of houses were assessed above $220,000? ___________________
   e) What percent of houses were assessed above $110,000? _________________
   f) How many houses were assessed above $110,000? __________________

5) The following set of numbers represents the grades of 20 students in an Elementary Statistics I class.
   52 72 74 74 76 78 79 80 82 83 84 85 85 86 87 87 88 90 91 103
   a) Find the five-number summary ___________________________
   b) IQR = _______________
   c) Lower Fence = ______________ Upper Fence = ___________________
   d) Outliers (if any): _______________________
   e) Construct a Modified Boxplot.

6) The following data set represents the heights of a random sample of 25 adults in inches.
   48 60 60 62 63 64 65 65 65 65 66 66 66 67 68 69 70 72 72 73 73 73 74 75 76
   a) Find the five-number summary ___________________________
   b) IQR = _______________
   c) Lower Fence = ______________ Upper Fence = ________________
   d) Outliers (if any): _______________________
   e) Construct a Modified Boxplot.

7) Dr. Stanley Thomas of Texas State University has collected information on millionaires. The ages of 36 millionaires are as follows:
   31 38 39 39 42 42 45 47 48 48 48 52 52 53 54 55 57 59
   60 61 64 64 66 66 67 68 68 69 71 71 74 75 77 79 79 79
   a) Find the five-number summary ___________________________
   b) IQR = _______________
   c) Lower Fence = ______________ Upper Fence = ___________________
   d) Outliers (if any): _______________________
   e) Construct a Modified Boxplot.

8) The following data is the age of U.S. Presidents on inauguration day. The data has been put in ascending order. There are 46 values.
   42.88 43.65 46.42 46.85 47.46 47.96 48.28 49.29 49.33 50.5
   51.02 51.08 51.09 51.47 51.96 52.05 52.3 54.09 54.24 54.41
   54.54 54.56 55.24 55.33 55.54 55.96 56.03 56.18 56.29 57.18
   57.65 57.89 57.97 58.85 60.93 61.07 61.34 61.97 62.27 64.27
   64.61 65.86 68.06 69.96 70.6 78.17
   a) Find the five-number summary ___________________________
   b) IQR = _______________
   c) Lower Fence = ______________ Upper Fence = _______________
   d) Outliers (if any): _______________________
   e) Construct a Modified Boxplot.

9) A particular statistics test was taken by 200 students. A few of the test scores are given in the following table along with their corresponding z-scores and percentiles.
   Test Score   z-score   Percentile
       66        −1.5         20
       69        −1           30
       75         0           42
       81         1           50
       90         2.5         85
       96         3.5         99

   From only this table, answer the following questions about the entire distribution of 200 scores:
   a) What is the mean of this distribution? _______________
   b) What is the median of this distribution? _______________
   c) What is the standard deviation of the distribution? _______________
   d) What is the variance of the distribution? _______________
   e) What percent of scores are between the test scores 69 and 90? _______________
   f) How many scores are above 96? _______________
   g) Is the distribution right skewed, left skewed or symmetric? _______________

10) The following are summary statistics for the final exam scores for students in Psychology 101 during the spring semester.

   Mean = 68                 Median = 65
   First Quartile = 57       Mode = 72
   Third Quartile = 84       Standard deviation = 2
   60th Percentile = 73      P46 = 62

   a) What final exam score did half the students’ scores surpass? ______________
   b) What is the most common final exam score? ______________
   c) About what percent of the final exam scores are below 62? ______________
   d) What percent of the final exam scores are above 73? ______________
   e) What final exam score is 1.25 standard deviations above the mean? ______________
   f) About what percent of final exam scores are above 84? ______________
   g) What final exam score is 2.5 standard deviations below the mean? ______________
   h) Suppose the final exam scores have a distribution that is bell-shaped. About what percent of the final exam scores will be between 66 and 70? ______________

Section E Summarizing Qualitative Data

In Section A, two types of variables were discussed: qualitative (categorical) and quantitative. In this section, you will learn how to organize and summarize qualitative data using tables and graphs.

The frequency of a category is the number of times it occurs in the data set.

A frequency distribution lists each category of data and the number of occurrences for each category of data.
The relative frequency is the ratio (proportion or fraction) of the frequency of each category to the total frequency. It is found by

   relative frequency = frequency / (sum of all frequencies)

A relative frequency distribution lists each category of data together with its relative frequency.

Bar graphs and pie charts are used to represent qualitative data graphically.

Examples

1) The class levels of 25 students in an elementary statistics course are as follows:

   Fr So So Jr Jr Jr Sr Fr So Jr So So Jr
   Sr Sr So Jr So So Jr Sr Jr So So Sr

   a) Construct a frequency distribution.
   b) Add a relative frequency column to the frequency distribution.
   c) What percent of the data are sophomores?
   d) What is the mode?
   e) How many students are juniors?
   f) Construct a bar graph using the frequency of the data.
   g) Construct a pie chart using the relative frequency of the data.

   a and b)
   Class Level   Frequency   Relative Frequency

2) A researcher evaluated the taste of four leading brands of instant coffee by having a sample of 80 individuals taste each coffee and then select their favorite. The results are given:

   A D B C D B B D B C A B B B C C D B D B
   A A D B D B A B B C B D C A B B D D A B
   B A B B B A B C C D C B A C B D D B A B
   A B D B B A D D D B C D B C B B B B D B

   a) Construct a frequency table.
   b) Add a relative frequency column to the distribution you constructed in part (a). Round answers to two decimal places.

   Brand   Frequency   Relative Frequency

   c) What percent of the people chose Brand C as their favorite?
   d) How many people chose Brand A as their favorite?
   e) What percent of the people chose Brand B or Brand D as their favorite?
   f) What is the mode?
   g) Construct a bar graph using the frequency of the data.
   h) Construct a pie chart using the relative frequency of the data.

Section E: Homework

1) The following are the eye colors of a random sample of 35 patients who go to Hamilton Eye Care Associates.
   blue hazel green green brown brown brown
   blue green hazel green green brown brown
   blue hazel brown green brown green green
   blue brown brown brown blue green brown
   hazel green green brown brown blue hazel

   a) Construct a frequency table.
   b) Add a relative frequency column to the distribution you constructed in part (a). Round answers to two decimal places.

   Category   Frequency   Relative Frequency

   c) What is the mode?
   d) How many patients have blue eyes?
   e) What percent of the patients have hazel eyes?
   f) What percent of the patients have brown or green eyes?
   g) Construct a bar graph.
   h) Construct a pie chart.

2) The network data for the Top 40 TV shows of all time by IGN Entertainment are as follows:

   HBO, HBO, HBO, NBC, HBO, NBC, BBC, NBC,
   CBS, CBS, NBC, FOX, AMC, FOX, NBC, FOX,
   NBC, CBS, NBC, COMEDY CENTRAL, COMEDY CENTRAL, HBO, ABC, CBS,
   AMC, NBC, HBO, FOX, ABC, SCI-FI, NBC, BBC,
   FOX, NBC, COMEDY CENTRAL, PBS, ABC, THE WB, HBO, PBS

   a) Construct a frequency table.
   b) Add a relative frequency column to the distribution you constructed in part (a). Round answers to three decimal places.

   Category   Frequency   Relative Frequency

   c) How many of the Top 40 TV shows of all time were on NBC?
   d) What percent of the Top 40 TV shows of all time were on HBO?
   e) How many of the Top 40 TV shows of all time were on ABC, NBC or CBS?
   f) What percent of the Top 40 TV shows of all time were on FOX or COMEDY CENTRAL?
   g) What is the mode?

Section F Summarizing Quantitative Data

Recall that there are two types of quantitative data: discrete (countable) and continuous (measurable). In this section, you will learn how to organize and summarize the two types of quantitative data using tables and graphs.

Classes are distinct data values or intervals of equal width that cover all the values in a data set.

Organizing Discrete Data in a Table: use the values of the discrete variable to create classes when the number of distinct data values is small.
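
As an optional illustration of the discrete case just described, the sketch below builds a frequency table in Python, using each distinct value as its own class and dividing each frequency by the sum of all frequencies to get the relative frequency. The function name `frequency_table` and the data set `pets` are hypothetical and are not taken from the text.

```python
from collections import Counter


def frequency_table(values):
    """Return (class, frequency, relative frequency) rows, sorted by class.

    Each distinct value is treated as its own class, which suits discrete
    data with only a few distinct values.
    """
    counts = Counter(values)
    total = sum(counts.values())  # sum of all frequencies
    return [(cls, freq, freq / total) for cls, freq in sorted(counts.items())]


# Hypothetical data (not from the text): number of pets in 10 households.
pets = [0, 1, 1, 2, 0, 1, 3, 1, 2, 1]

print(f"{'Class':>5} {'Frequency':>10} {'Relative Frequency':>19}")
for cls, freq, rel in frequency_table(pets):
    print(f"{cls:>5} {freq:>10} {rel:>19.3f}")
```

The relative frequencies in such a table should add up to 1 (apart from rounding), which is a quick way to check a table completed by hand.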
Organizing Discrete/Continuous Data in a Table: when a data set consists of a large number of different discrete data values, or when a data set consists of continuous data, we must create classes by using intervals of numbers. The width of each class (interval) must be the same.

Lower class limit: the smallest value that can go in the class.
Upper class limit: the smallest value that can go in the next higher class; the upper class limit of a class is the same as the lower class limit of the next higher class.
Class width: the difference between the upper and lower class limits.

To make creating frequency tables easier, we will use the symbol ⪪, which means “up to but not including”. For example, an interval written 55 ⪪ 65 would contain data values from 55 up to but not including 65.

The frequency of a class is the number of observations in the class.

A frequency distribution lists each class together with its frequency.

The relative frequency is the ratio (proportion or fraction) of the frequency of each class to the total frequency. It is found by

   relative frequency = frequency / (sum of all frequencies)

A relative frequency distribution lists each class together with its relative frequency.

A histogram is a graphical display of a quantitative frequency table. It is constructed by drawing a rectangle for each class of data on the xy-coordinate system. The x-axis shows the class limits and the y-axis shows the frequency or relative frequency of each class. All rectangles have the same width, and adjacent rectangles touch.

Identifying the Shape of a Distribution Using a Histogram

   Uniform
   Symmetric (bell-shaped)
   Right skewed
   Left skewed

Mode: the peak or high point of a histogram. Unimodal: one mode. Bimodal: two modes.

Examples

1) A researcher with A.C. Nielsen wanted to determine the number of televisions in households. He conducted a survey of 40 randomly selected households and obtained the following data.
   1 1 4 2 3 3 5 1 1 2 2 4 1 1 0 3 1 2 2 1
   3 1 1 3 2 3 2 2 1 2 3 2 1 2 2 2 1 3 1 3

   a) Construct a frequency table.
   b) Add a relative frequency column to the frequency table you constructed in part (a). Round answers to three decimal places.

   Class   Frequency   Relative Frequency

   c) How many households have at least 3 televisions?
   d) What percent of households have 1 television?
   e) Construct a frequency histogram of the data.
   f) Describe the shape of the distribution.

2) The Jefferson National Bank has five tellers available to serve customers. The data in the following table provide the number of busy tellers observed at 30 spot checks.

   5 3 4 5 4 2 1 4 5 3 5 4 1 5 5
   0 5 4 5 4 3 4 2 3 0 2 1 4 2 3

   a) Construct a frequency table.
   b) Add a relative frequency column to the frequency table you constructed in part (a). Round answers to three decimal places.

   Class   Frequency   Relative Frequency

   c) How many times are more than 4 tellers busy?
   d) What percent of the time are fewer than 3 tellers busy?
   e) Construct a histogram of the data.
   f) Describe the shape of the distribution.

3) The exam scores for the 25 students in an introductory statistics class are as follows:

   34 39 54 58 60 63 64 67 70 75 77 78 76
   81 82 84 85 86 88 89 89 90 96 96 99

   a) Construct a frequency table. (Start with 30 ⪪ 40.)
   b) Add a relative frequency column to the frequency table you constructed in part (a). Round answers to two decimal places.

   Interval   Frequency   Relative Frequency
   30 ⪪ 40

   c) How many students had exam scores between 70 and 90, including 70 but not including 90?
   d) What percent of the students had exam scores less than 60?
   e) Construct a frequency histogram of the data.
   f) Describe the shape of the distribution.

4) The Food and Nutrition Board of the National Academy of Sciences states that the recommended daily allowance of iron is 18 mg for adult females under the age of 51. The amounts of iron intake, in milligrams, during a 24-hour period for a sample of 45 such females follow.
    9.1  12.5  14.4  16.0  18.1
    9.4  12.6  14.5  16.3  18.1
   10.7  12.8  14.6  16.3  18.2
   10.9  13.1  14.6  16.4  18.3
   11.0  13.4  14.7  16.6  18.3
   11.5  13.6  15.0  16.6  18.6
   11.8  13.6  15.1  16.8  19.5
   12.2  13.8  15.3  17.0  19.8
   12.3  14.2  15.6  17.3  20.7

   a) Construct a frequency table, using 6 classes starting with a value of 9.
   b) Add a relative frequency column to the frequency table you constructed in part (a). Round answers to three decimal places.

   Interval   Frequency   Relative Frequency

   c) How many females had an iron intake of at least 15 milligrams?
   d) What percent of the females had an iron intake of between 9 and 17, including 9 but not including 17?
   e) Construct a frequency histogram of the data.
   f) Describe the shape of the distribution.

Section F: Homework

1) An anthropologist takes a random sample of 30 households and finds the following number of people living in each household.

   3 4 1 2 5 6 4 4 2 2 4 4 3 5 4
   6 5 3 4 4 5 4 3 4 4 3 5 4 3 6

   a) Construct a frequency table.
   b) Add a relative frequency column to the frequency table you constructed in part (a). (Round to two decimal places.)

   Class   Frequency   Relative Frequency

   c) How many households contain 4 people?
   d) What percent of the households contain between 2 and 5 people, inclusive?
   e) Construct a frequency histogram.
   f) Describe the shape of the distribution.

2) The following data set is a random sample of 42 games with the total number of runs scored in the game over the course of a softball season.

   4 5 0 1 3 4 7 2 1 8 4 3 4 6
   5 3 3 5 6 4 5 5 6 3 0 2 4 1
   3 2 4 3 4 5 3 4 4 3 2 5 4 4

   a) Construct a frequency table.
   b) Add a relative frequency column to the frequency table you constructed in part (a). (Round to two decimal places.)

   Class   Frequency   Relative Frequency

   c) How many games had 3 total runs scored?
   d) What percent of the games had a total of 4 runs scored?
   e) Construct a frequency histogram.
   f) Describe the shape of the distribution.

3) Lisa Hertscar, a civil engineer, needs to determine if a traffic light needs to replace a stop sign at a particular intersection. She keeps track of the number of cars that enter the intersection at randomly chosen times of the day between 8am and 10pm over the course of 60 days. The results are as follows:

   65 15 23 72 20 56 32 55 52 27 51 35 47 63 36
   38 26 56 46 52 48 62 33 70 57 44 47 43 41 46
   38 53 51 45 57 62 60 21 47 55 53 46 37 43 56
   58 66 46 49 68 32 55 49 42 68 53 46 57 52 57

   a) Construct a frequency table. (Start with 15 ⪪ 25.)
   b) Add a relative frequency column to the frequency table you constructed in part (a). (Round to two decimal places.)

   Interval   Frequency   Relative Frequency
   15 ⪪ 25

   c) On how many of the 60 days were there more than 35 cars at the intersection?
   d) Construct a frequency histogram.
   e) Describe the shape of the distribution.
   f) Do you think there needs to be a traffic light at the intersection? Why?

4) The following data show the number of minutes a random sample of 50 patrons needed to wait to renew their license at the Bakers Basin location.

   20.3 13.5 55.2 18.6 65.7 37.3 38.7 41.3 43.5 32.7
   20.7 27.4 25.1 27.3 37.2 35.9 27.4 47.3 55.8 40.3
   46.2 38.1 53.2 57.2 63.0 51.5 40.7 36.5 19.5 26.4
   32.5 23.2 53.2 47.2 43.9 55.6 65.2 31.7 42.6 44.7
   53.8 35.2 25.7 31.3 47.3 42.3 32.5 22.7 42.7 37.2

   a) Construct a frequency table. (Start with 10 ⪪ 20.)
   b) Add a relative frequency column to the frequency table you constructed in part (a). (Round to two decimal places.)

   Interval   Frequency   Relative Frequency
   10 ⪪ 20

   c) How many people waited between 40 and 60 minutes, including 40 but not including 60?
   d) What percent of the people waited between 10 and 30 minutes, including 10 but not including 30?
   e) Construct a histogram.
   f) Describe the shape of the distribution.

5) The following histogram was the result of a study done by Justin Time to determine the time it took students to write a computer program and run it successfully.
   a) How many students participated in Justin’s study?
   b) How many students in the study took 5.5 or more hours to write and successfully run their program?
   c) What percent of students in the study wrote and successfully ran their programs in less than 4.5 hours?
   d) What percent of students took from 3.5 hours up to but not including 6.5 hours to write and successfully run their programs?
   e) In which of the five intervals would the median time be?

Section G Summarizing Quantitative Data (Continued)

Stem-and-leaf plots: a simple way to display small data sets (similar to a histogram). Each number is split into a stem part and a leaf part. The stem can contain any number of place values, but the leaf can contain only one place value, usually the smallest place value in the number.

1) The following table presents the daily high temperatures for West Windsor Township, NJ, in degrees Fahrenheit, for the winter months of January and February, 2018.

   19 41 37 51 24 55 47 46 30 52 34 52 27 55 38
   56 17 61 47 61 15 45 65 18 36 65 32 40 39 44
   58 53 42 54 63 56 45 60 65 38 42 63 31 45 26
   44 49 32 37 71 42 32 78 35 38 57 32 38 44

   a) Construct a stem-and-leaf plot. What is the shape of the distribution? ___________________
   b) Repeat part (a), but split the stems, using two lines for each stem.

2) A pediatrician who tested the cholesterol levels of several young patients was alarmed to find that many had levels over 200 mg per 100 mL. The readings of 20 patients with high levels are presented in the following table. Construct a stem-and-leaf plot of the data and describe the shape of the distribution.

   220 217 209 165 212 210 208 223 202 235
   218 221 196 213 214 210 188 199 210 208

3) The Food and Nutrition Board of the National Academy of Sciences states that the recommended daily allowance of iron is 18 mg for adult females under the age of 51. The amounts of iron intake, in milligrams, during a 24-hour period for a sample of 45 such females follow.
Construct a stem-and-leaf plot of the data and describe the shape of the distribution.

    6.3  12.1  14.4  16.0  18.1
    9.4  12.4  14.5  16.3  18.1
   10.7  12.5  14.6  16.3  18.2
   10.9  12.5  14.6  16.4  18.3
   11.0  12.5  14.7  16.6  18.3
   11.5  12.6  15.0  16.6  18.6
   11.5  12.7  15.0  16.8  19.5
   11.6  12.8  15.3  17.0  19.8
   11.9  13.1  15.6  17.3  20.7

Back-to-back stem-and-leaf plot: used to compare two data sets.

Example: Following are the running times (in minutes) for the 15 top-grossing movies rated G or PG and the 15 top-grossing movies rated R of all time, as of August 2018.

   Movies Rated G or PG                          Running Time
   Incredibles 2                                     118
   Beauty and The Beast (2017)                       129
   Finding Dory                                      103
   Star Wars: Episode I – The Phantom Menace         133
   Star Wars                                         121
   Shrek 2                                            93
   E.T.: The Extra-Terrestrial                       117
   The Lion King                                      89
   Toy Story 3                                       103
   Frozen                                            108
   Finding Nemo                                      104
   The Secret Life of Pets                            90
   Despicable Me 2                                    98
   The Jungle Book (2016)                            105
   Inside Out                                         94

   Movies Rated R                                Running Time
   The Passion of Christ                             126
   Deadpool                                          106
   American Sniper                                   132
   It                                                135
   Deadpool 2                                        119
   The Matrix Reloaded                               138
   The Hangover                                       96
   The Hangover Part II                              102
   Beverly Hills Cop                                 105
   The Exorcist                                      122
   Logan                                             135
   Ted                                               106
   Saving Private Ryan                               170
   300                                               117
   Wedding Crashers                                  113

   a) Construct a back-to-back stem-and-leaf plot for these data sets.
   b) Do the running times of R-rated movies differ greatly from the running times of movies rated G or PG, or are they roughly similar?

Section G: Homework

1) The exam scores for the students in an introductory statistics class are as follows:

   34 39 63 64 67 70 75 76 81 82 84
   85 86 88 89 89 90 96 96 99 102

   Construct a stem-and-leaf plot and describe the shape of the distribution.

2) Construct a stem-and-leaf plot for the following data and describe the shape of the distribution.

   56 31 42 34 78 16 78 98 19 4 96 25 27
   53 31 17 21 50 25 6 37 45 49 92 47 54
   103 23 38 48 47 24 18 58 94 77 47

3) A soft-drink bottler sells “one-liter” bottles of soda. A consumer group is concerned that the bottler may be shortchanging customers.
Thirty bottles of soda are randomly selected. The contents, in milliliters, of the bottles chosen are shown below.

   1025  986 1006  977  963 1030 1018 1010  991  975
    988  999  977 1028  997  990  989  996  986 1001
   1014 1004  984  993 1031  974  995  964 1017  987

   Construct a stem-and-leaf plot and describe the shape of the distribution. Is the bottler shortchanging customers?

4) A sample of 35 liberal-arts graduates yielded the following starting annual salaries. Data are in thousands of dollars, rounded to the nearest hundred dollars.

   49.0 45.8 50.3 49.6 50.0 47.7 51.8
   47.3 46.7 47.0 48.1 50.1 43.6 48.0
   47.7 49.8 46.4 46.1 48.5 48.9 48.2
   48.1 46.2 47.3 51.7 49.0 48.2 49.9
   48.1 49.8 49.5 50.4 45.3 45.3 46.5

   Construct a stem-and-leaf plot and describe the shape of the distribution.

5) The following back-to-back stem-and-leaf plot represents the results of two random samples obtained by Millie Gramm. The first random sample consisted of weights of carry-on luggage used by business travelers at an airport. The second random sample consisted of weights of carry-on luggage used by non-business travelers at the same airport.

   Business: 69 799 44468 23699 178 24
   Stem: 0 1 2 3 4 5
   Non-Business: 89 2358 11347889 022246

   a) What is the mode for the business travelers? ___________
   b) What is the mode for the non-business travelers? ___________
   c) Which group is more symmetrical? ___________

Section H Misleading Graphs

Statistical graphs, when properly used, are powerful forms of communication. Unfortunately, when graphs are improperly used, they can misrepresent the data and lead people to draw incorrect conclusions.

Three of the most common forms of misrepresentation:
   1) Incorrect position of the vertical scale
   2) Incorrect sizing of graphical images
   3) Misleading perspective for three-dimensional diagrams

1) The baseline of a graph or plot is the value at which the horizontal axis intersects with the vertical axis.
With graphs or plots that represent how much or how many of something, it may be misleading if the baseline is not at zero.

[Figure: two bar graphs titled “Average Cost of a House Per Year” for the years 2016 to 2019 (x-axis: Year, y-axis: Average Cost). The vertical scale of the first graph runs from 0 to 450000; the vertical scale of the second runs from 345000 to 390000.]

2) Area Principle: When amounts are compared by constructing an image for each amount, the areas of the images must be proportional to the amounts. For example, if one amount is twice as much as another, its image should have twice as much area as the other image.

The average sales price of a house in 1990 was $149,800, and by 2020 the average sales price of a house had risen to $389,400. Note that the price in 2020 is about 2.6 times the price in 1990.

[Figure: pictograph titled “Average Sales Price of a House” with one image for 1990 and one for 2020.]

3) 3-D graphs are often drawn as though the reader is looking down on them. This makes the bars look shorter than they really are.

[Figure: three-dimensional bar graph titled “Average Sales Price of a House” for 1990 and 2020, with a vertical scale from 0 to 400000.]

Section H: Homework

1) The following graphs represent the number of people who purchased a particular item online from Jennifer’s Jewelry store during the years 2018, 2019 and 2020. Which graph is misleading? Why?

[Figure: Graph 1 and Graph 2, two bar graphs of Number Sold for 2018, 2019 and 2020. The vertical scale of Graph 1 runs from 0 to 500; the vertical scale of Graph 2 runs from 430 to 500.]

2) The number of girls who played softball in 2020 has tripled from the number of girls who played softball in 2010.
   a) Does the pictograph below present this information accurately?
   b) Why or why not?

[Figure: pictograph with one image for 2010 and one for 2020.]

3) Explain why 3-dimensional graphs can be misleading.
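
As an optional numeric check on the area principle from this section, the sketch below uses the two house prices quoted in the notes. It assumes the image is enlarged by the same factor in both width and height, so the factor that keeps area proportional to the amount is the square root of the amount ratio. The function name `linear_scale` is hypothetical.

```python
import math


def linear_scale(amount_ratio):
    """Factor to apply to BOTH width and height so that area grows by amount_ratio."""
    return math.sqrt(amount_ratio)


# Prices quoted in the text: $149,800 in 1990 and $389,400 in 2020.
ratio = 389_400 / 149_800
side = linear_scale(ratio)

print(f"amount ratio:                       {ratio:.2f}")       # about 2.6, as in the text
print(f"correct scale for width and height: {side:.2f}")
print(f"area ratio if each side is x ratio: {ratio ** 2:.2f}")  # the misleading version
```

Scaling both the width and the height by 2.6 would make the 2020 image cover nearly seven times the area of the 1990 image, which overstates the increase; scaling each side by about 1.6 keeps the areas in the ratio 2.6 to 1.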