Math 123- Statistics Chapter 2 Notes w/ Alpaca Data Name_______________________________

2.1 Frequency Distributions and Their Graphs

Def- A frequency distribution is a table that shows classes or intervals of data with a count of the number of entries in each class.

Def- The frequency of a class is the number of data entries in the class.

Ex-
Class Limits   1–4   5–8   9–12   13–16
Frequency      2     4     1      7

The lower class limits are __________________.
The upper class limits are __________________.
The class width is _____.
The range is _____.

To Construct a Frequency Distribution:
1. Decide on the number of classes to use.
2. Find the class width by determining the range of the data set and dividing this by the number of classes, then round up to the next whole number.
3. Find the class limits.
4. Make tally marks for each data entry.
5. Count tally marks to find the frequency.

Ex- Construct a frequency distribution (frequency table) for the length of gestation using the alpaca data. Use four classes.

Class Limits | Class Boundaries | Frequency | Midpoint | Relative Frequency | Cumulative Frequency

Def- The midpoint of a class is the average of the lower and upper class limits.
Midpoint=

Def- The relative frequency is the portion (or percent) of data that falls into that class.
Relative frequency=

Def- The cumulative frequency of a class is the sum of the frequencies for that class and all previous classes.

Ex- For the length of gestation using the alpaca data, find the midpoints, relative frequency, and cumulative frequency. (Table is on previous page.)

Def- A frequency histogram is a bar graph that represents the frequency distribution of a data set with the following properties:
1. The horizontal scale is quantitative and measures the data values.
2. The vertical scale measures the frequency.
3. Consecutive bars must touch.

Def- Class boundaries are the numbers that separate classes without forming gaps between them. To find an upper class boundary, add 0.5 to the upper class limit.
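The definitions of midpoint, class boundary, relative frequency, and cumulative frequency can be checked on the small example table from the start of this section (classes 1–4, 5–8, 9–12, 13–16 with frequencies 2, 4, 1, 7). The sketch below is only an illustration: the function name `summarize` is made up for this example, and it assumes whole-number data, so each lower boundary is taken to be 0.5 below the lower class limit (the notes only state the rule for the upper boundary).

```python
# Class limits and frequencies copied from the example table in the notes.
class_limits = [(1, 4), (5, 8), (9, 12), (13, 16)]
frequencies = [2, 4, 1, 7]


def summarize(class_limits, frequencies):
    """Return one dict per class with boundaries, midpoint, and frequencies."""
    n = sum(frequencies)  # total number of data entries
    rows = []
    cumulative = 0
    for (lower, upper), f in zip(class_limits, frequencies):
        cumulative += f  # this class plus all previous classes
        rows.append({
            "limits": (lower, upper),
            # Assumption: whole-number data, so boundaries sit 0.5 outside the limits.
            "boundaries": (lower - 0.5, upper + 0.5),
            # Midpoint is the average of the lower and upper class limits.
            "midpoint": (lower + upper) / 2,
            "frequency": f,
            "relative_frequency": f / n,
            "cumulative_frequency": cumulative,
        })
    return rows


rows = summarize(class_limits, frequencies)
for row in rows:
    print(row)
```

Two quick checks on any table built this way: the relative frequencies should add up to 1, and the last cumulative frequency should equal the total number of entries.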
Ex- Create a frequency histogram for the length of gestation using the alpaca data.

Def- A frequency polygon is a line graph that emphasizes the continuous changes in frequency.
1. Make a sketch where the x-axis contains the midpoints and the y-axis contains the frequency.
2. Plot points using midpoints, then connect the dots.
3. Draw extra dots on both ends to extend to the x-axis.

Ex- Create a frequency polygon using the length of gestation for the alpaca data.

Def- A relative frequency histogram has the same shape and horizontal scale as the corresponding frequency histogram, but the y-axis is the relative frequency.

Ex- Draw a relative frequency histogram for the length of gestation using the alpaca data.

Def- A cumulative frequency graph or ogive is a line graph that displays the cumulative frequency on the y-axis and the class boundaries on the x-axis. The first point is always on the x-axis.

Ex- Construct an ogive for the length of gestation using the alpaca data.

2.2 More Graphs and Displays

Def- A stem-and-leaf plot is another way to display quantitative data. The stem contains the leftmost digits and the leaf contains the rightmost digit.

Def- A split-stemplot is a type of stem-and-leaf plot where two or more lines are used for each stem.

Ex- Make a stem-and-leaf plot for the birth weight of alpacas with names beginning with letters between M and Z (inclusive).

Ex- Make a split-stemplot using two lines for each stem.

Def- In a dotplot, each data entry is plotted using a point above the x-axis.

Ex- Construct a dotplot of the data. The data is the daily high temperature during January 2000 in Chicago, IL.
33, 31, 25, 22, 38, 51, 32, 23, 23, 34, 44, 43, 47, 37, 27, 25, 28, 35, 21, 24, 20, 19, 23, 27, 24, 13, 18, 28, 17, 25, 31

Def- A pie chart is a graph that shows relationships of parts to a whole. To do this, find the relative frequency for each category.

Ex- Make a pie chart for the grade of fiber using the alpaca data.
            Grade 1   Grade 2   Grade 3   Grade 4   Died/Sold
Frequency   15        7         9         1         4

Def- A Pareto chart is a vertical bar graph in which the height of each bar represents the frequency or relative frequency. The bars are positioned in order of decreasing height.

Note: A Pareto chart is a type of bar graph. Bar graphs are graphs where the bars do not touch and are used for qualitative data.

Ex- Construct a Pareto chart for the grade of fiber using the alpaca data.

Def- If two data sets have the same number of entries and each entry in the first data set corresponds to one entry in the second data set, then the sets are called paired data sets.

Ex- When a plant’s age is compared to a plant’s height, this is a paired data set because each plant has an age and a height.

Def- A scatterplot is used to graph paired data sets. The ordered pairs are graphed as points on a coordinate plane.

Ex- The following are the height and number of stories of several buildings in Miami. Create a scatterplot using the data.

Height (feet)        764   625   520   510   484   480   450   430   410
Number of Stories     55    47    21    28    35    40    33    31    40

Def- A data set that has entries at regular intervals of time is called a time series.

Def- A time series chart is used to graph a time series.

Ex- Make a time series plot using the data below. The data is based on the number of classes that Miss Sutter taught during the specified year.

Year   # Classes Taught
2000
2001
2002
2003
2004
2005
2006
2007

Ex- Find the class limits and class boundaries for the set of data. Use five classes. Check your answers with others around you.
Data: 23, 23, 25, 26, 27, 28, 28, 29, 29, 29, 29, 33, 33, 33, 33, 33, 35, 36, 36, 36, 37, 37, 38, 38, 39, 39, 39, 42, 42, 42, 42, 43, 43, 43, 44, 44, 44, 44, 45, 46, 46, 46, 47, 49, 49, 49

Ex- Use your knowledge of graphs to fill in the following chart.
Type of Graph        Uses Class   Uses Class   Has Bars   Qualitative   Quantitative
                     Boundaries   Midpoints               Data          Data
Freq Histogram
Freq Polygon
Rel Freq Hist
Ogive (Cum Freq)
Stem-and-leaf Plot
Dotplot
Pie Chart
Pareto Chart
Scatterplot
Time-Series Plot

2.3 Measures of Central Tendency

Def- A measure of central tendency is a value that represents a typical or central entry of a data set. The most common types are mean, median, and mode.

Def- The mean of a data set is the sum of the data entries divided by the number of entries.
Population Mean =
Sample Mean =

Def- The median of a data set is the middle entry when the data set is ordered in ascending order. If the data set has an even number of entries, the median is the mean of the two middle data entries.

Def- The mode of a data set is the data entry that occurs with the greatest frequency. If no entry is repeated, then the data set has no mode. If two entries occur with the same greatest frequency, each entry is a mode and the data set is called bimodal.

Ex- Find the mean, median, and mode of each set of data.
a) The price of gas per gallon in 1995.
$1.75, $1.35, $1.39, $1.69, $1.27, $1.53
b) The length of a child’s foot measured in centimeters.
9, 9, 10, 10, 10, 11, 12, 12, 13, 14, 14, 15

Def- An outlier is a data entry that is far removed from the other entries in the data set.

Def- A weighted mean is the mean of a data set whose entries have varying weights. A weighted mean is given by the formula \( \bar{x} = \frac{\sum (x \cdot w)}{\sum w} \), where w is the weight of each entry x.

Ex- Six tests are given in a class. The first five tests are each worth 15% of the grade and the last test is worth 25% of the grade. Find the weighted mean of the test scores where the scores are 75, 67, 86, 77, 79, 88.

Def- The mean of a frequency distribution for a sample is approximated by the formula \( \bar{x} = \frac{\sum (x \cdot f)}{n} \), where x is the midpoint of the class, f is the frequency of the class, and \( n = \sum f \).

Ex- Find the mean of the frequency distribution.
The data is the length of gestation using the alpaca data.

Length of Gestation (days)   Frequency   Midpoint of Class
316 – 332                    13
333 – 349                    12
350 – 366                    7
367 – 383                    2
384 – 400                    2

Shapes of Distributions
Symmetric, Uniform (rectangular), Skewed Left (negatively skewed), Skewed Right (positively skewed)

Def- A frequency distribution is symmetric when a vertical line can be drawn through the middle of the graph and both halves look the same.

Def- A frequency distribution is uniform or rectangular if all classes have equal frequencies.

Def- A frequency distribution is skewed left or negatively skewed if its tail extends to the left.

Def- A frequency distribution is skewed right or positively skewed if its tail extends to the right.

Ex- Match the distribution with one of the graphs stated in the definitions above.
a) The frequency distribution of all ages of people who have played baseball.
b) The frequency distribution of scores on a test where a majority of the class receives low scores.

Ex- Find the mean of the data set shown below.

2.4 Measures of Variation

Ways to Measure the Variation of a Data Set
Range, Deviation, Population Variance, Population Standard Deviation, Sample Variance, Sample Standard Deviation

Def- The range of a data set is the difference between the maximum and minimum data entries.
Range=

Def- The deviation of an entry x in a population data set is the difference between the entry and the mean of the data set.
Deviation of x=

Def- The population variance of a population data set of N entries is \( \sigma^2 = \frac{\sum (x - \mu)^2}{N} \).

Def- The population standard deviation of a population of N entries is the square root of the population variance: \( \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \).

Visual Representation of a Standard Deviation
Data set for the population: 5, 6, 7, 8, 9, 10, 11 (here \( \mu = 8 \))
Squared distances from \( \mu \): \( 3^2, 2^2, 1^2, 0^2, 1^2, 2^2, 3^2 \)
Average of the squared distances from \( \mu \): \( \frac{9 + 4 + 1 + 0 + 1 + 4 + 9}{7} = \frac{28}{7} = 4 \) (variance)
Standard deviation: \( \sqrt{4} = 2 \)

Ex- Find the population variance and population standard deviation of the data set.
The data consists of the number of inches of snow on Mt. Shasta during a 10 day period.
15, 10, 11, 4, 17, 19, 28, 35, 9, 18

Def- The sample variance of a set of n entries is \( s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \).

Def- The sample standard deviation of a set of n entries is the square root of the sample variance: \( s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \).

Ex- The following data is a sample of teachers’ salaries. Find the sample variance and sample standard deviation.
$46098, $36259, $35084, $38617, $42690, $26202, $47169, $37109

Note: The standard deviation tells you roughly the average distance that the data values are from the mean. A large standard deviation indicates that the data are spread out while a small standard deviation indicates that the data are clumped together.

Empirical Rule (68%-95%-99.7% Rule)
For data with a symmetric bell-shaped distribution, 68% of the data lies within 1 standard deviation of the mean, 95% of the data lies within 2 standard deviations of the mean, and 99.7% of the data lies within 3 standard deviations of the mean.

Ex- The average rate for cable TV from a sample of households was $29 per month with a sample standard deviation of $2.50 per month. Assume that the distribution of cable TV prices is bell-shaped.
a) Between which two values does 99.7% of the data lie?
b) Estimate the portion of cable TV bills that are between $26.50 and $31.50.
c) Estimate the portion of cable TV bills that are between $26.50 and $36.50.

Chebychev’s Theorem (Applies to all distributions)
The portion of any data set lying within k standard deviations of the mean (k > 1) is at least \( 1 - \frac{1}{k^2} \).
For k = 2, at least _____ of the data lies within 2 std. dev. of the mean.
For k = 3, at least _____ of the data lies within 3 std. dev. of the mean.

Ex- In a sample of 40 customers, each customer spends an average of $23 at the grocery store with a sample std. dev. of $6. Using Chebychev’s Theorem, at least what percent of customers spend between $11 and $35 at the grocery store?
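Chebychev’s bound and the "within k standard deviations" interval can be sketched in a few lines of Python. This is only an illustration: the function names `chebychev_lower_bound` and `interval_within` are made up for this example, and the mean of 50 and standard deviation of 10 are hypothetical values, not taken from the alpaca data or from the exercises above.

```python
def chebychev_lower_bound(k: float) -> float:
    """Minimum portion of any data set within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the bound 1 - 1/k^2 is zero or negative, so it says nothing useful.
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2


def interval_within(mean: float, std_dev: float, k: float) -> tuple:
    """Endpoints of the interval from k standard deviations below to k above the mean."""
    return (mean - k * std_dev, mean + k * std_dev)


# Hypothetical numbers chosen only for illustration.
print(interval_within(50, 10, 2))   # (30, 70)
print(chebychev_lower_bound(4))     # 0.9375
```

To use the theorem on a word problem, first work out how many standard deviations the interval endpoints are from the mean, then plug that k into the bound. Unlike the Empirical Rule, the result is a guaranteed minimum rather than an estimate, and it does not require a bell-shaped distribution.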
Note: You will never be tested on Chebychev’s Theorem.

Def- The sample standard deviation for a frequency distribution (grouped data) is \( s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n - 1}} \), where x is the midpoint of the class, \( \bar{x} = \frac{\sum (x \cdot f)}{n} \), and \( n = \sum f \).

Ex- Find the mean and standard deviation for the following data. The data is for the length of gestation using the alpaca data. (More room to write is given on the next page.)

Length of Gestation (days)   Frequency   Midpoint of Class
316 – 332                    13          324
333 – 349                    12          341
350 – 366                    7           358
367 – 383                    2           375
384 – 400                    2           392

Def- The coefficient of variation, CV, of a data set describes the standard deviation as a percent of the mean.
Population: \( CV = \frac{\sigma}{\mu} (100\%) \)
Sample: \( CV = \frac{s}{\bar{x}} (100\%) \)

Note: The coefficient of variation measures the variation of a data set relative to the mean of the data set.

Ex- Find the coefficient of variation for the two data sets, then compare the results. The following is a sample of home prices in California and Kentucky (in thousands of dollars).
CA: 760, 317, 490, 324, 568, 736, 488, 172, 224, 374
KY: 270, 399, 230, 499, 140, 120, 99, 250, 325, 300, 235

2.5 Measures of Position

Def- Fractiles are numbers that partition an ordered data set into equal parts.

Def- The quartiles \( Q_1 \), \( Q_2 \), and \( Q_3 \) divide an ordered set into four equal parts.
_____ of the data fall at or below \( Q_1 \).
_____ of the data fall at or below \( Q_2 \).
_____ of the data fall at or below \( Q_3 \).

Ex- Find the quartiles of each data set.
a) 1, 1, 2, 3, 5, 6, 7, 7, 7, 9, 10
b) 2, 3, 6, 6, 7, 8, 9, 14

Def- The interquartile range, IQR, is the difference between the first and third quartiles.
IQR=

Five Number Summary
Minimum (smallest data value), \( Q_1 \), \( Q_2 \) (median), \( Q_3 \), and Maximum (largest data value)

Def- A box-and-whisker plot or boxplot uses the five number summary and includes all data values.

Ex- Draw a boxplot for each set of data.
a) 11, 12, 14, 17, 17, 18, 18, 18, 20, 31
b) 9, 9, 9, 11, 13, 14, 14, 15, 16, 16, 17

Def- For whole numbers P such that \( 1 \le P \le 99 \), the Pth percentile of a distribution is a value such that P% of the data fall at or below it and (100 – P)% of the data fall at or above it.

Ex- If George’s test score was at the 54th percentile, then 54% of the class scored at or below George’s score and 46% of the class scored at or above George’s score.

Ex- If Mary got a 95% on her test, what percentile rank does her score correspond to?

Note: The following topic of z-scores will be covered in Chapter 5.

Def- The standard score or z-score represents the number of standard deviations a given data value, x, falls from the mean \( \mu \).
z-score z=
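
The z-score definition above can be sketched in code. The formula line is left blank in these notes, so the sketch uses the usual textbook form, z = (x − mean) / standard deviation, as an assumption rather than as the notes' own wording. The function name `z_score` and the numbers in the example (a score of 85 with mean 70 and standard deviation 10) are hypothetical.

```python
def z_score(x: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    if std_dev <= 0:
        # A standard deviation of zero would mean dividing by zero.
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


# Hypothetical example: 85 is 15 points above a mean of 70, and 15 / 10 = 1.5.
print(z_score(85, 70, 10))  # 1.5
```

A positive z-score means the value is above the mean, a negative one means it is below, and a z-score of 0 means the value equals the mean.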