Math 103 Lecture 9 class notes

Statistics (from the New Latin statisticus, “of the state”) is the study of methods of collecting, organizing, presenting, analyzing, and drawing conclusions about data, commonly in numerical form.

The three branches of statistics are: descriptive, inferential, and survey/sampling.

Descriptive Statistics: organizing, summarizing, graphing and presenting data

I. Organize data into frequency tables
   a. class and frequency
   b. extended table includes relative frequency, cumulative frequency, and cumulative relative frequency, as well as class marks
II. Make charts or graphs
   a. histogram and bar graphs
   b. frequency curve or polygon
   c. ogive
   d. box & whisker or boxplot
   e. circle or pie graph
   f. stem & leaf
   g. pictographs
   h. scatter plots
   i. line plots
III. Calculate measures
   a. central tendency (mean, median, mode)
   b. variation (range, standard deviation)
   c. position (percentiles, quartiles)

I. Organize data into frequency tables

Frequency Table = an excellent device for making larger collections of data much more intelligible. A frequency table is so named because it lists categories (classes) of scores along with their corresponding frequencies. The frequency for a category or class is the number of original scores that fall into that class.

The columns of an extended frequency table generate various graphs or charts. Extended frequency tables are therefore important prerequisites for creating the graphs and charts used in statistics.

Guidelines for frequency tables:
1. Class intervals should not overlap. Classes are mutually exclusive.
2. Classes should continue throughout the distribution with NO gaps. Include all classes.
3. All classes should have the same width.
4. Class widths should be “convenient” numbers.
5. Use 5-20 classes.
6. Make lower or upper limits multiples of the width.

An extended frequency table includes the following:
a. class intervals (lower and upper limits)
b. marks
c. frequency
d. cumulative frequency
e. relative frequency
f. cumulative relative frequency

Example Data Set: Dr. Brown’s Exam Scores

98 90 85 84 81 79 76 73 69 60
98 90 85 83 80 79 75 72 68 60
93 88 85 82 80 78 75 71 67 59
93 87 84 82 79 77 74 70 64 57
91 86 84 81 79 77 74 70 63 54

Note: Typically, you will have to rank data first; data does not usually come ordered!

The first thing to do with numerical data is to organize it into a frequency table. Each column of a frequency table generates (is used to create) a particular graph or chart.

Extended Frequency Table of Dr. Brown’s Exam Scores

class    mark    freq.    cumulative freq.    relative freq.    cumulative relative freq.    boundaries
50-54
55-59
60-64
65-69
70-74
75-79
80-84
85-89
90-94
95-99
100+

The width of each class is 5 (the size of each class).
The lower limits are the smaller numbers of each class (50, 55, 60, 65, 70, etc.).
The upper limits are the larger numbers of each class (54, 59, 64, 69, 74, etc.).
Note: the class limits (either lower or upper) should be a multiple of the width.
The mark is the midpoint of each class.
Only the last class can be "open-ended."
There should be no "gaps" in organizing classes.
There should be no "overlap" in class numbers.

II. Make charts or graphs

Histogram: a type of bar graph representing an entire set of data. It is helpful when you need to discover or display the distribution of interval or ratio data. Histograms illustrate central tendency, shape, and how the data are spread out or dispersed.

A histogram is made up of the following components:
1. a title, which identifies the population of concern
2. a vertical scale, which identifies the frequencies in the various classes
3. a horizontal scale, which identifies the variable. Values for class boundaries, class limits, or class marks may be labeled along the axis.

Shapes of histograms: symmetrical, uniform, skewed, J-shaped, and bimodal.
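
The frequency column that a histogram plots comes straight from the extended frequency table. As a rough illustration, the sketch below fills in that table for Dr. Brown's exam scores using the classes from the notes (width 5, first lower limit 50). The function name `extended_frequency_table` and the dictionary keys are made up for this example, the scores are assumed to be whole numbers, and the open-ended 100+ class is left out because no score reaches it.

```python
# Dr. Brown's exam scores, copied from the example data set (50 values).
scores = [
    98, 90, 85, 84, 81, 79, 76, 73, 69, 60,
    98, 90, 85, 83, 80, 79, 75, 72, 68, 60,
    93, 88, 85, 82, 80, 78, 75, 71, 67, 59,
    93, 87, 84, 82, 79, 77, 74, 70, 64, 57,
    91, 86, 84, 81, 79, 77, 74, 70, 63, 54,
]


def extended_frequency_table(data, first_lower=50, width=5, n_classes=10):
    """Return one dict per class (hypothetical helper, not from the notes)."""
    n = len(data)
    rows = []
    cumulative = 0
    for k in range(n_classes):
        lower = first_lower + k * width
        upper = lower + width - 1            # whole-number scores: 50-54, 55-59, ...
        freq = sum(1 for x in data if lower <= x <= upper)
        cumulative += freq                   # running total of frequencies
        rows.append({
            "class": f"{lower}-{upper}",
            "mark": (lower + upper) / 2,     # class midpoint
            "freq": freq,
            "cum_freq": cumulative,
            "rel_freq": freq / n,
            "cum_rel_freq": cumulative / n,
        })
    return rows


table = extended_frequency_table(scores)
for row in table:
    print(
        f"{row['class']:>6}  mark={row['mark']:<5} freq={row['freq']:<3}"
        f" cum={row['cum_freq']:<3} rel={row['rel_freq']:.2f}"
        f" cum_rel={row['cum_rel_freq']:.2f}"
    )
```

The frequencies should add up to the number of scores (50), and the last cumulative relative frequency should be 1.00; both are quick checks that no score fell into a gap between classes.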
Frequency Curve or Polygon: the horizontal axis uses marks. The vertical axis is either frequency or relative frequency. Several sets of data can be depicted on the same graph.

Ogive: a cumulative frequency curve, always with a typical “upward” trend.

Box-&-Whisker = a representation of the data set that splits the distribution into four groups of 25% each, often referred to as the quartile distribution. Several sets of data can be pictured side-by-side using box-&-whisker plots, making data comparisons easier for the reader.

The “key” points are:
1. 0% (or 10%), the low end of the lower whisker
2. 25%, the first quartile (left hinge)
3. 50%, the median
4. 75%, the third quartile (right hinge)
5. 100% (or 90%), the high end of the upper whisker

III. Calculate Measures

AVERAGES:

Mode = the data value that occurs most frequently.
Ex: 6 7 8 9 9 10 (mode = 9)
Another ex: 6 3 2 3 3 5 3 2 (mode = 3)
If you cannot identify the ONE value that occurs most frequently, the data set has no mode.
Ex: 3 3 4 5 5 7 (3 and 5 each occur twice, so there is no mode)

Median = the middle score in ranked data.
Ex: 3 4 6 8 9 11 15 27 31 (median = 9)
When there is an even number of data values, the median is halfway between the two middle scores.
Ex: 3 5 6 7 9 10 10 12 (median = (7 + 9)/2 = 8)
The median need not be a member of the data set.

Midrange = the value halfway between the highest and lowest data values.
Ex: 6 7 8 9 9 10 (midrange = (6 + 10)/2 = 8)
The midrange need not be a member of the data set.

Midhinge = the value halfway between the left hinge and right hinge of a box-&-whisker plot.
The midhinge need not be a member of the data set.

Mean = the sum of all data values divided by the number of pieces of data.
Ex: 6 3 8 5 3
Mean = (6 + 3 + 8 + 5 + 3)/5 = 5
Ex: 85 76 93 82 96
Mean = (85 + 76 + 93 + 82 + 96)/5 = 86.4
The mean need not be a member of the data set.
The mean is the most common measure of central tendency and is the statistic usually meant by the word “average.”
The mean is the “balance point” of a distribution: the sum of the distances from the mean to the values on its right equals the sum of the distances to the values on its left.

Ex: There is a salary dispute between management and labor at Castellon Manufacturing.
The labor union claims that the average salary is only $3000. Management says the average salary is $7300. You have been called in as a federal mediator. The first thing you need to do is to figure out the average salary. Suppose there are only 10 employees and you can get their monthly salaries from payroll. They are:
$3000, $3000, $3000, $3500, $4000, $4500, $6000, $6000, $15000 and $25000
Does the union’s claim of $3000 seem like the “average”?
Does management’s claim of $7300 seem like the “average”?
Both figures are “averages” of the same payroll: $3000 is the mode and $7300 is the mean. The median, (4000 + 4500)/2 = $4250, lies between them.

Weighted Mean = a mean in which each value is multiplied by its weight before summing, and the total is divided by the sum of the weights.
Ex: Suppose one class of 20 students averaged 80% on a test, while another class of 30 students averaged 74%. What is the average for the combined group of students?
Weighted mean = (20 × 80 + 30 × 74)/(20 + 30) = 3820/50 = 76.4%

DISPERSION OR VARIATION

Range = the difference or distance between the highest and lowest data values.

Variance, σ^2 = the sum of squared deviations from the mean divided by the number of data points: σ^2 = Σ(x – µ)^2 / n

Standard Deviation = √variance:
σ = √[ Σ(x – µ)^2 / n ] for a population, or
s = √[ Σ(x – x̄)^2 / (n – 1) ] for a sample, where x̄ is the sample mean.

Note: for a roughly bell-shaped distribution, virtually all of the data spans (the range is) about 6 standard deviations, 3 on each side of the mean.
Standard deviation is usually rounded to 1-2 decimal places.

Ex: data: 1 3 5 6 6 9
The mean is 5 and the squared deviations sum to 16 + 4 + 0 + 1 + 1 + 16 = 38.
s = √(38/5) ≈ 2.76 (dividing by n – 1), or σ = √(38/6) ≈ 2.52 (dividing by n)

POSITION

Quartiles = numbers that divide ranked data into fourths. A data set has 3 quartiles.
1st Quartile = a number such that at most 1/4 of the data are smaller in value, and at most 3/4 are larger.
2nd Quartile = median
3rd Quartile = a number such that at most 3/4 of the data are smaller in value, and at most 1/4 are larger.

Percentiles = numbers that divide ranked data into 100 parts. A data set has 99 percentiles.

Deciles = numbers that divide ranked data into 10 parts. A data set has 9 deciles.

Here’s an example using a small data set, which contains an odd number of values:
35 47 48 50 51 53 54 70 75
Split the data in half, at the median, then find the median of each half.
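With an odd number of values, the median (51) is included in both halves. The lower half is 35 47 48 50 51, so Q1 = 48; the upper half is 51 53 54 70 75, so Q3 = 54.
Interquartile range, IQR = Q3 – Q1 = 54 – 48 = 6

Here’s an example using a small data set, which contains an even number of values:
35 47 48 50 51 53 54 60 70 75
Split the data in half, at the median, then find the median of each half.
The median is (51 + 53)/2 = 52. The lower half is 35 47 48 50 51, so Q1 = 48; the upper half is 53 54 60 70 75, so Q3 = 60.
Interquartile range, IQR = Q3 – Q1 = 60 – 48 = 12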
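
The split-at-the-median procedure used in the two examples above can be sketched in a few lines of Python. The helper name `hinges` is made up for this illustration. It assumes the convention these examples follow: when the number of values is odd, the median is kept in both halves. Calculators and software packages may use other quartile conventions and can give slightly different answers.

```python
from statistics import median


def hinges(data):
    """Return (Q1, median, Q3) by splitting ranked data at the median.

    Hypothetical helper for illustration. When the count is odd, the
    middle value is kept in both halves.
    """
    ranked = sorted(data)            # data usually does not come ordered
    n = len(ranked)
    half = (n + 1) // 2              # size of each half
    lower = ranked[:half]            # smallest values
    upper = ranked[n - half:]        # largest values
    return median(lower), median(ranked), median(upper)


odd_set = [35, 47, 48, 50, 51, 53, 54, 70, 75]
even_set = [35, 47, 48, 50, 51, 53, 54, 60, 70, 75]

for data in (odd_set, even_set):
    q1, q2, q3 = hinges(data)
    print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {q3 - q1}")
```

Running it on the two data sets gives IQR values of 6 and 12, matching the hand calculations.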