Organizing and Describing Data

Techniques for continuous variables

Continuous variables are measurements that vary over a continuum (weight, blood pressure, etc.), as opposed to categorical variables (gender, religion, marital status, etc.).

The Grouped Frequency Table and the Histogram

To construct a grouped frequency table (and from it a histogram):
1. Find the maximum and minimum of the observations.
2. Choose non-overlapping intervals of equal width (the class intervals) that cover the range between the maximum and the minimum.
3. The endpoints of the intervals are called the class boundaries.
4. Count the number of observations in each interval (the cell frequency, f).
5. Calculate the relative frequency: relative frequency = f/N, where N is the total number of observations.

Data Set #3

The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program.

Student   Verbal IQ   Math IQ   Initial Reading   Final Reading
                                Achievement       Achievement
   1          86         94          1.1               1.7
   2         104        103          1.5               1.7
   3          86         92          1.5               1.9
   4         105        100          2.0               2.0
   5         118        115          1.9               3.5
   6          96        102          1.4               2.4
   7          90         87          1.5               1.8
   8          95        100          1.4               2.0
   9         105         96          1.7               1.7
  10          84         80          1.6               1.7
  11          94         87          1.6               1.7
  12         119        116          1.7               3.1
  13          82         91          1.2               1.8
  14          80         93          1.0               1.7
  15         109        124          1.8               2.5
  16         111        119          1.4               3.0
  17          89         94          1.6               1.8
  18          99        117          1.6               2.6
  19          94         93          1.4               1.4
  20          99        110          1.4               2.0
  21          95         97          1.5               1.3
  22         102        104          1.7               3.1
  23         102         93          1.6               1.9

Grouped frequency table for Verbal IQ and Math IQ

Class interval   Verbal IQ   Math IQ
 70 to 80            1          1
 80 to 90            6          2
 90 to 100           7         11
100 to 110           6          4
110 to 120           3          4
120 to 130           0          1

In this example the upper endpoint is included in the interval; the lower endpoint is not.

[Figure: Histogram of Verbal IQ; frequency against the class intervals 70 to 80 through 120 to 130]

[Figure: Histogram of Math IQ; frequency against the class intervals 70 to 80 through 120 to 130]

Example
• In this example we are comparing, for two drugs A and B, the time to metabolize the drug.
• 120 cases were given drug A.
• 120 cases were given drug B.
• The time to metabolize each drug is listed below.

Drug A
22.6 31.5  7.2 13.0 11.2  6.4  4.8  3.5 11.9  4.1
 6.7  6.7  6.0  7.7 11.7 11.7  8.5 30.0  7.2 10.0
17.8  6.3 11.4  6.4  8.1  5.7  3.2 13.4  7.8 16.8
 9.0  8.9 10.5 13.1  6.4 21.9  6.3  6.2  5.4 17.2
18.8  7.2 12.9  6.3 13.6  4.3  7.5 14.1 21.9  7.4
 8.8 10.5 12.6 14.9  6.2  2.9  5.2  3.8  9.7 19.6
10.5  3.5 12.7 20.1 25.3 11.2  2.0  1.8 22.0  5.1
20.1  7.0  6.0  8.0  6.0  3.8 13.6  8.5  9.8 33.5
 6.5  4.7  5.3  7.4  2.5 18.7  5.6  2.3  7.9  6.8
12.3 10.1 14.9 19.2 10.8  9.3 14.9 11.8 12.7  1.5
11.8  5.1 18.0  4.1  9.0  6.5 15.4  3.9  4.8  6.3
 4.3 17.4 11.3  2.7 30.0  3.1 10.9  3.3 28.3  6.4

Drug B
 4.2 10.4  8.2 13.4 10.5  5.6 19.0  4.5 25.9  3.2
 5.5  7.8  6.0  5.3  4.8 10.8  5.4 25.2  6.6  5.1
12.8  5.4  6.0  4.3  6.0  7.3  5.9 10.2 10.4  2.7
 4.6  3.5  2.9  3.0  4.6 13.4  8.3  2.9 15.1  4.0
 3.2  5.0  4.9  2.7 14.3  9.6 10.6  2.8 12.9  4.2
 2.7  5.4  4.4  5.7  7.7  5.8  4.1 11.5 12.3  5.1
 7.8  5.1  5.9 10.3 12.4  4.7  6.3  9.4  4.5  3.3
 7.5 12.6  4.1  3.0  4.8  5.3  9.3  8.8 10.9  7.4
 3.2  5.1 17.0 20.9  8.1  4.8  9.3 24.1  2.6 13.7
 5.1  8.8  5.0  9.7  4.1  7.7  8.3  5.9  6.0 16.0
 8.8 14.1  2.5 15.3  5.2  7.8 11.4  9.2 10.6  3.7
 5.0  8.5 12.1  8.5  6.9 12.1  8.0  4.1  2.3  2.8

Grouped frequency tables

Class interval   Drug A   Drug B
  0 to 4           15       19
  4 to 8           43       54
  8 to 12          26       26
 12 to 16          15       15
 16 to 20           9        2
 20 to 24           6        1
 24 to 28           1        3
 28 to 32           4        0
 32 to 36           1        0
 36 to 40           0        0
 40 to 44           0        0
 44 to 48           0        0

[Figure: Histogram of drug A (time to metabolize); frequency against the class intervals 0 to 4 through 44 to 48]

[Figure: Histogram of drug B (time to metabolize); frequency against the class intervals 0 to 4 through 44 to 48]

The Grouped Frequency Table and the Histogram

To construct a grouped frequency table, follow the five steps listed at the start of this section: find the maximum and minimum, choose the class intervals, count the cell frequencies f, and calculate the relative frequencies f/N.

To draw a histogram

Draw above each class interval a vertical bar whose height is proportional to either the cell frequency (f) or the relative frequency (f/N).

[Figure: sketch of a histogram; vertical axis: frequency (f) or relative frequency (f/N); horizontal axis: class interval]

Some comments about histograms
• The width of the class intervals should be chosen so that the number of intervals with a frequency less than 5 is small.
• This means that the width of the class intervals can decrease as the sample size increases.
• If the width of the class intervals is too small, the frequency in each interval will be either 0 or 1. [Figure: histogram with class intervals that are too narrow]
• If the width of the class intervals is too large, one class interval will contain all of the observations. [Figure: histogram with a single class interval]
• Ideally the histogram falls between these two extremes. This is achieved by making the width of the class intervals as small as possible while allowing only a few intervals to have a frequency less than 5. [Figure: histogram with well-chosen class intervals]
• As the sample size increases the histogram will approach a smooth curve.
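The construction steps and the comments on class width can be tried out on the Verbal IQ scores from Data Set #3. The following is a minimal Python sketch, not part of the original slides: the function name grouped_frequency and its parameters (lower, width, n_classes) are our own, and it assumes the convention used in the IQ example, where the upper endpoint of each class interval is included and the lower endpoint is not.

```python
import math


def grouped_frequency(values, lower, width, n_classes):
    """Count observations in n_classes intervals (a, a + width], starting at lower.

    Returns one row per class interval: (low, high, f, f / N).
    """
    counts = [0] * n_classes
    for v in values:
        # Upper endpoint included, lower endpoint excluded (assumed convention).
        k = math.ceil((v - lower) / width) - 1
        if not 0 <= k < n_classes:
            raise ValueError(f"{v} is not covered by the class intervals")
        counts[k] += 1
    n = len(values)
    return [(lower + k * width, lower + (k + 1) * width, f, f / n)
            for k, f in enumerate(counts)]


# Verbal IQ scores of the 23 students in Data Set #3.
verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]

table = grouped_frequency(verbal_iq, lower=70, width=10, n_classes=6)
for low, high, f, rel in table:
    print(f"{low:>3} to {high:<3}  f = {f:>2}  f/N = {rel:.3f}")
```

With width=10 the frequencies come out as 1, 6, 7, 6, 3, 0, matching the Verbal IQ column of the grouped frequency table. Calling the same function with lower=70, width=60, n_classes=1 puts all 23 observations into one class interval, which is the "too large" case described in the comments.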
The smooth curve that the histogram approaches as the sample size increases is the histogram of the population.

[Figures: histograms of samples of size N = 25, N = 100, N = 500 and N = 2000 over the range 60 to 150, followed by the limiting smooth curve (N = ∞) plotted over the range 50 to 150]

Comment: the proportion of area under a histogram between two points estimates the proportion of cases in the sample (and the population) between those two values.

Example: The following histogram displays the birth weight (in kg) of n = 100 births.

[Figure: histogram of birth weight; 14 class intervals, from 0.085 to 0.113 up to 0.454 to 0.482; vertical axis: frequency, with the frequency printed above each bar]

Find the proportion of births that have a birth weight less than 0.34 kg.
Adding the frequencies of the class intervals below 0.34 and dividing by n:

Proportion = (1 + 1 + 3 + 10 + 11 + 19 + 17)/100 = 0.62

The Characteristics of a Histogram
• Central Location (average)
• Spread (Variability, Dispersion)
• Shape

[Figures: histograms illustrating central location and spread (dispersion, variability)]

Shapes illustrated:
• Bell shaped (Normal)
• Positively skewed
• Negatively skewed
• Platykurtic
• Leptokurtic
• Bimodal

[Figures: one histogram or curve for each of the shapes listed above]

The Stem-Leaf Plot

An alternative to the histogram.

Each number in a data set can be broken into two parts:
– A stem
– A leaf

Example: Verbal IQ = 84
– Stem = tens digit = 8
– Leaf = units digit = 4

Example: Verbal IQ = 104
– Stem = tens digits = 10
– Leaf = units digit = 4

To construct a stem-leaf diagram
• Make a vertical list of "all" stems.
• Then behind each stem make a horizontal list of each leaf.

Example

The data are those of Data Set #3, on N = 23 students who have recently completed a reading improvement program. The variables are:
• Verbal IQ
• Math IQ
• Initial Reading Achievement Score
• Final Reading Achievement Score

We now construct a stem-leaf diagram of Verbal IQ.

A vertical list of the stems:

 8
 9
10
11
12

We now list the leaves behind each stem, taking the observations in the order they appear in the table (86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119, 82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102):

 8   664209
 9   60549495
10   455922
11   891
12

The leaves may be arranged in order:

 8   024669
 9   04455699
10   224559
11   189
12

The stem-leaf diagram is equivalent to a histogram: the length of each row of leaves is the frequency of the corresponding class interval. Rotating the stem-leaf diagram we have:

[Figure: the stem-leaf diagram rotated so that the stems 80, 90, 100, 110, 120 lie along the horizontal axis]

The two-part stem-leaf diagram

Sometimes you want to break the stems into two parts. In the diagrams below, a stem marked * holds the leaves 0, 1, 2, 3, 4 and the unmarked stem holds the leaves 5, 6, 7, 8, 9.

Stem-leaf diagram for Initial Reading Achievement:

1.   0124444455556666677789
2.   0

This diagram as it stands does not give an accurate picture of the distribution, because almost all of the observations fall on a single stem. We try breaking the stems into two parts:

1.*  01244444
1.   55556666677789
2.*  0
2.

The five-part stem-leaf diagram

If the two-part stem-leaf diagram is not adequate you can break the stems into five parts:
* for leaves 0, 1
t for leaves 2, 3
f for leaves 4, 5
s for leaves 6, 7
the unmarked stem for leaves 8, 9

We try breaking the stems into five parts:

1.*  01
1.t  2
1.f  444445555
1.s  66666777
1.   89
2.*  0

[Figures: stem-leaf diagrams of Verbal IQ, Math IQ, Initial RA and Final RA]

Some Conclusions
• Math IQ and Verbal IQ seem to have approximately the same distribution: "bell shaped", centered about 100.
• Final RA seems to be larger than Initial RA and more spread out:
  – there is improvement in RA
  – the amount of improvement is quite variable

Next Topic
• Numerical Measures: Location
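Returning to the stem-leaf construction for Verbal IQ, the two steps (list the stems, then list the leaves behind each stem) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name stem_leaf is our own, and it assumes whole-number data where the stem is everything but the units digit and the leaf is the units digit.

```python
def stem_leaf(values):
    """Return {stem: ordered list of leaves} for whole-number data.

    The stem is the value without its units digit; the leaf is the units digit.
    Every stem between the smallest and largest observed stem is listed,
    even if no observation falls on it.
    """
    rows = {}
    for v in values:
        stem, leaf = divmod(v, 10)
        rows.setdefault(stem, []).append(leaf)
    return {stem: sorted(rows.get(stem, []))
            for stem in range(min(rows), max(rows) + 1)}


# Verbal IQ scores of the 23 students in Data Set #3.
verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]

diagram = stem_leaf(verbal_iq)
for stem, leaves in diagram.items():
    print(f"{stem:>2}   {''.join(str(d) for d in leaves)}")
```

The printed rows agree with the ordered diagram for stems 8 to 11. The slides also list an empty stem 12, which this sketch omits because no Verbal IQ score reaches 120.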