BIOSTATISTICS NURS 3324

3 Organizing and Displaying Data

Any survey or experiment yields a list of observations. These need to be organized and summarized in a logical fashion so that the outcome can be perceived clearly. Tables and graphs are commonly used to organize, summarize, and describe data.

A. Frequency Tables / Frequency Distributions

Considerable information can be obtained from large masses of statistical data by grouping the raw data into classes and determining the number of observations that fall in each class. Such an arrangement is called a frequency distribution or frequency table. A frequency table is often the most convenient way of summarizing or displaying data. The types of frequency distributions considered here are categorical (qualitative) frequency distributions and grouped frequency distributions.

Categorical or Simple Frequency Distributions

Categorical frequency distributions represent data that can be placed in specific categories, such as gender, hair color, or blood group.

Example: The blood types of 25 blood donors are given below. Summarize the data using a frequency distribution.

AB O B A A B B O O B A O B AB AB O A B AB O B O B O A

Solution: We represent the blood types as classes and the number of occurrences of each blood type as frequencies. The frequency table (distribution) below summarizes the data.

Frequency Table for the above Example

Class (Blood Type)    Frequency
A                     5
B                     8
O                     8
AB                    4
Total                 25

Grouped Frequency Distributions

A grouped frequency distribution is obtained by constructing class intervals for the data and then listing the number of values (frequency count) that fall in each interval. Tables 3.2 and 3.3 are examples of frequency tables, constructed from the systolic blood pressure readings (by smoking status) of Table 3.1.
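As a minimal sketch, the tally for the blood-type example above can be reproduced in Python. The variable names donors and frequency are illustrative assumptions and are not part of these notes; Counter comes from the Python standard library.

```python
from collections import Counter

# Blood types of the 25 donors, copied from the categorical example.
donors = "AB O B A A B B O O B A O B AB AB O A B AB O B O B O A".split()

# Counter tallies how many times each category (class) occurs.
frequency = Counter(donors)

# Print one row per class, then the total as a check that nothing was missed.
for blood_type in ("A", "B", "O", "AB"):
    print(f"{blood_type:<6}{frequency[blood_type]:>3}")
print(f"{'Total':<6}{sum(frequency.values()):>3}")
```

The total row plays the same role as the Total line of the frequency table: it confirms that all 25 observations were counted.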
Table 3.1 Smoking status and systolic blood pressure for a sample of 100 individuals

ID   Status*  SBP      ID   Status*  SBP      ID   Status*  SBP
1    1        102      36   0        142      71   1        116
2    0        138      37   0        122      72   0        154
3    1        190      38   0        146      73   1        126
4    1        122      39   1        126      74   1        140
5    0        128      40   1        176      75   0        122
6    0        112      41   1        104      76   0        154
7    0        128      42   0        112      77   1        140
8    1        116      43   1        140      78   0        120
9    0        134      44   1        102      79   0        140
10   0        104      45   1        142      80   1        114
11   1        116      46   1        146      81   0        122
12   0        152      47   0        92       82   0        94
13   0        134      48   1        112      83   0        122
14   0        132      49   0        152      84   0        172
15   0        130      50   1        116      85   0        100
16   0        118      51   0        118      86   1        150
17   1        136      52   0        128      87   0        154
18   0        108      53   1        116      88   0        170
19   0        108      54   0        134      89   1        140
20   0        128      55   1        108      90   0        144
21   1        118      56   0        134      91   0        156
22   1        134      57   0        124      92   0        132
23   1        178      58   0        124      93   0        140
24   0        134      59   0        114      94   1        150
25   1        162      60   0        154      95   0        130
26   0        162      61   1        114      96   0        118
27   1        120      62   0        114      97   0        162
28   0        98       63   1        98       98   0        128
29   0        144      64   0        128      99   0        130
30   0        118      65   1        130      100  1        208
31   0        118      66   1        122
32   1        138      67   1        112
33   0        134      68   0        106
34   0        108      69   0        128
35   0        96       70   0        128

*: 1 = smoker, 0 = nonsmoker. SBP = systolic blood pressure in millimeters of mercury.

How to construct a frequency table?

1. Arrange the data into an array, i.e., a listing of all observations from smallest to largest, in order to determine the interval spanned by the data. For example, the blood pressures of the smokers span 98-208.

Systolic Blood Pressure of Smokers from Table 3.1 (n = 37)

98  102 102 104 108 112 112 114 114 116
116 116 116 116 118 120 122 122 126 126
130 134 136 138 140 140 140 140 142 146
150 150 162 176 178 190 208

Systolic Blood Pressure of Nonsmokers from Table 3.1 (n = 63)

92  94  96  98  100 104 106 108 108 108
112 112 114 114 118 118 118 118 118 120
122 122 122 122 124 124 128 128 128 128
128 128 128 128 130 130 130 132 132 134
134 134 134 134 134 138 140 140 142 144
144 146 152 152 154 154 154 154 156 162
162 170 172

2. Determine the range (R), the difference between the largest and smallest values in the set of observations, i.e.,
R = largest data value - smallest data value = 208 - 98 = 110 mmHg (for the smokers).

3. Divide the range into a number of equal and non-overlapping segments called class intervals.

Important Note
The number of intervals should in general range from 5 to 15. With too many class intervals, the data are not summarized enough for a clear visualization of how they are distributed. With too few, the data are over-summarized and some details of the distribution may be lost.

Sturges's formula and the desired number of class intervals
Those who wish more specific guidance in deciding how many class intervals are needed may use Sturges's formula:
k = 1 + 3.322 log10(n),
where k is the number of class intervals and n is the number of values in the data set under consideration (the sample size).

Example: Determine k for the 37 smokers we want to group.
k = 1 + 3.322 log10(37) = 1 + 3.322(1.568) = 6.21 ≈ 6
Note that the value of k has been rounded to the nearest whole number. The answer obtained by Sturges's formula should not be considered final. It is a guide only and may be increased or decreased for convenience and clear presentation. Suppose we decide to use 6 intervals.

4. Determine the size (length or width) of the class interval (w) by dividing the range (R) by the number of class intervals required (k). If the class width is to be a whole number, always increase the result to the next whole number so that the classes cover the data.
w = R/k = 110/6 ≈ 18.3, which is increased to 19 or, for convenience, to 20.

5. Construct a table with three columns, and write the class intervals in the first column. Start the first class interval at the smallest value or a value below it. This value is called the lower class limit.
Example: The smallest values of systolic blood pressure for smokers and nonsmokers are 98 and 92, respectively.
For convenience and for comparison purposes, we begin at 90. Add the class width to this number to get the lower class limit of the next class interval (90 + 20 = 110). The first class interval contains all values from its own lower class limit up to, but not including, the lower class limit of the next interval, i.e., 90, 91, 92, 93, 94, ..., 109. The value 109 is called the upper class limit. Repeat these steps for the second, third, and subsequent intervals until the last class interval.

Notes
Intervals are usually equal in size (20 in our example), which aids comparison between the frequencies of any intervals.
The upper limit of the last interval is either the largest value or a value above it.

6. In the second column, insert a tally mark for each individual observation in the raw data table. Note that the tally column is included simply as an aid for determining the frequencies. It is not a necessary part of a frequency table.

7. Sum the tally marks in each row and record the result in the third column, entitled Frequency (f).

8. Sum the frequency column (n). This serves as a useful check that all data have been included in the table.

Note
Frequency tables should be numbered, include an appropriate descriptive title, specify the units of measurement, and cite the source of the data.

Table 3.2 Frequency Table for Systolic Blood Pressure of Smokers from Table 3.1

Class interval
(Systolic Blood Pressure*)    Tally    f (Frequency)
90-109                                 5
110-129                                15
130-149                                10
150-169                                3
170-189                                2
190-209                                2
Total                                  37

*In millimeters of mercury.

Table 3.3 Frequency Table for Systolic Blood Pressure of Nonsmokers from Table 3.1

Class interval
(Systolic Blood Pressure*)    Tally    f (Frequency)
90-109                                 10
110-129                                24
130-149                                18
150-169                                9
170-189                                2
190-209                                0
Total                                  63

*In millimeters of mercury.

Frequency Tables with class boundaries (true class intervals)

Class boundaries may be used in place of class limits.
Class boundaries are points that demarcate the true upper limit of one class and the true lower limit of the next. Class boundaries are easily obtained from the formula:

Class boundary = (upper limit of one class + lower limit of the next class) / 2

For example, the boundary between 90-109 and 110-129 is (109 + 110)/2 = 109.5.

Example: Determine the class boundaries for the class intervals listed in the table for smokers.

Class interval
(Systolic Blood Pressure*)    Class boundaries    f (Frequency)
90-109                        89.5-109.5          5
110-129                       109.5-129.5         15
130-149                       129.5-149.5         10
150-169                       149.5-169.5         3
170-189                       169.5-189.5         2
190-209                       189.5-209.5         2
Total                                             n = 37

Relative frequency

The relative frequency of any class is obtained by dividing the frequency of that class by the total of all frequencies (the number of observations, or sample size), i.e., f/n.
Example: the relative frequency of the first class (90-109 mmHg) for smokers is 5/37 = 0.14.

Percentage relative frequency (p)

If each relative frequency is multiplied by 100%, we obtain the percentage relative frequency (p), i.e., p = (f/n) × 100.
Example: the percentage relative frequency of the first class (90-109 mmHg) for smokers is (5/37) × 100 = 14%.

Class interval
(Systolic Blood Pressure*)    Frequency    Relative Frequency    Relative Frequency (%)
90-109                        5            0.14                  14
110-129                       15           0.41                  41
130-149                       10           0.27                  27
150-169                       3            0.08                  8
170-189                       2            0.05                  5
190-209                       2            0.05                  5
Total                         37           1                     100

Significance
Relative frequencies are helpful for comparing two sets of data that have different numbers of observations, like our 63 nonsmokers and 37 smokers. For example, the blood pressure range 90-109 mmHg contains 10 (16%) of the nonsmokers and 5 (14%) of the smokers.
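The construction steps and the relative-frequency calculation above can be sketched in Python for the 37 smokers. This is an illustration under stated assumptions: the helper name grouped_frequency is hypothetical, and the starting limit of 90 and the width of 20 are the convenience choices made in the text rather than computed values.

```python
import math

# Systolic blood pressures (mmHg) of the 37 smokers, sorted, from Table 3.1.
smokers = [
    98, 102, 102, 104, 108, 112, 112, 114, 114, 116,
    116, 116, 116, 116, 118, 120, 122, 122, 126, 126,
    130, 134, 136, 138, 140, 140, 140, 140, 142, 146,
    150, 150, 162, 176, 178, 190, 208,
]

n = len(smokers)
data_range = max(smokers) - min(smokers)   # step 2: R = largest - smallest
k = round(1 + 3.322 * math.log10(n))       # step 3: Sturges's formula, rounded
raw_width = data_range / k                 # step 4: R/k before rounding up

# Convenience choices taken from the text (assumptions, not computed here).
start, width = 90, 20


def grouped_frequency(values, start, width, k):
    """Hypothetical helper: count integer values in k equal-width classes."""
    counts = [0] * k
    for v in values:
        counts[(v - start) // width] += 1  # index of the class containing v
    # Each row is (lower class limit, upper class limit, frequency).
    return [(start + i * width, start + (i + 1) * width - 1, c)
            for i, c in enumerate(counts)]


table = grouped_frequency(smokers, start, width, k)
frequencies = [f for _, _, f in table]
relative = [round(f / n, 2) for f in frequencies]   # f/n per class

for (low, high, f), rel in zip(table, relative):
    print(f"{low}-{high}  f={f:<3} f/n={rel:.2f}  p={100 * f / n:.0f}%")
print(f"Total f={sum(frequencies)}")
```

Summing the frequencies at the end mirrors step 8: the total should equal the sample size.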
Example

Class interval                Relative Frequency (%)
(Systolic Blood Pressure*)    Nonsmokers    Smokers
90-109                        16            14
110-129                       38            41
130-149                       29            27
150-169                       14            8
170-189                       3             5
190-209                       0             5

Cumulative percentage relative frequency

This is also known as the cumulative percentage. It shows the percentage of elements lying within and below each class interval. It is computed by accumulating the percentage relative frequencies of the successive class intervals. For nonsmokers (Table 3.4), the cumulative percentage for the first four intervals is 16 + 38 + 29 + 14 = 97%.

Significance: Cumulative percentages allow a rapid comparison of entire frequency distributions, without the need to compare individual class intervals. In our example, 97% of the nonsmokers in the sample have a systolic blood pressure ≤ 169 (i.e., below 169.5). By comparison, 90% of the smokers have a blood pressure below the same level. An alternate way of looking at this is to note that 3% of the nonsmokers and 10% of the smokers have a systolic blood pressure > 169 (i.e., above 169.5).

Table 3.4 Comparison of Systolic Blood Pressure between Smokers and Nonsmokers from Table 3.1

Class interval                Relative Frequency (%)      Cumulative percentage relative Frequency (%)
(Systolic Blood Pressure*)    Nonsmokers    Smokers       Nonsmokers    Smokers
90-109                        16            14            16            14
110-129                       38            41            54            55
130-149                       29            27            83            82
150-169                       14            8             97            90
170-189                       3             5             100           95
190-209                       0             5             100           100

*In millimeters of mercury.

Example 2: The following table gives the hemoglobin levels (g/dl) of a sample of 50 apparently healthy men aged 20-24.
17.0 16.1 15.2 17.4 16.4 13.5 16.8 15.8 17.4 15.9
16.4 16.1 15.9 17.1 17.5 17.8 15.8 18.3 16.4 14.4
13.9 15.9 16.3 16.2 17.3 14.2 14.6 15.1 16.7 16.2
17.3 17.0 16.2 15.0 14.9 17.7 15.5 15.3 16.5 15.3
16.3 15.9 14.0 15.7 16.1 16.1 15.7 15.8 13.7 16.3

Prepare a grouped frequency distribution for these data, using the class intervals:
13.0-13.9, 14.0-14.9, 15.0-15.9, 16.0-16.9, 17.0-17.9, 18.0-18.9

Solution: [frequency table not recovered]
It is very common to define class intervals in this way.
Cumulative frequency [table not recovered]
Determine the boundaries (true class intervals) and midpoint of each class interval. [table not recovered]

Example 3: Find the class boundaries (true class intervals), midpoints, relative frequencies, and cumulative frequencies for the following frequency distribution of age. [table not recovered]

Solution: [not recovered]
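The worked tables for Example 2 are not reproduced in this copy. The following Python sketch computes the requested quantities directly from the 50 hemoglobin values listed in Example 2. It is an illustration, not the original solution: all variable names are assumptions, and the shortcut of using the whole-number part of each reading as the class index only works because every class here spans exactly one whole number (13.0-13.9, 14.0-14.9, and so on).

```python
import math
from itertools import accumulate

# Hemoglobin levels (g/dl) of the 50 men, copied from Example 2.
hemoglobin = [
    17.0, 16.1, 15.2, 17.4, 16.4, 13.5, 16.8, 15.8, 17.4, 15.9,
    16.4, 16.1, 15.9, 17.1, 17.5, 17.8, 15.8, 18.3, 16.4, 14.4,
    13.9, 15.9, 16.3, 16.2, 17.3, 14.2, 14.6, 15.1, 16.7, 16.2,
    17.3, 17.0, 16.2, 15.0, 14.9, 17.7, 15.5, 15.3, 16.5, 15.3,
    16.3, 15.9, 14.0, 15.7, 16.1, 16.1, 15.7, 15.8, 13.7, 16.3,
]

# Lower class limits given in the exercise; each upper limit is lower + 0.9.
lower_limits = [13.0, 14.0, 15.0, 16.0, 17.0, 18.0]

# Count the values in each class: the integer part of a reading picks its class.
frequencies = [0] * len(lower_limits)
for x in hemoglobin:
    frequencies[math.floor(x) - 13] += 1

n = len(hemoglobin)
cumulative = list(accumulate(frequencies))   # running total of frequencies

print("limits       boundaries     midpoint  f   f/n   cum. f")
for low, f, cum in zip(lower_limits, frequencies, cumulative):
    high = low + 0.9
    # A boundary is halfway between adjacent class limits (limits differ by 0.1).
    lower_boundary, upper_boundary = low - 0.05, high + 0.05
    midpoint = (low + high) / 2
    print(f"{low:.1f}-{high:.1f}    {lower_boundary:.2f}-{upper_boundary:.2f}"
          f"    {midpoint:.2f}     {f:<3} {f / n:.2f}  {cum}")
```

The last cumulative frequency should equal n, which gives the same check as summing the frequency column.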