Statistics Psych - 3301

CHAPTER 2: DEPICTING THE DATA

In order that observations about the population of interest might be made, measures of certain characteristics are taken and collected into sets. These sets are the data that will be analyzed. The data sets themselves are often large and unwieldy, so methods of depicting them at a glance have been developed. Representing data sets in graphs, charts or tables, then, is one way of organizing and summarizing the data. If sets are presented in terms of the way they fall across a scale (what numbers can be in the set), they are called DISTRIBUTIONS. If sets are presented in terms of how many cases of each number exist, they are called FREQUENCIES.

DISTRIBUTIONS: A distribution is a set of numbers, generally depicted in a way that makes the possible types of members apparent (Are there twos or fives? What numbers are possible in this set?).

FREQUENCY DISTRIBUTIONS: When distributions are depicted in terms of counts of each type of member, they are called FREQUENCY DISTRIBUTIONS.

CHART or TABLE: The manner of presenting the set of data.

COLUMNS and ROWS: A column for the data, and a column for the frequency. The frequency for a given number is on the same row, so that:

    X    f
    4    2
    3    0
    2    4
    1    4

The data set depicted above, represented by the label X, is actually X = 1, 1, 1, 1, 2, 2, 2, 2, 4, 4. Notice the set has been reduced in terms of how much space is required to depict it on a page. Imagine a set with 10,000 members. A table organizes and summarizes such a set so that it is rendered intelligible. Indeed, sets which could cover an entire wall when members are listed one at a time can be depicted on a single page. Notice further, it is appropriate to point out when there are no members in the distribution at a particular point on the scale. In the above example, there happen to be no threes.
By including the three in the X column, the scale remains complete and the reader is assured any potential threes were not overlooked. The number of members in a distribution (N) can be determined by adding the f column. In the above example, N = 10. The notation would appear N = Σf. You cannot determine the sum of a set by adding the X column, except in those cases in which exactly one of each possible number on the scale occurs; otherwise each value of X must first be multiplied by its frequency. Additional columns can be added to the table, providing even more organization to the distribution. For each extra column, the type of the distribution grows more specific.

Types of distributions:

FREQUENCY DISTRIBUTION (FD): A set (X) and a frequency (f).

    X    f
    4    2
    3    0
    2    4
    1    4

GROUPED FREQUENCY DISTRIBUTION (GFD): A set (X) in which intervals of possible members are presented by frequency (f). Here, the number of rows required to depict the entire set can be reduced, so that an enormous data set can be depicted on a single page. However, grouped distributions lose some detail, as the original raw data is no longer observable. If a set has 20 members on the interval 10-19, how can the observer determine if they are 20 tens or not?

    X        f
    40-49    2
    30-39    0
    20-29    4
    10-19    4

CUMULATIVE GROUPED FREQUENCY DISTRIBUTION (CGFD): A set (X) in which the frequency of each interval is accumulated up to the point of that interval or lower, and shown in an additional column called cumulative frequency (cf).

    X        f    cf
    40-49    2    10
    30-39    0     8
    20-29    4     8
    10-19    4     4

CUMULATIVE GROUPED PERCENTILE FREQUENCY DISTRIBUTION (CGPFD): A set (X) in which the frequency of each interval is also depicted as a percent of N (%). In fact, the percentage may be accumulated in yet another column (c%), as well.

    X        f    cf     %     c%
    40-49    2    10    20    100
    30-39    0     8     0     80
    20-29    4     8    40     80
    10-19    4     4    40     40

GRAPHS: Distributions can be depicted pictorially, rendering them concrete. These pictures, or GRAPHS, should be fitted to the scale of the data set.
Just as the scale of the data may be ignored in practice, the appropriateness of the graph may be obscured by the limits of the graphics package of a researcher's computer program. Still, it is helpful to note these distinctions, when possible.

BAR GRAPH: A graph composed of distinct bars or lines for each interval of the data. The height of each bar indicates the frequency of that interval of data. These are ideal for nominal data.

HISTOGRAMS: The bars are touching, indicating continuity of scale. In this way, the rank or order of intervals is depicted. Due to the limits of dot matrix printers, this is the most common form of graph produced by personal computers.

POLYGON: Instead of a bar's height indicating the frequency of an interval of data, just a dot is placed at that height, and the dots are connected by lines. So streamlined is the FREQUENCY POLYGON that multiple distributions can be depicted on a single graph. (The word polygon itself means "many angles," which describes the shape the connected dots trace.)

OGIVE: Frequency polygons can be cumulative. This is helpful when noting additive impacts, such as total growth rates, or total losses, such as in the case of epidemics.

STEM AND LEAF: This graph is the only picture drawn with the original data set. The raw data are stacked within columns defined by some interval. The interval is defined by some range within the data. If the columns are arranged in decades (tens), then the tens digit defines the interval. In the case of 40, four is the interval or stem. The zero is its leaf. In the case of 45, the stem is still 4, but the leaf is 5. In the case of 52, the stem is 5 and the leaf is 2. What develops is a distribution with a shape or curve, just as in the case of a polygon. However, instead of just a simple line, the observer can still see the original data set. No information is lost. Unfortunately, stem and leaf graphs can only be used for data sets of limited size, due to the physical limit of space on the page. Consider the set X = 4*, 10, 19, 21, 23, 24, 33, 36, 37, 40, 45, 46, 55, 58, 63.
This would appear as:

          4  7  6
       9  3  6  5  8
    4  0  1  3  0  5  3
    ___________________
    0  1  2  3  4  5  6

*Note: the first decade includes the integers 0 through 9.

INTERPOLATION: To calculate the interposition of a score within an interval is to INTERPOLATE. But why would you want to? One of the values of compiling data sets into grouped percentile frequencies is to note the placement of a score within the distribution. This can be achieved by determining the percentile rank of a score. If a score is at the 50th percentile rank, then 50% of the distribution is below it, or lesser in value. Conversely, if a particular percentile rank is the initial interest, the score at that rank, or position, can be determined after the fact. In the case of grouped frequencies, the exact placement of a specific score must be approximated. This is done with INTERPOLATION. Let's consider percentiles, percentile ranks and the process of interpolation, separately.

PERCENTILE RANKS: The rank of a score as determined by the percentage of the distribution that lies below it.

PERCENTILE: The score that is located when the rank is noted first.

Consider the table:

    X    f    cf     %     c%
    5    1    10    10    100
    4    2     9    20     90
    3    4     7    40     70
    2    2     3    20     30
    1    1     1    10     10

When a data set is depicted in a percentile distribution, it is treated as a continuous scale. To read the table, then, ABSOLUTE LIMITS must be applied. The absolute upper limit of the number five is 5.5 and its absolute lower limit is 4.5. The integer 5, then, can be thought of as a continuous range from 4.5 to 5.5. All potential members of the distribution within that range can be counted as a five in the frequency column. More importantly, since a cumulative percentage must accumulate all possible members up to a point, regardless of the number of decimal places, a percentile rank in the c% column necessarily belongs to the upper limit of that integer.

DETERMINING PERCENTILES: In the above table, the 100th% is exactly equal to the percentile (score) 5.5. The score at the 70th% is 3.5.
The percentile for a given percentile rank depicted in the c% column is determined by simply finding the upper limit of the score on the same line.

DETERMINING PERCENTILE RANKS: If the initial interest is a given score, its percentile rank, or position, can be determined by noting the percentile rank on the same line as the starting score. If one starts with a percentile of 2.5, the percentile rank is the 30th%. Note that this was easy because 2.5 is an upper limit.

INTERPOLATION: When determining a percentile rank for a score which is not an upper limit, or when determining a percentile for a percentile rank which is not depicted in the c% column, one must interpolate. To find the appropriate position within one column, one must travel the same proportional distance as was traveled in the starting column. If the starting point is the percentile 2 (exactly 2 is 2.0), then the distance must be matched within the c% column. The integer 2 is exactly midway within the 2 interval (1.5 to 2.5). The percentile rank for 2 is therefore midway within the corresponding percentile rank interval: in this instance, the 10th to the 30th%. The midpoint is the 20th percentile rank, because 20 is equidistant from 10 and 30 (ten points either way). This is seldom done by hand today, and you will not be expected to calculate it for our course.

In interpolating the appropriate percentile for a given percentile rank, the process is reversed. For example, to determine the exact score for the 60th percentile rank, which does not appear on the table, one must first locate the 60th rank. It is located between the 30th and 70th ranks, an interval of 40 percentage points. As the 60th% is 10 points below the 70th%, in a space covering 40 percentage points, the 60th% is 1/4th of the way down from the top of that percentage range. This is how far the percentile for the 60th% must lie below the percentile for the 70th%. So, an equivalent range of scores must be identified.
The percentile for the 70th% is 3.5 and the percentile for the 30th% is 2.5. The range of percentiles (2.5 to 3.5) is equivalent to the range of ranks (30th to 70th%). The distance within the ranks is 40 percentage points, as 70 - 30 = 40. The distance within the equivalent range for the percentiles is 1, as 3.5 - 2.5 = 1. Since 60 is 1/4th of the way down within the range of 40 percentage points, the percentile is 1/4th of the way down within the range of 1 score point, or .25 points. This means the percentile for the 60th% is equal to 3.5 - .25 = 3.25. This is your answer. A percentile of 3.25 is at the 60th%.

Notice that interpolation is really a way of translating scales. In this case, we went from a scale based upon 100ths to a scale based upon the raw data. Can you see that the rank for a percentile of 3 is the 50th, and that the percentile for the 80th% is exactly 4? Practice at traveling the same proportional distance in the column you are going to as you traveled in the column you are starting from will make this seem less foreign. Numerous sample questions are provided at the end of chapter two in the text, and in the workbook as well.

CAUTION hackers. Please resist the temptation to solve this with a 'push of the button'. If you never get your 'hands in the data' you will never get a feel for what you have completed. Once you get a 'feel' for what you are doing, simplifying your work with a computer program will come easily.
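
Once the hand method feels familiar, the worked answers above can be checked with a short program. What follows is only a minimal sketch in Python, not a method taken from the text or workbook. The function names (cumulative_points, percentile_rank, percentile) are hypothetical, and the sketch assumes integer scores with an interval width of 1 and a starting point of 0% at the lowest lower limit (0.5 in the example table).

```python
# Sketch only: linear interpolation between upper limits and cumulative percents.
# The scores and frequencies are those of the chapter's example table.
scores = [1, 2, 3, 4, 5]
freqs = [1, 2, 4, 2, 1]


def cumulative_points(scores, freqs, width=1.0):
    """Pair each upper limit with its cumulative percent (c%).

    Assumption: the list starts at the lowest lower limit with 0%.
    """
    n = sum(freqs)  # N is the sum of the f column
    points = [(scores[0] - width / 2, 0.0)]
    running = 0
    for x, f in zip(scores, freqs):
        running += f  # cumulative frequency (cf)
        points.append((x + width / 2, 100.0 * running / n))
    return points


def percentile_rank(score, points):
    """Score -> percentile rank: travel the same fraction of the c% interval."""
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if x0 <= score <= x1:
            return p0 + (score - x0) / (x1 - x0) * (p1 - p0)
    raise ValueError("score lies outside the limits of the table")


def percentile(rank, points):
    """Percentile rank -> score: the same process, reversed."""
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if p1 > p0 and p0 <= rank <= p1:  # skip intervals with no members
            return x0 + (rank - p0) / (p1 - p0) * (x1 - x0)
    raise ValueError("rank must lie between 0 and 100")


points = cumulative_points(scores, freqs)
print(percentile_rank(2.0, points))  # rank of the score 2
print(percentile(60, points))        # score at the 60th%
```

Comparing the printed values with the hand calculations above (the rank for a percentile of 2, and the percentile for the 60th%) is a reasonable way to confirm that the same distance was traveled in both columns.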