Chap2.stat.doc

Statistics
Psych - 3301
CHAPTER 2: DEPICTING THE DATA
So that observations about the population of interest might be made,
measures of certain characteristics are taken and collected into sets.
These sets are the data that will be analyzed. The data sets themselves
are often large and unwieldy, so methods of depicting them at a glance
have been developed. Representing data sets in graphs, charts or
tables, then, is one way of organizing and summarizing the data. If
sets are presented in terms of the way they fall across a scale (what
numbers can be in the set?), they are called DISTRIBUTIONS. If sets are
presented in terms of how many cases of each number exist, they are
called FREQUENCIES.
DISTRIBUTIONS: A distribution is a set of numbers, generally
depicted in a way that makes the possible types of members
apparent (Are there twos or fives? What numbers are possible in
this set?).
FREQUENCY DISTRIBUTIONS: When distributions are depicted in
terms of counts of each type of member, they are called
FREQUENCY DISTRIBUTIONS.
CHART or TABLE: The manner of presenting the set of data.
COLUMNS and ROWS: A column for the data, and a column
for the frequency. The frequency for a given number is
on the same row, so that:
X f
4 2
3 0
2 4
1 4
The data set, represented by the label X, depicted above is
actually X = 1, 1, 1, 1, 2, 2, 2, 2, 4, 4. Notice that the table takes
less space on the page than the full list. Imagine a set with 10,000
members. The table organizes and summarizes such a set in a way that
renders it intelligible. Indeed, sets which could cover an entire wall
when members are listed one at a time can be depicted on a single page.
Notice further that it is appropriate to point out when there are no
members in the distribution at a particular point on the scale. In the
above example, there happen to be no threes. By including the three in
the X column, the scale remains complete and the reader is assured that
any potential threes were not overlooked. The number of members in a
distribution (N) can be determined by adding the f column. In the above
example, N = 10. The notation is N = Σf. You cannot determine the sum
of a set by adding the X column, except in those cases in which exactly
one of each possible number on the scale occurs.
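The table above can be rebuilt from the raw set with a few lines of Python. This is only a minimal sketch, assuming integer scores; the function name `frequency_table` is made up for illustration and is not part of the course materials. It keeps the empty row for 3 and shows that N comes from the f column, while the sum of the set needs each X weighted by its f.

```python
from collections import Counter

def frequency_table(data):
    """Return (X, f) rows from the highest score to the lowest.

    Scores on the scale that never occur are kept with f = 0.
    Assumes the data are integers.
    """
    counts = Counter(data)
    return [(x, counts[x]) for x in range(max(data), min(data) - 1, -1)]

X = [1, 1, 1, 1, 2, 2, 2, 2, 4, 4]
table = frequency_table(X)

print("X f")
for x, f in table:
    print(x, f)

N = sum(f for _, f in table)          # N is the sum of the f column
total = sum(x * f for x, f in table)  # the sum of the set weights each X by its f
print("N =", N)
print("sum of set =", total)
```

Adding the X column alone would give 4 + 3 + 2 + 1 = 10, which is not the sum of the set.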
Additional columns can be added to the table, providing even more
organization to the distribution. For each extra column, the type of
the distribution grows more specific.
Types of distributions:
FREQUENCY DISTRIBUTION (FD): A set (X) and a frequency (f).
X f
4 2
3 0
2 4
1 4
GROUPED FREQUENCY DISTRIBUTION (GFD): A set (X) in which
intervals of possible members are presented by frequency
(f). Here, the number of rows required to depict the
entire set can be reduced, so that an enormous data set can
be depicted on a single page. However, grouped
distributions lose some detail, as the original raw data is
no longer observable. If a set has 20 members on the
interval 10-19, how can the observer determine if they are
20 tens or not?
X      f
40-49  2
30-39  0
20-29  4
10-19  4
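As a rough sketch of how raw scores collapse into intervals, the Python below groups a small set into decades. The raw scores are hypothetical, chosen only so that the counts match the table above; the handout does not give the raw data behind it. The function name `grouped_frequency_table` is likewise an illustration, and the sketch assumes non-negative integer scores with intervals starting at multiples of the width.

```python
from collections import Counter

def grouped_frequency_table(data, width=10):
    """Return (interval label, f) rows from the highest interval to the lowest.

    Assumes non-negative integers and intervals that begin at multiples
    of `width` (0-9, 10-19, ...). Empty intervals are kept with f = 0.
    """
    counts = Counter(x // width for x in data)
    top, bottom = max(counts), min(counts)
    rows = []
    for k in range(top, bottom - 1, -1):
        label = f"{k * width}-{k * width + width - 1}"
        rows.append((label, counts[k]))
    return rows

# Hypothetical raw scores, invented to reproduce the f column above.
scores = [10, 12, 15, 19, 21, 23, 24, 28, 40, 45]
grouped = grouped_frequency_table(scores)

for label, f in grouped:
    print(label, f)
```

Once the rows are printed, the individual scores cannot be recovered from them, which is the loss of detail described above.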
CUMULATIVE GROUPED FREQUENCY DISTRIBUTION (CGFD): A set
(X) in which the frequencies of each interval and of all
lower intervals are accumulated in an additional column
called cumulative frequency (cf).
X      f  cf
40-49  2  10
30-39  0  8
20-29  4  8
10-19  4  4
CUMULATIVE GROUPED PERCENTILE FREQUENCY DISTRIBUTION (CGPFD):
A set (X) in which the frequency of each interval is also
expressed as a percent of N (%). The percentage may, in
turn, be accumulated in yet another column (c%).
X      f  cf  %   c%
40-49  2  10  20  100
30-39  0  8   0   80
20-29  4  8   40  80
10-19  4  4   40  40
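The extra columns follow mechanically from X and f. The Python sketch below is one way to derive cf, % and c% for the table above; the function name `cumulative_table` is hypothetical. It accumulates from the lowest interval upward, since each cf counts its own interval and everything below it.

```python
def cumulative_table(rows):
    """rows: (label, f) pairs listed from the highest interval to the lowest.

    Returns (label, f, cf, percent, cumulative percent) in the same order.
    """
    n = sum(f for _, f in rows)
    out = []
    running = 0
    for label, f in reversed(rows):      # start at the lowest interval
        running += f                     # cf: this interval plus all below it
        out.append((label, f, running, 100 * f / n, 100 * running / n))
    out.reverse()                        # back to highest-first order
    return out

rows = [("40-49", 2), ("30-39", 0), ("20-29", 4), ("10-19", 4)]
table = cumulative_table(rows)

print("X f cf % c%")
for label, f, cf, pct, cpct in table:
    print(label, f, cf, pct, cpct)
```

The top row of cf always equals N, and the top row of c% is always 100, which makes a quick check on the arithmetic.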
GRAPHS:
Distributions can be depicted pictorially, rendering them concrete.
These pictures, or GRAPHS, should be fitted to the scale of the data
set. Just as the scale of the data may be ignored in practice, the
appropriateness of the graph may be obscured by the limits of the
graphics package of a researcher's computer program.
Still, it is helpful to note these distinctions when possible.
BAR GRAPH: A graph composed of distinct bars
or lines for each interval of the data. The height
of each bar indicates the frequency of that interval of data.
These are ideal for nominal data.
HISTOGRAMS: The bars are touching, indicating
continuity of scale. In this way, the rank or order
of intervals is depicted. Due to the limits of dot
matrix printers, this is the most common form of graph
produced by personal computers.
POLYGON: Instead of a bar's height indicating the
frequency of an interval of data, just a dot is placed at
that height, and the dots are joined by a line. So
streamlined is the FREQUENCY POLYGON that multiple
distributions can be depicted on a single graph. The word
polygon itself means many angles.
OGIVE: Frequency polygons can be cumulative. This
is helpful when noting additive impacts, such as total
growth rates, or total losses, such as in the case of
epidemics.
STEM AND LEAF: This graph is the only picture drawn with the
original data set. The raw data is stacked within columns defined
by some interval. The interval is defined by some range within the
data. If the columns are arranged in decades (tens), then the
tens digit defines the interval. In the case of 40, four is
the interval or stem. The zero is its leaf. In the case of 45,
the stem is still 4, but the leaf is five. In the case of 52,
the stem is 5 and the leaf is 2. What develops is a distribution
with a shape or curve, just as in the case of a polygon.
However, instead of just a simple line, the observer can still
see the original data set. No information is lost.
Unfortunately, stem and leaf graphs can only be used for data
sets of limited size, due to the physical limit of space on the
page. Consider the set X = 4*, 10, 19, 21, 23, 24, 33, 36, 37,
40, 45, 46, 55, 58, 63. This would appear as:
    4 7 6
  9 3 6 5 8
4 0 1 3 0 5 3
_____________
0 1 2 3 4 5 6
*Note: the first decade includes the integers 0 through 9.
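For readers who want to check the picture, here is a minimal Python sketch that stacks the leaves above their stems in the same column layout. The function name `stem_and_leaf_columns` is made up for this illustration, and the sketch assumes integer scores from 0 through 99 so that every stem is a single digit.

```python
from collections import defaultdict

def stem_and_leaf_columns(data):
    """Draw leaves stacked upward over their stems (tens digits).

    Assumes integers from 0 to 99, so each stem and leaf is one digit.
    """
    leaves = defaultdict(list)
    for x in sorted(data):
        leaves[x // 10].append(x % 10)   # stem = tens digit, leaf = ones digit
    stems = range(min(leaves), max(leaves) + 1)
    height = max(len(v) for v in leaves.values())

    lines = []
    for level in range(height - 1, -1, -1):   # top row of the stack first
        cells = [
            str(leaves[s][level]) if level < len(leaves[s]) else " "
            for s in stems
        ]
        lines.append(" ".join(cells).rstrip())
    lines.append("_" * (2 * len(stems) - 1))  # rule between leaves and stems
    lines.append(" ".join(str(s) for s in stems))
    return "\n".join(lines)

X = [4, 10, 19, 21, 23, 24, 33, 36, 37, 40, 45, 46, 55, 58, 63]
plot = stem_and_leaf_columns(X)
print(plot)
```

Each column can be read back into the original scores (stem 3 with leaves 3, 6, 7 gives 33, 36, 37), which is why no information is lost.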
INTERPOLATION:
To calculate the interposition of a score within an interval is to
INTERPOLATE. But why would you want to? One of the values of compiling
data sets into grouped percentile frequencies is to note the placement
of a score within the distribution. This can be achieved by determining
the percentile rank of a score. If a score is at the 50th percentile,
then 50% of the distribution is below it, or lesser in value.
Conversely, if a particular percentile rank is the initial interest,
the score at that rank, or position, can be determined after the fact.
In the case of grouped frequencies, the exact placement of a specific
score must be approximated. This is done with INTERPOLATION. Let’s
consider percentiles, percentile ranks and the process of interpolation
separately.
PERCENTILE RANK: The rank of a score, as determined by the
percentage of the distribution that lies below it.
PERCENTILE: The score located at a given percentile rank (the
rank is noted first, and the score is then found).
Consider the table:
X  f  cf  %   c%
5  1  10  10  100
4  2  9   20  90
3  4  7   40  70
2  2  3   20  30
1  1  1   10  10
When a data set is depicted in a percentile distribution, it is
treated as a continuous scale. To read the table, then, ABSOLUTE LIMITS
must be applied. The absolute upper limit of the number five is 5.5 and
its absolute lower limit is 4.5. The integer 5, then, can be thought
of as a continuous range from 4.5 to 5.5. All potential members of the
distribution within that range are counted as a five in the
frequency column. More importantly, because a cumulative percentage
must accumulate all possible members up to a point, regardless of the
number of decimal places, the percentile rank shown on a row
necessarily belongs to the upper limit of that row's integer.
DETERMINING PERCENTILES: In the above table, the 100th%
corresponds exactly to the percentile (score) 5.5. The score at
the 70th% is 3.5. The percentile for a given percentile rank shown
in the c% column is found by simply taking the upper limit of the
score on the same line.
DETERMINING PERCENTILE RANKS: If the initial interest is a given
score, its percentile rank, or position, can be determined by
noting the percentile rank on the same line as the starting
score. If one starts with a percentile of 2.5, the percentile
rank is the 30th%. Note that this was easy because 2.5 is an
upper limit.
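The pairing of upper limits with the c% column can be sketched in Python as a simple lookup. The function name `upper_limit_ranks` is hypothetical, and the sketch assumes whole-number scores, so that each upper limit is the score plus 0.5.

```python
def upper_limit_ranks(rows):
    """rows: (score, f) pairs listed from the highest score to the lowest.

    Returns a dict mapping each score's upper limit to its cumulative
    percent. Assumes whole-number scores (upper limit = score + 0.5).
    """
    n = sum(f for _, f in rows)
    ranks = {}
    running = 0
    for score, f in reversed(rows):      # accumulate from the lowest score up
        running += f
        ranks[score + 0.5] = 100 * running / n
    return ranks

rows = [(5, 1), (4, 2), (3, 4), (2, 2), (1, 1)]
ranks = upper_limit_ranks(rows)

print(ranks[2.5])   # percentile rank of the score 2.5
print(ranks[3.5])   # percentile rank of the score 3.5

# Reading the table the other way: which score sits at the 70th%?
score_at_70 = next(limit for limit, rank in ranks.items() if rank == 70)
print(score_at_70)
```

Only the five upper limits appear in this lookup. Any other score or rank falls between two rows and has to be interpolated.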
INTERPOLATION: When determining a percentile rank for a score
which is not an upper limit, or when determining a percentile
for a percentile rank which is not depicted in the c% column,
one must interpolate. To find the appropriate position within
one column, one must travel the same proportional distance that
was traveled in the original column. If the starting point is
the percentile 2 (exactly 2 is 2.0), then its distance within the
score interval must be matched within the c% column. The integer
2 is exactly midway within the 2 interval (1.5 to 2.5). The
percentile rank for 2 is therefore midway within the
corresponding percentile rank interval: in this instance, the
10th to the 30th%. The midpoint is the 20th percentile rank,
because 20 is equidistant from 10 and 30 (ten points either
way). As this is seldom done by hand today, you will not be
expected to calculate it for our course.
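The same reasoning can be written as a short Python sketch of linear interpolation from a score to its percentile rank. The function name `percentile_rank` is made up for illustration. The list of points pairs each upper limit with its c% from the table; the first point, (0.5, 0), is an added assumption that nothing in the distribution falls below the lower limit of 1.

```python
def percentile_rank(score, points):
    """Linearly interpolate a percentile rank for `score`.

    points: (limit, cumulative percent) pairs sorted from low to high.
    """
    for (lo_x, lo_p), (hi_x, hi_p) in zip(points, points[1:]):
        if lo_x <= score <= hi_x:
            # How far into the score interval are we (0 = bottom, 1 = top)?
            fraction = (score - lo_x) / (hi_x - lo_x)
            # Travel the same fraction of the way through the rank interval.
            return lo_p + fraction * (hi_p - lo_p)
    raise ValueError("score lies outside the table")

# Upper limits and c% from the table; (0.5, 0) is an assumed starting point.
points = [(0.5, 0), (1.5, 10), (2.5, 30), (3.5, 70), (4.5, 90), (5.5, 100)]

rank_of_2 = percentile_rank(2, points)
print(rank_of_2)
```

The score 2 sits halfway between 1.5 and 2.5, so the function returns the rank halfway between 10 and 30.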
In interpolating the appropriate percentile for a given percentile
rank, the process is reversed. For example, to determine the exact
score for the 60th percentile rank, which does not appear on the table,
one must first locate the 60th rank. It is located between the 30th and
70th ranks, an interval of 40 percentage points. As the 60th% is 10
points below the 70th%, in a space covering 40 percentage points, the
60th% is 1/4th of the way down from the top of that percentage range.
The percentile for the 60th% must lie the same fraction of the way
below the percentile for the 70th%. So, an equivalent range must be
identified. The percentile for the 70th% is 3.5 and the percentile for
the 30th% is 2.5. The range of percentiles (2.5 to 3.5) is equivalent
to the range of ranks (30th to 70th%). The distance within the ranks is
40%, as 70 - 30 = 40. The distance within the equivalent range for the
percentiles is 1, as 3.5 - 2.5 = 1. Since 60 is 1/4th of the way down
within the range of 40 percentage points, the percentile is 1/4th of
the way down within the range of 1 score point, or .25 points. This
means the percentile for the 60th% is equal to 3.5 - .25 = 3.25. This
is your answer. A percentile of 3.25 is at the 60th%. Notice that
interpolation is really a way of translating scales. In this case, we
went from a scale based upon 100ths to a scale based upon the raw data.
Can you see that the rank for a percentile of 3 is the 50th, and
that the percentile for the 80th% is exactly 4?
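The worked example can be checked with a short Python sketch that travels down from the top of the interval, as the text does. The function name `percentile_for_rank` is hypothetical, and the points are the upper limits paired with the c% column of the table. Treat it as an illustration of the arithmetic, not a substitute for working it by hand.

```python
def percentile_for_rank(rank, points):
    """Linearly interpolate the score (percentile) at a percentile rank.

    points: (upper limit, cumulative percent) pairs sorted from low to high.
    """
    for (lo_x, lo_p), (hi_x, hi_p) in zip(points, points[1:]):
        if lo_p <= rank <= hi_p and hi_p > lo_p:
            # Fraction of the way DOWN from the top of the rank interval.
            fraction_down = (hi_p - rank) / (hi_p - lo_p)
            # Go the same fraction down from the top of the score interval.
            return hi_x - fraction_down * (hi_x - lo_x)
    raise ValueError("rank lies outside the table")

# Upper limits and c% values from the table.
points = [(1.5, 10), (2.5, 30), (3.5, 70), (4.5, 90), (5.5, 100)]

p60 = percentile_for_rank(60, points)   # 10/40 = 1/4 down from 3.5
p80 = percentile_for_rank(80, points)   # 10/20 = 1/2 down from 4.5
print(p60, p80)
```

For the 60th%, the fraction down is 10/40, so the result is 3.5 - .25, matching the hand calculation.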
Practice at traveling the same proportional distance in the column you
are going to as you traveled in the column you are starting from will
make this seem less foreign. Numerous sample questions are provided at
the end of chapter two in the text, and in the workbook as well.
CAUTION hackers. Please resist the temptation to solve this with a
'push of the button'. If you never get your 'hands in the data' you
will never get a feel for what you have completed. Once you get a
'feel' for what you are doing, simplifying your work with a computer
program will come easily.