Uploaded by Simon Blamo

01. Displaying Dist'ns of Data

Displaying Distributions of Data
When collecting data, it is important to keep in mind the W’s—Who is being
observed; What precisely is being observed, and in what units; When the data was
collected; Where the data was collected; Why the data was collected, and How the
data was collected. Remember that it is possible to demonstrate just about anything
by improperly using statistics, and the first step in determining whether things were done
properly is to answer these questions. If any of them cannot be answered, then
you should be suspicious of any conclusions reached from the data, especially if the
persons collecting the data have some type of agenda.
Terminology
The characteristic being observed is called a variable. You can think of the variable
itself as saying in words what exactly is being observed, counted, or measured.
Variables are denoted by capital letters—usually X, Y, or Z. (The letter Z is
reserved for a very specific variable that we will encounter in a couple of weeks.)
Each individual in the sample will have a value of the variable which is the category
or number of the variable for that individual; it is possible that each individual will
have a different value. Values of variables are denoted by lower case letters, using
the same letter as the variable—x, y, or z.
The word variable is very often misused by introductory statistics students. In
particular, they use the word “variable” when they mean “value” or “observation”.
Be very careful when you use the word variable, because I will penalize you if you
use it incorrectly.
For example, suppose the individuals being observed are people. When you look at
an individual person, what value should you record for him/her? The variable tells
you what to record—their gender, their height, their GPA, etc. If the variable X =
gender, then the possible values of this variable are “male” and “female”. If the
variable is X = height, then the possible values are any number between, say, 1 foot
and 8 feet. Each individual person will have a potentially different value for their
height.
Two Types of Variables
There are two types of variables. A variable such as gender where the possible
values are words as opposed to numbers is called a Categorical or Qualitative
variable. A variable such as height where the possible values are numbers (for
which arithmetic makes sense) is called a Numerical or Quantitative variable.
The “arithmetic makes sense” is important in distinguishing a numerical variable.
Values of any categorical variable can always be converted to numbers—for
example, for the variable “gender”, you can let male = 1 and female = 2, but it
doesn’t make sense to say that females are twice as much as males. Furthermore,
some categorical variables have values that are numbers initially. For example, for
the variable X = ZIP code, the values are all numbers, but it would not make sense
to say that my ZIP code is 257 more than yours.
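Here is a minimal sketch of why arithmetic on category codes is not meaningful. The sample and both coding schemes are made up for illustration; the point is only that the "average" changes when the arbitrary coding changes, while the counts per category do not.

```python
from collections import Counter

# Hypothetical sample of the categorical variable X = gender.
sample = ["male", "female", "female", "male", "female"]

# Two equally arbitrary ways to convert the categories to numbers.
coding_a = {"male": 1, "female": 2}
coding_b = {"male": 2, "female": 1}


def mean_of_codes(values, coding):
    """Average the numeric codes assigned to each categorical value."""
    codes = [coding[v] for v in values]
    return sum(codes) / len(codes)


# The "average" depends entirely on which arbitrary coding was chosen ...
print(mean_of_codes(sample, coding_a))  # 1.6
print(mean_of_codes(sample, coding_b))  # 1.4

# ... while the counts per category do not depend on any coding at all.
print(Counter(sample))
```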
Some variables fall somewhere between being categorical and numerical. For
example, suppose you are rating something on a scale of 1-5, where 1 is bad and 5
is good. It may not be true that 2 is twice as much as 1, but it is higher than 1 in its
level of goodness. Such variables are called “ordinal” and could potentially be
treated as either categorical or numerical, depending on what you want to do with
them.
Displaying Distributions of Variables
The distribution of a variable gives you two pieces of information—the possible
values of the variable, and some representation of how often these values occur.
The latter can be represented with frequencies, relative frequencies, percents, or
probabilities. The distribution can be represented in tabular form, where you
actually list the possible values along with their frequencies, or they can be
represented with a picture. Later we will see that they can also be represented by
formulas.
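The tabular form described above can be sketched in a few lines of Python. This is only an illustration: the function name distribution_table and the class-year data are made up, not taken from the textbook.

```python
from collections import Counter


def distribution_table(values):
    """Return rows of (value, frequency, relative frequency, percent)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    for value, freq in counts.items():
        rel = freq / n  # relative frequency = frequency / sample size
        rows.append((value, freq, rel, 100 * rel))
    return rows


# Hypothetical observations of the categorical variable X = class year.
data = ["freshman", "senior", "junior", "freshman", "sophomore",
        "freshman", "junior", "senior", "freshman", "junior"]

for value, freq, rel, pct in distribution_table(data):
    print(f"{value:<10} {freq:>3} {rel:>6.2f} {pct:>6.1f}%")
```

Whichever column you report, the frequencies add up to the sample size, the relative frequencies to 1, and the percents to 100.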
Any picture of a distribution, whether the variable is categorical or numerical,
should obey the Area Principle. When we look at a picture, our eye compares
relative areas in the picture, and the bigger the area, the more likely our brain
interprets that value as being. The Area Principle states that the areas for values in
the picture must be proportional to the percent of observations which take on that
value in the population (or sample). Another way to put it is that each individual in
the sample receives the same area in the picture.
Categorical Variables
There are many different types of pictures that can be drawn of distributions, but the
two most common are the bar chart and pie chart.
In a bar chart, the categories of the variable are listed on the horizontal axis, and the
vertical axis is labeled with either the frequency, relative frequency, or percent. A
bar is drawn over each category whose height is equal to the frequency (or relative
frequency or percent). The bars should be the same width, and they should not
touch. The order that the categories are listed in does not matter. Because of this,
there is no shape associated with a categorical variable—if you list the categories in
a different order, the overall shape of the picture will change. The most common
ordering is to list them in order of decreasing magnitude, which is called a Pareto
Chart. Space is put between the bars to indicate to the eye that each bar is not in
any way related to the ones on either side of it.
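As a rough sketch, the Pareto ordering can be produced by sorting the categories by decreasing frequency. The data and function names below are made up, and the text "chart" is rotated (bars run sideways instead of up) simply because that is easy to print. Each observation gets one "#" of the same size, which is the Area Principle in miniature.

```python
from collections import Counter


def pareto_order(values):
    """Return (category, frequency) pairs in order of decreasing frequency."""
    # most_common() sorts by count, largest first.
    return Counter(values).most_common()


def text_bar_chart(pairs):
    """One row per category; every observation contributes one '#'."""
    width = max(len(str(cat)) for cat, _ in pairs)
    return [f"{str(cat):<{width}} | {'#' * freq}" for cat, freq in pairs]


# Hypothetical observations of the categorical variable X = favorite color.
data = ["blue", "red", "blue", "green", "blue", "red"]

for row in text_bar_chart(pareto_order(data)):
    print(row)
```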
Note: All of the pictures pasted in my notes are from The Introduction to the
Practice of Statistics, 7th Ed., by Moore, McCabe, & Craig.
As long as the bars in a bar chart are the same width, the area principle will hold. If
Category A has twice as many observations as Category B, its bar will be twice as
high, and since the widths are the same, it will also have twice as much area.
Notice that for this to actually happen, the vertical axis must start at 0. Sometimes,
if all of the numbers are large, you may want to start the labels at a number different
from 0. It is OK to do this as long as you tell the reader you are doing it—a clear
break should be put in the graph to show the eye that part of the picture is missing,
and therefore the Area Principle no longer applies. You then know that you have
to adjust for the missing areas. One of the primary misuses of statistics is to employ
a misleading graph, where the labeling does not start at 0 and there is no break to
make it obvious that this is done.
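To see how much a nonzero starting point can distort the picture, here is a small arithmetic sketch. The frequencies and the function name drawn_height_ratio are invented for illustration.

```python
def drawn_height_ratio(count_a, count_b, axis_start=0):
    """Ratio of the drawn bar heights when the vertical axis starts at axis_start."""
    if axis_start >= min(count_a, count_b):
        raise ValueError("axis_start must be below both counts")
    # Only the part of each bar above axis_start is actually drawn.
    return (count_a - axis_start) / (count_b - axis_start)


# Hypothetical frequencies for Category A and Category B.
a, b = 1050, 1000

print(drawn_height_ratio(a, b))                  # 1.05: the bars reflect the data
print(drawn_height_ratio(a, b, axis_start=950))  # 2.0: A looks twice as large as B
```

Category A has only 5% more observations than Category B, yet with the axis starting at 950 its bar is drawn twice as tall.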
Another problem with bar charts that the advent of technology has allowed is
3-dimensional bar charts. While they may be pretty to look at, they can be difficult to
interpret because seeing the sides of the bars can change the relative areas, making
some categories appear to be either more or less prominent than they actually are.
The picture of a bar chart done with frequencies will look exactly the same as one
done with relative frequencies or percents. The only thing that will change is the
labeling of the vertical axis; the overall relative areas of the bars will be exactly the
same.
A pie chart can also be used to display the distribution of a categorical variable. In
a pie chart, a circle is broken into wedges, the areas of which are proportional to the
frequencies in the sample. To get the angle at the tip of a wedge, simply multiply
the relative frequency by 360 degrees. Pie charts also obey the Area Principle—if
the slice is twice as big for Category A as for Category B, then Category A has
twice as many observations. This breaks down in a 3-D pie chart—while they look
snazzy, the Area Principle is violated and hence the picture can be deceptive.
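The wedge-angle rule (relative frequency times 360 degrees) is easy to check with a short sketch. The category counts and the function name wedge_angles are made up.

```python
def wedge_angles(frequencies):
    """Map each category to its wedge angle in degrees: relative frequency * 360."""
    total = sum(frequencies.values())
    return {cat: 360 * freq / total for cat, freq in frequencies.items()}


# Hypothetical frequencies for three categories.
counts = {"A": 10, "B": 5, "C": 5}
angles = wedge_angles(counts)

print(angles)  # {'A': 180.0, 'B': 90.0, 'C': 90.0}
```

Category A has twice as many observations as Category B, so its wedge has twice the angle and therefore twice the area.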
Numerical Variables
Working with numerical variables is a bit more complex because now there is only
one possible ordering of the values—the horizontal axis also will now be part of a
number line. Because of this, there is only one shape associated with a numerical
variable, and that will become the first thing that we are interested in. The Area
Principle still holds—the area of the picture over any interval on the number line
will be proportional to the number of observations in that interval.
When describing the shape of a distribution, you need to address three things—
symmetry, modes, and outliers. This is always the first thing you do when planning
any type of data analysis, because we will find later on in the semester that the
proper type of analysis often depends on the shape of the distribution.
A distribution is symmetric if the left and right sides are mirror images of each
other. If you fold it in half, the two sides will match up.
If one side or tail is longer than the other, then the distribution is said to be skewed
in the direction of the longer tail. If the right tail is longer, the distribution is said to
be skewed to the right. This means that the curve is higher over the lower numbers,
and hence relatively low numbers are more likely than relatively large numbers. An
example of a variable that would be skewed to the right is household incomes.
They cannot go below 0, but there is no upper limit. For most households, the
income will be relatively small—say less than $100,000—but there are some that
are much higher. Bill Gates’ household is in there also, far off to the right, so the
right tail is extremely long. Most variables that deal with money are skewed to the
right.
If the left tail is longer, the distribution is skewed to the left. This means that
relatively high values are more likely than the lower values. An example of a
distribution that is usually skewed left is test grades—most grades are on the high
end—centered around 75—but they can go as low as 0.
A mode of a distribution is a peak. If there is only one peak, the distribution is said
to be unimodal. If there are two peaks, it is bimodal. If there are more than two, we
usually just say it is multimodal. When you see more than one peak in a
distribution, it usually indicates that there is more than one subgroup in the
population. For example, the distribution of the variable X = heights of adults
might have two peaks—one at the average height of men and the other at the
average height of women. In this case, the reason for the two peaks is obvious, but
this may not always be the case. The distribution of mean SAT scores by state is
bimodal. The two subgroups are the states that require the SAT for college
admission and the group of states that don’t. Mean SAT scores are higher in states
that don’t require the SAT for college admission because only the best students in
those states tend to take it, while a very large percentage of students take it in states
that require it.
Here is another example, from Example 1.14 on p. 15. This is the histogram of the
lengths of all 31,492 calls made to the customer service center of a small bank in a
month:
This histogram has a classic skewed right shape, except for the spike at the
beginning, which actually makes it bimodal. Obviously something strange is going
on here! An investigation found that representatives were penalized if their average
call length was too high, so some of them were just hanging up on callers to bring
their average down. When the bank stopped penalizing them, the spike at the
beginning disappeared.
The third thing to look for is gaps or outliers. An outlier is a data point that falls
away from the others, outside of the overall pattern. It is very important to identify
these, because often they are a result of an incorrectly recorded value. The first
thing you should do if you find an outlier is determine if it is actually a correct data
value. If it is, then you will have to consider the fact that it is there when you plan
the analysis.
One more comment about the shape of a distribution: it is what it is. Later on in
this class, the analysis will depend on the shape, and symmetric distributions are our
favorites. That doesn’t mean that we can force the distribution to be the shape that
we want. Rather, we have to adjust our analysis to whatever the inherent shape of
the distribution is.
The two most common types of pictures to draw of numerical data sets are a stem
and leaf display (or stemplot) and a histogram. A histogram is the numerical
version of a bar chart, where the horizontal axis now becomes a number line and the
bars touch. Since there is only one ordering on the number line, the shape now
matters. Other than that, the characteristics of the histogram are the same as for the
bar chart.
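The counting step behind a histogram can be sketched as follows. The notes do not say which endpoint of an interval is included, so this sketch assumes equal-width intervals that include the left endpoint and exclude the right one; the scores and the function name histogram_counts are made up.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins equal-width intervals [left, left + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        # Values outside the covered range are ignored in this sketch.
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical test scores.
scores = [52, 67, 71, 73, 78, 80, 81, 84, 85, 88, 90, 93]
counts = histogram_counts(scores, start=50, width=10, n_bins=5)

for i, count in enumerate(counts):
    left = 50 + 10 * i
    print(f"{left:>3} to {left + 10:<3} | {'#' * count}")
```

The printed rows are a histogram turned on its side. Because the intervals are adjacent on the number line, the long lower tail of this made-up data set (skewed left) is visible in the picture.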
A stemplot is found by breaking the numbers into stems and leaves, which should
be clearly labeled. The stems are then listed vertically, with a line next to them, and
the leaf for each number in the data set is listed after its stem. There are a couple of
variations for stemplots that can be done. For large numbers you can either truncate
(trim) the data values or use multi-digit leaves; if you do the latter, you need to put
a space between the leaves of different data values.
It is also possible to “split” the stems, which actually means using more than one of
each. To keep the Area Principle intact, there must be the same number of possible
leaves for each listed stem. If there is one of each stem, digits 0-9 will appear after
each; or there can be two of each stem with digits 0-4 after the first and 5-9 after the
second; or five of each stem with digits 0-1 after the first, 2-3 after the second, etc;
or 10 of each stem with only one possible digit for each. The last option is equivalent to
doing a tally or dot plot, so it would never actually be done. If your stemplot is too
bunched together—for example, if some of the rows of leaves are longer than the
column of stems—you can split the stems to spread it out. An example is given
below.
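Separately from the textbook figures, the splitting rule can be sketched in code. The function stemplot and its splits parameter are hypothetical names, the data values are made up, and the sketch assumes non-negative whole numbers with tens as stems and ones as leaves.

```python
def stemplot(values, splits=1):
    """Return stemplot rows; stems are tens, leaves are ones.

    splits is the number of rows per stem. It must be 1, 2, 5, or 10 so that
    every row holds the same number of possible leaf digits.
    """
    if splits not in (1, 2, 5, 10):
        raise ValueError("splits must be 1, 2, 5, or 10")
    per_row = 10 // splits  # possible leaf digits per row
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows.setdefault((stem, leaf // per_row), []).append(str(leaf))
    lo, hi = min(values) // 10, max(values) // 10
    keys = [(stem, part) for stem in range(lo, hi + 1) for part in range(splits)]
    # Drop empty rows before the first and after the last data value,
    # but keep empty rows in between so gaps stay visible.
    first, last = keys.index(min(rows)), keys.index(max(rows))
    return [f"{stem:>2} | {''.join(rows.get((stem, part), []))}"
            for stem, part in keys[first:last + 1]]


# Hypothetical data set.
data = [12, 15, 21, 23, 23, 24, 26, 27, 28, 29,
        30, 31, 31, 34, 35, 36, 38, 42, 47]

print("\n".join(stemplot(data)))
print()
print("\n".join(stemplot(data, splits=2)))
```

With splits=2 each stem appears twice, with leaves 0-4 on the first row and 5-9 on the second, so the four bunched rows spread out into eight.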
Two data sets can be compared by using back-to-back stemplots. (This can be done
with a histogram also.) You use one column of stems, but have leaves going to the
left for one data set and to the right for the other. (For a histogram, there is one axis
with the bars going up and down for the two different data sets.)
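A back-to-back stemplot can be sketched in the same spirit. The two groups below are made-up numbers, not the textbook data, and back_to_back is a hypothetical helper that assumes non-negative whole numbers with tens as stems and ones as leaves.

```python
def back_to_back(left_values, right_values):
    """Back-to-back stemplot rows sharing one column of stems."""
    def leaves_by_stem(values):
        table = {}
        for v in sorted(values):
            table.setdefault(v // 10, []).append(str(v % 10))
        return table

    left = leaves_by_stem(left_values)
    right = leaves_by_stem(right_values)
    stems = range(min(min(left), min(right)), max(max(left), max(right)) + 1)
    # Left leaves are reversed so the smallest leaf sits next to the stem.
    left_text = {s: "".join(reversed(left.get(s, []))) for s in stems}
    pad = max(len(text) for text in left_text.values())
    return [f"{left_text[s]:>{pad}} | {s} | {''.join(right.get(s, []))}"
            for s in stems]


# Two hypothetical data sets to compare.
group_a = [14, 18, 22, 25, 27, 31]
group_b = [19, 23, 24, 28, 33, 36, 41]

print("\n".join(back_to_back(group_a, group_b)))
```

Reading outward from the shared stems, the leaves on both sides increase, so the two shapes can be compared row by row.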
The following stemplots from Examples 1.11-12 on pp. 10-11 demonstrate these
ideas. The data is 25-hydroxy vitamin D in ng/ml for a sample of 20 adolescent
girls and 20 adolescent boys. Stems are tens and leaves are ones.
Note: Stemplots give the exact same shape information as histograms! A stemplot
rotated 90˚ has the exact same information as a histogram with interval widths of
10:
Which type of picture is better, the histogram or stemplot, depends on the data set
and how you are going to use it. These can be summarized as follows.
Advantages of a stemplot over a histogram
1. For small data sets, it is much easier and quicker to do.
2. If the leaves are arranged in order, it is a fast way to rank the numbers from
smallest to largest, which is necessary to find percentiles.
3. It is possible to recover (at least part of) the original data, which cannot be done
with a histogram.
Advantages of a histogram over a stemplot
1. It can be used for very large data sets, for which a stemplot would not be
practical.
2. You have complete control over the interval widths—you can make them
anything you want, as opposed to being stuck with widths of 10, 5, 2, or 1.
Timeplots
There is one other type of data display that gives completely different information
than the stemplot or histogram. Both of those show the distribution of the variable,
but they give you no information whatsoever about how the variable evolves over
time. If you want the latter, you need to do a timeplot.
A histogram has the values of the variable on the horizontal axis and either
frequency, relative frequency, or percent on the vertical axis. A timeplot puts the
values of the variable on the vertical axis, and the horizontal axis is some
representation of time—either order of the observations, or an actual time unit such
as minutes, hours, days, months, years, or decades. A timeplot gives you no
information about the shape of the distribution, but rather tells you which times have
smaller values and which times have larger values. When looking at a timeplot,
you are looking for trends—a general up or down movement; cycles—a recurring
up and down movement, such as some type of seasonal variation; and also
departures from the overall pattern. Any time you see a departure, you should look
for a reason for it.
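A short sketch shows why the distribution alone cannot reveal behavior over time. The two series are made up and contain exactly the same values in a different time order; the helper name changes is hypothetical.

```python
from collections import Counter

# Two hypothetical series of six monthly values: same values, different order.
rising = [10, 11, 12, 13, 14, 15]
zigzag = [10, 15, 11, 14, 12, 13]

# Identical distributions, so a histogram or stemplot cannot tell them apart.
print(Counter(rising) == Counter(zigzag))  # True


def changes(series):
    """Change from each time point to the next, in time order."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# The time order is what a timeplot displays.
print(changes(rising))  # [1, 1, 1, 1, 1]: a steady upward trend
print(changes(zigzag))  # [5, -4, 3, -2, 1]: repeated up and down movement
```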
One more overall comment about drawing pictures of data, and this is true for any
picture that has axes (bar graph, histogram, timeplot, etc.): It is possible to change
the way it looks by changing the scale. When looking at pictures given in books,
newspapers, magazines, websites, etc., you should always pay close attention to the
scale. If there is no scale given, you should not believe what it indicates without
further investigation. Small changes can be made to look large by stretching the
vertical axis, and large changes can be made to look small by shortening it.