Variable Coding

A Short Introduction
Prepared by Mirya Holman
There are three kinds of data
◦ Qualitative
◦ Quantitative
◦ Ordinal
◦ Qualitative (also called nominal) data is distinguished
by being a set of unordered categories.
x Qualitative variables differ in quality, not quantity or
magnitude
x Examples: Race, gender
◦ Quantitative (or interval) data varies in magnitude.
x Each possible value of a quantitative variable is greater
than or smaller than any other possible value.
x Examples: Education, income
x Quantitative data can either be…
x discrete, if it can take on a finite number of values
x The number of visits to the dentist last year
x or continuous, if it can take an infinite continuum of
possible real number values
x The number of minutes it takes to finish a book
Ordinal data consists of categorical scales
that have a natural ordering of values
◦ It does not have defined interval distances between the
values.
◦ Ordinal data is usually transformed into interval data, that is,
data on categorical scales with defined interval
distances between the values
◦ Examples: Political identification (Strong Democrat to
Strong Republican) or Class (low, middle, high).
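Turning an ordinal scale into numbers can be sketched as follows. This is a minimal illustration for the Class example; the specific codes (1, 2, 3) and the names CLASS_CODES and code_class are assumptions made here, not values from the slides. Using equally spaced codes is itself a choice: it treats the step from low to middle as the same size as the step from middle to high.

```python
# Hypothetical codes for the "Class" example; the numbers are an assumption.
CLASS_CODES = {"low": 1, "middle": 2, "high": 3}


def code_class(label):
    """Return the numeric code for a class label, ignoring case and outer spaces."""
    return CLASS_CODES[label.strip().lower()]


responses = ["Middle", "low", "HIGH", "middle"]
coded = [code_class(r) for r in responses]
print(coded)  # [2, 1, 3, 2]
```

Because the codes preserve the natural ordering, comparisons such as "higher than middle" can be written as coded value > 2.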
Coding variables is a way to change
qualitative data to quantitative data
We normally do this to perform statistical analysis on the
qualitative data
Coding a variable consistently assigns a
numerical value to a qualitative trait
◦ Example: Gender is a qualitative trait (or a variable
without a natural ordering)
◦ We can assign male and female each a numerical
value (say, zero and one). Now we have numbers to
do statistics with!
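The gender example can be sketched in a few lines. The names GENDER_CODES and code_gender and the three sample responses are made up for illustration; only the 0/1 assignment comes from the example above.

```python
# Male = 0 and female = 1, as in the example; the sample responses are invented.
GENDER_CODES = {"male": 0, "female": 1}


def code_gender(value):
    """Look up the numeric code for a gender label, ignoring case and spaces."""
    return GENDER_CODES[value.strip().lower()]


sample = ["Female", "male", "female"]
coded_gender = [code_gender(v) for v in sample]

# With a 0/1 code, the mean of the column is the share of cases coded 1.
share_female = sum(coded_gender) / len(coded_gender)
print(coded_gender)  # [1, 0, 1]
```

One practical payoff of a 0/1 code is visible in the last step: the average of the coded column is simply the proportion of cases in the category coded 1.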
We code the variables for 3 primary reasons:
◦ 1: We can run statistical models
◦ 2: Our computer programs will understand the
variables
◦ 3: Accountability – we can run models “blind,” or
without knowing what variables stand for, in order
to reduce programmer / author bias.
Say that we want to look at employment
discrimination settlements
◦ We are interested in whether the type of
representation has an effect on the outcome of the
case.
◦ We look at four types: Pro se, EEOC, appointed
counsel, and other. Now, these are qualitative data.
◦ But! We want to know what effect the type of
representation has on the amount received in a
settlement
So… we assign consistent numerical values to
each type of representation, so that…
◦ Pro se = 1
◦ EEOC = 2
◦ Appointed counsel = 3
◦ Other = 4
Now we can run an ANOVA test, which will
statistically compare the mean settlement
amount for each type of representation, and
determine whether the differences are
statistically significant.
◦ NOTE: Statistically significant, in this and many
other applications, means that a difference as large as the
one you find would be unlikely to arise by chance alone
if there were no real difference between the groups.
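To make the ANOVA step concrete, here is a rough sketch that computes the one-way ANOVA F statistic by hand for cases grouped by representation code. The settlement amounts are invented toy numbers, not real data, and the function name one_way_anova_f is an assumption for this illustration. Turning F into a significance level means comparing it against the F distribution, which a statistics package (for example SciPy's scipy.stats.f_oneway) will do for you.

```python
# Representation codes from the example; settlement amounts are invented.
REPRESENTATION_CODES = {"pro se": 1, "EEOC": 2, "appointed counsel": 3, "other": 4}

# Each row is (representation code, settlement amount in thousands).
cases = [
    (1, 10.0), (1, 12.0), (1, 14.0),
    (2, 20.0), (2, 22.0), (2, 24.0),
    (3, 15.0), (3, 17.0), (3, 19.0),
    (4, 11.0), (4, 13.0), (4, 15.0),
]


def one_way_anova_f(rows):
    """Return the F statistic: between-group variance over within-group variance."""
    groups = {}
    for code, amount in rows:
        groups.setdefault(code, []).append(amount)

    all_values = [amount for _, amount in rows]
    grand_mean = sum(all_values) / len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values()
    )
    # Variation of individual cases around their own group mean.
    ss_within = sum(
        (x - sum(v) / len(v)) ** 2 for v in groups.values() for x in v
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


f_stat = one_way_anova_f(cases)
print(f_stat)  # 15.5
```

Note that the codes 1 to 4 are used only as group labels here. ANOVA never treats "Other = 4" as four times "Pro se = 1", which is why arbitrary but consistent codes are enough.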
Asbestos cases:
I want to investigate whether the nature of
asbestos litigation changed between 1992
and 2001.
How? By Coding!
Example 2…
What is the process?
◦ Step 1: Each case is entered into a spreadsheet,
including information on the number of plaintiffs,
the number of defendants, the award amount (if
any), the type of award(s), the claim, etc.
◦ Step 2: Each time we deal with a qualitative element
of the case, we transform that into a quantitative
descriptor
◦ Step 3: We can run statistical analysis on the data
A How To:
This is what the data looks like when we enter it in:

Case #   Plaintiff                      Defendant             Award
320278   DAVID and Susan TAYLOR         JOHN CRANE INC        3029849
01L781   James and Terry Crawford       ACandS Inc., et al    16000000
98-1386  Andrew and Marietta Prebehall  Harbison & Walker Co  593000

Case #   Type of Award                               Claim
320278   compensatory, loss of consortium            mesothelioma
01L781   compensatory, punitive, loss of consortium  Mesothelioma
98-1386  wrongful death, loss of consortium          Lung cancer

This is in qualitative form!
We want to code the data, to transform it into
quantitative data… so, let’s start with the claim:
We decide that we are going to consistently assign
each type of claim a numerical identifier:

Case #   Plaintiff  Defendant  Award
320278   DAVID      JOHN C     3029849
01L781   James an   ACandS     16000000
98-1386  Andrew     Harbiso    593000

Case #   Type of Award                               Claim         Claim2
320278   compensatory, loss of consortium            mesothelioma  1
01L781   compensatory, punitive, loss of consortium  Mesothelioma  1
98-1386  wrongful death, loss of consortium          Lung cancer   3

The number we assign does not matter as much as
the consistency with which we assign the code.
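A minimal sketch of the Claim2 step, assuming the claim text is typed in by hand and may differ in capitalization (as with "mesothelioma" and "Mesothelioma" above). Only the two codes shown in the example are included; CLAIM_CODES and code_claim are illustrative names, and any other claim type would need its own code added to the lookup first.

```python
# Codes taken from the example: mesothelioma = 1, lung cancer = 3.
# No other claim codes are shown in the example, so none are defined here.
CLAIM_CODES = {"mesothelioma": 1, "lung cancer": 3}


def code_claim(text):
    """Normalize the claim text, then look up its code; fail loudly if unknown."""
    key = text.strip().lower()
    if key not in CLAIM_CODES:
        raise KeyError(f"No code defined for claim: {text!r}")
    return CLAIM_CODES[key]


claims = ["mesothelioma", "Mesothelioma", "Lung cancer"]
claim2 = [code_claim(c) for c in claims]
print(claim2)  # [1, 1, 3]
```

Raising an error for an unknown claim, rather than inventing a code on the spot, is one way to enforce consistency: every new category has to be added to the lookup deliberately.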
Next, we tackle damages.
Here it is easier to make separate columns for each
type of damage, and then indicate with a 0/1
whether that damage was awarded:

Award     Type of Award                               Compensatory  Punitive  Loss of consortium  Wrongful death
3029849   compensatory, loss of consortium            1             0         1                   0
16000000  compensatory, punitive, loss of consortium  1             1         1                   0
593000    wrongful death, loss of consortium          0             0         1                   1
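The 0/1 columns can be produced mechanically from the comma-separated "Type of Award" text. The sketch below assumes the damage types are always separated by commas and spelled as in the three cases above; DAMAGE_TYPES and code_damages are names chosen for this illustration.

```python
# One output column per damage type, in this fixed order.
DAMAGE_TYPES = ["compensatory", "punitive", "loss of consortium", "wrongful death"]


def code_damages(type_of_award):
    """Return a 0/1 indicator for each damage type named in the text."""
    awarded = {part.strip().lower() for part in type_of_award.split(",")}
    return [1 if damage in awarded else 0 for damage in DAMAGE_TYPES]


awards = [
    "compensatory, loss of consortium",
    "compensatory, punitive, loss of consortium",
    "wrongful death, loss of consortium",
]
rows = [code_damages(a) for a in awards]
for row in rows:
    print(row)
# [1, 0, 1, 0]
# [1, 1, 1, 0]
# [0, 0, 1, 1]
```

Separate 0/1 columns work better than a single code here because one case can carry several types of damages at once.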
We can leave the damages amount alone, since it is
already in numerical form
We can transform the plaintiffs, by coding the
number of plaintiffs or the type of plaintiffs.

Case #   Plaintiff                       Num_plt  Type_plt  Defendant        Award
320278   DAVID and Susan TAYLOR          2        2         JOHN CRANE       3029849
01L781   James and Terry Crawford        2        2         ACandS Inc., et  16000000
98-1386  Andrew and Marietta Prebehalla  2        2         Harbison & Wal   593000

Here, all our plaintiffs are married couples, so there
are 2 plaintiffs, and we give them a code of “2.” We
could, for example, give a single plaintiff a code of
“1” and a surviving spouse, who is suing for the
estate, a code of “3.”
Codebook!
When we are coding, it is important to keep
track of what we code, and how we code it.
This is usually kept in a codebook, which
documents what each variable means.
So, for the asbestos cases, our codebook
would include:
◦ Type_plt = Type of plaintiff. 1= single plaintiff. 2=
married plaintiffs. 3=surviving spouse, suing on
behalf of the estate.
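A codebook can be as simple as a lookup table kept next to the data. The sketch below stores the Type_plt entry from above in a Python dictionary; the layout (a "description" plus a "values" table) and the helper names decode and is_valid are assumptions for illustration, not a standard format.

```python
# Codebook entry for Type_plt, with the values listed above.
CODEBOOK = {
    "Type_plt": {
        "description": "Type of plaintiff",
        "values": {
            1: "single plaintiff",
            2: "married plaintiffs",
            3: "surviving spouse, suing on behalf of the estate",
        },
    },
}


def decode(variable, code):
    """Translate a numeric code back into its documented meaning."""
    return CODEBOOK[variable]["values"][code]


def is_valid(variable, code):
    """Check that a code entered in the spreadsheet is defined in the codebook."""
    return code in CODEBOOK[variable]["values"]


print(decode("Type_plt", 2))  # married plaintiffs
```

Keeping the codebook in a machine-readable form means the same file can both document the variables and catch data-entry mistakes, such as a Type_plt of 4 that was never defined.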
Now we have the data in a form which allows
us to model or manipulate it, in order to
better understand trends and relationships.
Final thoughts
In order to code correctly, we MUST:
◦ Be Consistent in our coding
x i.e., if female = 1 once, female = 1 always
◦ Know what you are coding!
x Coding is NOT an exact science in most circumstances
x Knowing the context can help you determine where to put
a case / plaintiff / award when it does not exactly fit your
categories
◦ When in doubt, have someone else code a sample of
your data, and see how consistent the two sets of codes are.
◦ Keep track of what you do! Use a codebook!
◦ This is an intuitive process, and everyone makes
mistakes! Take your time!
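The consistency check suggested above can be sketched as a simple percent-agreement calculation between two coders. The two lists of codes are invented for illustration, and percent_agreement is a name chosen here. Percent agreement is only a rough measure because it does not correct for agreement that would happen by chance; chance-corrected measures such as Cohen's kappa exist for that purpose.

```python
def percent_agreement(coder_a, coder_b):
    """Share of cases on which two coders assigned the same code."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must code the same cases")
    if not coder_a:
        raise ValueError("Need at least one coded case")
    matches = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    return matches / len(coder_a)


# Invented claim codes from two coders working on the same five cases.
first_coder = [1, 1, 3, 1, 3]
second_coder = [1, 1, 3, 3, 3]

agreement = percent_agreement(first_coder, second_coder)
# Positions where the coders disagree are the cases worth discussing.
disagreements = [
    i for i, (a, b) in enumerate(zip(first_coder, second_coder)) if a != b
]
print(agreement)      # 0.8
print(disagreements)  # [3]
```

Listing the disagreements is often more useful than the score itself, since those cases show where the category definitions in the codebook need tightening.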