A Short Introduction
Prepared by Mirya Holman

• There are three kinds of data
  ◦ Qualitative
  ◦ Quantitative
  ◦ Ordinal

• Qualitative (also called nominal) data is distinguished by being a set of unordered categories.
  ▪ Qualitative variables differ in quality, not quantity or magnitude
  ▪ Examples: Race, gender
• Quantitative (or interval) data varies in magnitude.
  ▪ Each possible value of a quantitative variable is greater than or smaller than any other possible value.
  ▪ Examples: Education, income
• Quantitative data can be either…
  ▪ discrete, if it can take on only a countable set of separate values
    Example: The number of visits to the dentist last year
  ▪ or continuous, if it can take any value in an infinite continuum of real numbers
    Example: The number of minutes it takes to finish a book

• Ordinal data consists of categorical scales that have a natural ordering of values
  ◦ It does not have defined interval distances between the values.
  ◦ Ordinal data is usually transformed into interval data, that is, data on categorical scales with defined interval distances between the values
  ◦ Examples: Political identification (Strong Democrat to Strong Republican) or Class (low, middle, high).

• Coding variables is a way to change qualitative data to quantitative data
• We normally do this to perform statistical analysis on the qualitative data
• Coding a variable consistently assigns a numerical value to a qualitative trait
  ◦ Example: Gender is a qualitative trait (a variable without a natural ordering)
  ◦ We can assign male and female each a numerical value (say, zero and one). Now we have numbers to do statistics with!

• We code the variables for 3 primary reasons:
  ◦ 1: We can run statistical models
  ◦ 2: Our computer programs will understand the variables
  ◦ 3: Accountability: we can run models “blind,” or without knowing what the variables stand for, in order to reduce programmer / author bias.
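As a minimal sketch of what "consistently assigning a numerical value" can look like in practice, the Python below keeps one fixed mapping per variable and refuses to invent a code for a value it has not seen. Python is a choice made for this illustration, not something the slides prescribe. The 0/1 gender codes follow the "say, zero and one" example above; the 1 to 3 codes for class, the function name `code_values`, and the sample responses are all assumptions made for illustration.

```python
# Hypothetical code mappings. The 0/1 gender codes follow the text's
# "say, zero and one"; the 1-3 class codes are an assumption.
GENDER_CODES = {"male": 0, "female": 1}
CLASS_CODES = {"low": 1, "middle": 2, "high": 3}


def code_values(values, codes):
    """Return the numeric code for each raw value, using one fixed mapping.

    Lookups ignore case and surrounding whitespace. A value with no entry
    in the mapping raises an error rather than being given a code on the
    fly, because improvised codes are how inconsistency creeps in.
    """
    coded = []
    for raw in values:
        key = raw.strip().lower()
        if key not in codes:
            raise ValueError(f"no code defined for {raw!r}")
        coded.append(codes[key])
    return coded


# Made-up responses, for illustration only.
gender = code_values(["Female", "male", "female "], GENDER_CODES)
social_class = code_values(["middle", "High", "low"], CLASS_CODES)
print(gender)        # [1, 0, 1]
print(social_class)  # [2, 3, 1]
```

Keeping the mapping in one place, rather than typing numbers by hand, means that "female = 1" cannot silently become "female = 0" halfway through a dataset.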
• Say that we want to look at employment discrimination settlements
  ◦ We are interested in whether the type of representation has an effect on the outcome of the case.
  ◦ We look at four types: pro se, EEOC, appointed counsel, and other. These are qualitative data.
  ◦ But! We want to know what effect the type of representation has on the amount received in a settlement
• So… we assign consistent numerical values to each type of representation, so that…
  ◦ Pro se = 1
  ◦ EEOC = 2
  ◦ Appointed counsel = 3
  ◦ Other = 4
• Now we can run an ANOVA test, which will statistically compare the mean settlement amount for each type of representation, and determine whether the differences are statistically significant.
  ◦ NOTE: Statistically significant, in this and many other applications, means that the differences you find are unlikely to have arisen by chance alone (at the significance level you choose).

Example 2…
• Asbestos cases: I want to investigate whether the nature of asbestos litigation changed between 1992 and 2001.
• How? By coding!

• What is the process?
  ◦ Step 1: Each case is entered into a spreadsheet, including information on the number of plaintiffs, the number of defendants, the award amount (if any), the type of award(s), the claim, etc.
  ◦ Step 2: Each time we deal with a qualitative element of the case, we transform that into a quantitative descriptor
  ◦ Step 3: We can run statistical analysis on the data

• A How To: This is what the data looks like when we enter it in:

Case #   Plaintiff                       Defendant             Award     Type of Award                               Claim
320278   DAVID and Susan TAYLOR          JOHN CRANE INC        3029849   compensatory, loss of consortium            mesothelioma
01L781   James and Terry Crawford        ACandS Inc., et al    16000000  compensatory, punitive, loss of consortium  Mesothelioma
98-1386  Andrew and Marietta Prebehalla  Harbison & Walker Co  593000    wrongful death, loss of consortium          Lung cancer

This is in qualitative form!

• We want to code the data, to transform it into quantitative data… so, let’s start with the claim:
• We decide that we are going to consistently assign each type of claim a numerical identifier (Claim2):

Case #   Type of Award                               Claim         Claim2
320278   compensatory, loss of consortium            mesothelioma  1
01L781   compensatory, punitive, loss of consortium  Mesothelioma  1
98-1386  wrongful death, loss of consortium          Lung cancer   3

The number we assign does not matter as much as the consistency with which we assign the code.

• Next, we tackle damages. Here it is easier to make separate columns for each type of damage, and then indicate with a 0/1 whether that damage was awarded:

Award     Type of Award                               Compensatory  Punitive  Loss of consortium  Wrongful death
3029849   compensatory, loss of consortium            1             0         1                   0
16000000  compensatory, punitive, loss of consortium  1             1         1                   0
593000    wrongful death, loss of consortium          0             0         1                   1

• We can leave the damages amount alone, since it is already in numerical form
• We can transform the plaintiffs, by coding the number of plaintiffs or the type of plaintiffs.

Case #   Plaintiff                       Num_plt  Type_plt  Defendant             Award
320278   DAVID and Susan TAYLOR          2        2         JOHN CRANE INC        3029849
01L781   James and Terry Crawford        2        2         ACandS Inc., et al    16000000
98-1386  Andrew and Marietta Prebehalla  2        2         Harbison & Walker Co  593000

Here, all our plaintiffs are married couples, so there are 2 plaintiffs in each case (Num_plt = 2), and we give them a plaintiff-type code of “2” (Type_plt = 2). We could, for example, give a single plaintiff a code of “1” and a surviving spouse, who is suing for the estate, a code of “3.”

Codebook!
• When we are coding, it is important to keep track of what we code, and how we code it.
This is usually kept in a codebook, which documents what each variable means. So, for the asbestos cases, our codebook would include:
  ◦ Type_plt = Type of plaintiff. 1 = single plaintiff. 2 = married plaintiffs. 3 = surviving spouse, suing on behalf of the estate.

• Now we have the data in a form which allows us to model or manipulate it, in order to better understand trends and relationships.

Final thoughts
• In order to code correctly, we MUST:
  ◦ Be consistent in our coding
    ▪ i.e., if female = 1 once, female = 1 always
  ◦ Know what we are coding!
    ▪ Coding is NOT an exact science in most circumstances
    ▪ Knowing the context can help you determine where to put a case / plaintiff / award when it does not exactly fit your categories
  ◦ When in doubt, have someone else code a sample of your data, and check how consistent their codes are with yours.
  ◦ Keep track of what you do! Use a codebook!
  ◦ This is an intuitive process, and everyone makes mistakes! Take your time!
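To make the asbestos example concrete, here is a minimal Python sketch of the claim and damages coding described in the slides. It is one possible way to do this, not the procedure actually used for the project. The three rows are copied from the example cases; the claim codes 1 and 3 come from the Claim2 column, and because the slides do not show which claim is coded 2, the sketch leaves it undefined. The names `CLAIM_CODES`, `DAMAGE_COLUMNS`, and `code_case` are hypothetical.

```python
# Rows copied from the three example cases in the slides.
cases = [
    {"case": "320278", "award_types": "compensatory, loss of consortium",
     "claim": "mesothelioma"},
    {"case": "01L781", "award_types": "compensatory, punitive, loss of consortium",
     "claim": "Mesothelioma"},
    {"case": "98-1386", "award_types": "wrongful death, loss of consortium",
     "claim": "Lung cancer"},
]

# Codebook entries. Codes 1 and 3 are taken from the Claim2 column; the
# slides do not show what code 2 stands for, so it is not defined here.
CLAIM_CODES = {"mesothelioma": 1, "lung cancer": 3}
DAMAGE_COLUMNS = ["compensatory", "punitive", "loss of consortium", "wrongful death"]


def code_case(row):
    """Turn one case's qualitative fields into numeric codes."""
    # Split the free-text award list into a set of normalized damage types.
    damages = {part.strip().lower() for part in row["award_types"].split(",")}
    unknown = damages - set(DAMAGE_COLUMNS)
    if unknown:
        raise ValueError(f"damage types missing from codebook: {sorted(unknown)}")

    # A claim that is not in the codebook raises KeyError instead of
    # receiving an improvised code.
    coded = {"case": row["case"],
             "claim2": CLAIM_CODES[row["claim"].strip().lower()]}
    # One 0/1 column per damage type: 1 if awarded, 0 if not.
    for column in DAMAGE_COLUMNS:
        coded[column] = 1 if column in damages else 0
    return coded


coded_cases = [code_case(row) for row in cases]
for row in coded_cases:
    print(row)
```

Normalizing case before the lookup is what lets "mesothelioma" and "Mesothelioma" receive the same code, which is the consistency requirement from the final thoughts. Failing loudly on anything missing from the codebook forces the coder to make, and record, an explicit decision.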