Remove Records - Customer and Business Analytics

Page 78: If the selected records are not removed, the chi-square values are not calculated.
Unfortunately, the record removal method described in step 10 is painful. Here is an alternative:
A Useful Bit of Code for Removing Records from the Data Efficiently
First, the R Commander limitation to be aware of:
In the Data Clean menu, the two remove records options:
Remove records with missing data
Remove selected records
are problematic:
The “Remove records with missing data” menu in the Rcmdr GUI only operates properly when
the “include all variables” box is checked.
The “Remove selected records” menu, introduced on page 78 in step 10 of the Contingency Table
tutorial, requires a messy manual process.
When you want to remove a record (i.e., a row) from a data set whenever a variable takes on a
certain value, two lines of script will do it. Suppose you want to remove rows from a
dataset (e.g., jack.jill) when a variable (e.g., Age) takes a particular value (e.g., “No female head”).
Enter the line below into the script window of Rcmdr, highlight it, and press Submit. It’s generally
a good idea to write a new data set with a new name, e.g., newdataset, and keep the old data
rather than overwriting it when you make changes that might be painful to undo.
First line
newdataset <- subset(dataset, subset = (variable != value))
If the value is character, as is common with categorical variables, remember to put the value in
quotes to tell R that these are to be treated as characters. A continuous variable does not need
quotes around the value.
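The difference between quoted and unquoted values can be sketched on a small made-up data frame. The data frame toy and its columns Region and Visits are invented for illustration and are not part of the tutorial data.

```r
# Toy stand-in data; names and values are invented for illustration only.
toy <- data.frame(
  Region = c("East", "West", "East", "North"),
  Visits = c(3, 0, 5, 0),
  stringsAsFactors = TRUE
)

# Categorical (character) value: quotes are required.
no.east <- subset(toy, subset = (Region != "East"))

# Continuous (numeric) value: no quotes.
some.visits <- subset(toy, subset = (Visits != 0))

nrow(no.east)
nrow(some.visits)
```

Both results are written to new names, so toy itself is left unchanged.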
For example, if you want to remove all the cases in the dataset jack.jill where the Age variable is
“No female head”:
NfhAgejack.jill <- subset(jack.jill, Age != "No female head")
View the data set to see that this has removed the cases
(!= means ‘not equal to’).
This writes all cases except those where there is no female head of the household to the new
dataset, NfhAgejack.jill. When you are working with a continuous variable, you are done. However,
in this case, where you have a categorical “factor” variable, the information that there was once a
level called “No female head” is still associated with the new data set.
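The leftover level can be seen directly in a small sketch. The data frame toy below is a made-up stand-in for jack.jill (its Age categories and Spend values are invented), so only the behaviour, not the numbers, carries over.

```r
# Toy stand-in for jack.jill; the values are invented for illustration only.
toy <- data.frame(
  Age = factor(c("Under 35", "No female head", "35 to 54",
                 "No female head", "55 plus")),
  Spend = c(120, 80, 200, 95, 150)
)

# First line: drop the rows where Age is "No female head".
kept <- subset(toy, Age != "No female head")

nrow(kept)        # the two "No female head" rows are gone
levels(kept$Age)  # but the level name is still listed
table(kept$Age)   # and it shows up with a count of zero
```

The rows are gone, but the factor still carries “No female head” as an empty category.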
This should be removed with a second line of code (make sure that the active data set
NfhAgejack.jill is selected on the Data set button).
Second line
newdataset$variable <- factor(newdataset$variable)
For example, for the Age variable,
NfhAgejack.jill$Age <- factor(NfhAgejack.jill$Age)
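As a quick check of what the second line does, here is the same idea on a made-up data frame (toy and its values are invented for illustration): re-creating the factor keeps only the levels that still occur in the data.

```r
# Toy stand-in for jack.jill; the values are invented for illustration only.
toy <- data.frame(
  Age = factor(c("Under 35", "No female head", "35 to 54",
                 "No female head", "55 plus")),
  Spend = c(120, 80, 200, 95, 150)
)

# First line: remove the rows.
kept <- subset(toy, Age != "No female head")

# Second line: rebuild the factor so the unused level is dropped.
kept$Age <- factor(kept$Age)

levels(kept$Age)  # "No female head" is no longer listed
```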
Using the Rcmdr menu Explore and Test > Summarize > Active data set, you should see that Age no
longer has the “No female head” category. (R still thinks it exists for the other categorical
variables, though.) If so, you can now use Age and Spend.Cat in a contingency table and get the
chi-square calculation and p-value.
“No female head” occurs in other factor variables (Employment, Education, etc.) as well, but
these are the same cases, so the rows have already been removed. However, you will have to execute
the second line of code for each of those variables to get rid of the leftover level information
before you can use them.
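Repeating the second line for every factor can be automated. The sketch below uses a made-up data frame (toy, with invented Age and Education values) and shows two options: applying factor() to each factor column, and base R's droplevels(), which is not mentioned in the tutorial but appears to do the same job in one call.

```r
# Toy stand-in for jack.jill; the values are invented for illustration only.
toy <- data.frame(
  Age = factor(c("Under 35", "No female head", "35 to 54", "No female head")),
  Education = factor(c("College", "No female head", "High school",
                       "No female head")),
  Spend = c(120, 80, 200, 95)
)

kept <- subset(toy, Age != "No female head")

# Option 1: run the "second line" on every factor column.
is.fac <- sapply(kept, is.factor)
kept[is.fac] <- lapply(kept[is.fac], factor)

# Option 2 (alternative, not from the tutorial): droplevels() removes
# unused levels from all factors in a data frame.
kept2 <- droplevels(subset(toy, Age != "No female head"))

levels(kept$Education)
levels(kept2$Education)
```

Either way, numeric columns such as Spend are left untouched.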
Note 1
Aside: if you want the new data set to keep the records with a given value rather than remove
them, use the double-equals comparison == rather than the not-equals comparison !=
newdataset <- subset(dataset, subset = (variable == value))
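A minimal sketch of the keep-only variant, again on a made-up data frame (toy and its values are invented for illustration):

```r
# Toy stand-in for jack.jill; the values are invented for illustration only.
toy <- data.frame(
  Age = factor(c("Under 35", "No female head", "35 to 54",
                 "No female head", "55 plus")),
  Spend = c(120, 80, 200, 95, 150)
)

# Keep only the rows where Age equals "No female head".
only.nfh <- subset(toy, subset = (Age == "No female head"))

nrow(only.nfh)
unique(as.character(only.nfh$Age))
```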
Note 2
NA, the missing value code, does not work with comparison operators: any comparison with NA
returns NA rather than TRUE or FALSE. So the script above won’t work when you want to remove cases
where the value of a specific variable is missing (that’s what you would expect the “Remove records
with missing data” command to do when the “include all variables” box is unchecked). One way
around this is to recode the NA to a different value, say “Miss” for categorical variables, or
99999 for continuous variables, then use the script above.
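The problem can be sketched on a made-up data frame (toy and its values are invented for illustration). The sketch also shows base R's is.na(), an alternative that is not part of the tutorial: it tests for missing values directly, so it may save the recoding step.

```r
# Toy data with missing values; invented for illustration only.
toy <- data.frame(
  Age = factor(c("Under 35", NA, "35 to 54", NA)),
  Spend = c(120, 80, NA, 95)
)

# Comparing with NA gives NA for every row, and subset() drops rows
# whose condition is NA, so nothing at all is kept.
broken <- subset(toy, Age != NA)
nrow(broken)

# Alternative (not from the tutorial): is.na() tests for missingness,
# so only the rows with a missing Age are removed.
has.age <- subset(toy, !is.na(Age))
nrow(has.age)
```

Only Age is checked here, so the row with a missing Spend but a valid Age is kept.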