Data Entry - NMSU College of Business

Data Entry
Slide 1
The goal for this lecture on data preparation is to summarize the steps that researchers perform
in transforming their data from its collected form into its computerized form.
Slide 2
As is often the case, I need to begin with some basic terminology. A file is a complete data set.
A record is a summary of the responses by one individual or of the data concerning one object. A field
is an individual variable within a record, such as questionnaire number, street, city, or
response to question one.
Slide 3
The end product of the data preparation process is called the data matrix. If you’re familiar with
spreadsheets such as Quattro or Excel, then what you have created on many occasions is a
data matrix. Notice for data matrices in marketing that each record is a row and the different
fields, which correspond to the different variables in a data set, are represented by the columns.
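To make the row-and-column layout concrete, here is a minimal Python sketch of a data matrix. The field names and values are hypothetical, chosen only for illustration, and are not taken from the slides.

```python
# Hypothetical fields (columns); these names are assumptions for illustration.
fields = ["questionnaire_id", "city_code", "q1"]

# Each record (row) holds one respondent's values, in field order.
data_matrix = [
    [1, 3, 5],
    [2, 1, 4],
    [3, 3, 2],
]

def column(matrix, field_names, name):
    """Return every record's value for one field (one column of the matrix)."""
    idx = field_names.index(name)
    return [row[idx] for row in matrix]

q1_values = column(data_matrix, fields, "q1")
print(q1_values)
```

Reading across a row gives one respondent's record; reading down a column gives one variable for all respondents.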
Slide 4
Simply defined, data entry is the process of transferring data from sources like questionnaires
into a computer database. Often that data is numeric, in which case the subsequent analysis is
statistical in nature. Other times that data is in alpha or word form, such as in content analyses,
and subsequent analysis is done with specialized programs that do word counts and the like.
Slide 5
I’ll now discuss the following five steps in data preparation: validation, editing, coding, data
entry, and machine cleaning of data.
Slide 6
In the data collection stage, researchers check that interviews were conducted as specified.
They ensure all respondents were qualified to participate in the study. If personal interviews
were conducted, interviewers must be monitored for professional behavior and appearance.
Researchers should check that interviews were conducted in the proper environment; for in-office
or in-home interviews, the researcher must ensure that they were not conducted at a local
tavern! Finally, the validation process includes checking that all appropriate questions were
asked and that interviewers didn’t skip certain questions for inappropriate reasons.
Slide 7
The next stage in the data preparation process is the editing stage. This differs by type of
interview. For personal interviews, the researcher should check for (1) omissions, which are
improperly unanswered questions, (2) ambiguities, which are vague responses to open-ended
questions, and (3) inconsistencies, which are incompatible answers to different questions. (For
example, the answer to one question indicates childlessness but the answer to another question
indicates a parent to several children.) The researcher also should check for proper skip
patterns: answers to questions that should have been skipped should be deleted, and efforts
should be made to acquire answers to questions that should have been answered. Finally, the
researcher should check that answers were recorded properly, especially to open-ended
questions. As it’s difficult for personal interviewers to record extensive answers, some
completed interview forms may contain only one- or two-word answers, when it’s obvious
respondents provided much more detail.
Slide 8
In editing self-administered questionnaires, researchers should check that all key questions
were answered. For example, if an important aspect of the data analysis phase is to compare
the responses of males versus females, but a respondent failed to indicate his/her sex, then that
questionnaire is useless. Researchers also should check that respondents understood the
instructions and took the questionnaire seriously, which isn’t always an easy call. I’ve conducted
surveys in which respondents have circled the identical answer on a 7-point scale all the way
down a page. Were those answers truly reflective of that respondent’s opinions, or was that
respondent merely providing a meaningless response? I assumed the answers reflected the
respondent’s true opinions. Finally, researchers should check that no pages are missing, especially
for lengthy questionnaires, and that the questionnaire was returned before the cutoff date. Things
that happen in the world can influence people’s responses, so responses to questionnaires
returned after a cutoff date for submission may differ because of environmental changes.
Slide 9
The editing process is subjective. How does a researcher know that a respondent took the task
seriously? How does a researcher know what was meant by an ambiguous response? There’s
no right or wrong way to cope with such editing problems, but there are some general solutions.
If the respondent is not anonymous, then he or she should be re-contacted in an effort to
resolve response ambiguities or inconsistencies. Another, less preferred alternative is to discard
the questionnaire. Given the costs per completed questionnaire, discarding a questionnaire is
expensive, so researchers tend to avoid it. If the questionnaire seemingly was taken seriously,
but some responses are problematic and the respondent is anonymous, then a third approach is
to use only the good items. Although this practice has data implications, the implications are
beyond the scope of this course.
Slide 10
The third stage in the data preparation process is coding, which is the process of grouping
and assigning numeric codes to different question responses. This process is easier when
dealing with close-ended questions because they’re pre-coded.
Slide 11
Here’s a reminder of what I mean by a pre-coded questionnaire. There are numbers adjacent to
each of the answers that respondents might provide, and those numbers are entered into the
computer database.
Slide 12
Exploratory research may require one or more open-ended questions. The way researchers
deal with answers to those questions is as follows:
(1) Generate a complete list of all reasonable responses to each question.
(2) Have multiple judges develop response categories for each question that are mutually
exclusive and exhaustive. Developing such categories is a subjective process, so each
judge is likely to develop a unique set. Differences between those sets must be resolved
by the judges and researcher(s) because only one set can be used for coding.
(3) After the judges, coders, and researcher(s) agree on the consolidated response list, a
numeric code is assigned to each response.
(4) Have the coders return to the completed questionnaires, read each response, and
assign one or more numeric codes that correspond to the consolidated response list.
This process is involved and complex, which is why I recommend that for non-exploratory
survey research you take the time to develop close-ended questionnaire items, because coding
open-ended questions is far more difficult than creating good close-ended questions.
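The last two steps of that process (assigning a numeric code to each category, then translating each coded response) can be sketched in Python. The question, categories, and code numbers below are hypothetical, not taken from the lecture's code book.

```python
# Hypothetical consolidated response list for an open-ended question such as
# "Why did you choose this destination?" (categories and codes are assumptions).
code_book = {
    "price": 1,
    "climate": 2,
    "family or friends nearby": 3,
    "other": 9,
}

def assign_codes(categories, book):
    """Translate the categories a coder picked for one response into numeric codes."""
    unknown = [c for c in categories if c not in book]
    if unknown:
        raise ValueError(f"Categories not in the code book: {unknown}")
    return sorted(book[c] for c in categories)

# A single response may mention more than one reason, so it may get several codes.
codes = assign_codes(["climate", "price"], code_book)
print(codes)
```

Rejecting categories that are not in the code book keeps every coder working from the same consolidated list.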
Slide 13
Here’s an example of a portion of a code book for a travel study. The questions are relatively
straightforward, so the codes are relatively simple.
Slide 14
Stage four of the data preparation process is data entry, in which dated, validated, and coded
questionnaires are given to the person who will enter the data into the computer database. It’s
far more accurate and efficient to go directly from questionnaires to data entry and storage, as
opposed to transcribing the data onto coding sheets and then entering the content of those
sheets into the database. That intermediate step can only introduce additional error.
Slide 15
If specialized forms are used, such as mark sense forms, then it’s possible to avoid the data
entry operator and to scan responses directly into a computer database. Alternatively, optical
scanning of questionnaire responses—similar to the way the post office scans the address
you’ve written on an envelope and routes your letter to the appropriate post office—is possible.
Regardless, questionnaire data ultimately must be transferred to a computer database.
Slide 16
The data entry process can be either intelligent or dumb. In intelligent data entry, there’s
checking of entered data for internal logic by the data entry device or other connected device.
Excel, Quattro, and SPSS rely on dumb data entry. If intelligent data entry relies on software,
then the software can be prepared to recognize inappropriate responses. For example, if the
response to a question can be only ‘1’, ‘2’, or ‘3’, and the data entry person types a ‘4’, then the
computer will not accept that keystroke as a valid response. Dumb data entry, because it lacks
error-checking protocols, may necessitate subsequent data cleaning.
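The ‘1’, ‘2’, or ‘3’ example can be sketched as a small Python check. This is only an illustration of the idea, not how any particular data entry package works; the function name accept_entry is hypothetical.

```python
def accept_entry(keystroke, valid_codes=("1", "2", "3")):
    """Return the entered code as an int, or raise ValueError if it is not allowed."""
    if keystroke not in valid_codes:
        raise ValueError(
            f"{keystroke!r} is not a valid response; expected one of {valid_codes}"
        )
    return int(keystroke)

print(accept_entry("2"))

try:
    accept_entry("4")  # not among the valid codes, so it is rejected at entry time
except ValueError as err:
    print("Rejected:", err)
```

With dumb data entry, the ‘4’ would simply be stored, and the error would have to be caught later during data cleaning.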
Slide 17
With machine cleaning of data, software such as SPSS can be programmed to identify and
suggest fixes for logical errors. For example, it would be illogical for a respondent to indicate in
an answer to one question that he’s single and never been married, but in an answer to another
question indicate a spouse’s preferences. SPSS can be programmed either to delete or merely
flag inconsistent answers (allowing researchers to decide the appropriate fix on a case-by-case
basis). In addition, it’s possible for software such as SPSS to generate marginal reports, which
are tables of response frequencies for each question. Using SPSS in this way makes it easy to
identify invalid keystrokes and improper skip patterns. For example, marginal reports indicate
the number of responses to each question; if there are seemingly too many valid responses to a
legitimately skippable question, then the researcher would be alerted to a skip pattern problem.
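The same two ideas, a marginal report and a flag for inconsistent answers, can be sketched in Python rather than SPSS. The records, question names, and codes below are hypothetical.

```python
from collections import Counter

# Hypothetical records: q1 is marital status (1 = single, never married; 2 = married).
# q2 asks about a spouse's preference and should be skipped (None) when q1 == 1.
records = [
    {"id": 1, "q1": 1, "q2": None},
    {"id": 2, "q1": 2, "q2": 3},
    {"id": 3, "q1": 1, "q2": 2},  # inconsistent: single, yet the spouse question is answered
]

def marginal_report(rows, question):
    """Count how often each response (including missing) occurs for one question."""
    return Counter(row[question] for row in rows)

def flag_skip_errors(rows):
    """Return the IDs of records that answered q2 although q1 says it should be skipped."""
    return [row["id"] for row in rows if row["q1"] == 1 and row["q2"] is not None]

print(marginal_report(records, "q2"))
print(flag_skip_errors(records))
```

This sketch only flags the inconsistent record, which leaves the decision about the appropriate fix to the researcher.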
Slide 18
This slide shows the logic that would be programmed into a set of machine-cleaning
instructions for SPSS. For example, there are only eight valid city codes, the respondent’s ID
can’t exceed 9999, and neither negative numbers nor alpha characters are valid.
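Those three rules can be sketched in Python as follows. The slide's actual city codes are not reproduced in this transcript, so the sketch assumes they are 1 through 8; the function names are hypothetical.

```python
VALID_CITY_CODES = set(range(1, 9))  # assumption: the eight valid codes are 1 through 8
MAX_ID = 9999

def parse_non_negative_int(raw):
    """Return raw as an int, or None if it is empty or contains a sign or letters."""
    raw = raw.strip()
    if raw.isascii() and raw.isdigit():
        return int(raw)
    return None

def check_record(raw_id, raw_city):
    """Return a list of problems found in one record (an empty list means clean)."""
    problems = []
    respondent_id = parse_non_negative_int(raw_id)
    city = parse_non_negative_int(raw_city)
    if respondent_id is None:
        problems.append("ID is not a non-negative number")
    elif respondent_id > MAX_ID:
        problems.append("ID exceeds 9999")
    if city is None:
        problems.append("city code is not a non-negative number")
    elif city not in VALID_CITY_CODES:
        problems.append("city code is not one of the eight valid codes")
    return problems

print(check_record("0042", "3"))
print(check_record("-5", "A"))
```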
Slide 19
Those are the five steps in the data preparation process. Now, in finishing this lecture, I’ll briefly
discuss recoding data and ways to cope with missing data.
Slide 20
Recoding data means using computers to convert original codes, used for the raw data, into
codes that are more suitable for subsequent analysis. Perhaps reverse coding is needed. In that
case, the researcher needs to identify each question (probably an attitude question) that is
worded ‘in reverse’; for a 7-point scale, the response is subtracted from 8. There are other
ways to recode data. Although many marketing relationships are non-linear, many frequently
used statistical analyses assume linear relationships; hence, it may be necessary to transform
the data into a linear form. It may also be necessary to recode data for complex socioeconomic
variables, such as social class, which are composed of several components (level of education,
income, occupational status, and other indicators).
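Reverse coding is simple arithmetic: on a scale that runs from 1 to k, the recoded value is (k + 1) minus the original response, which for a 7-point scale is 8 minus the response. A minimal Python sketch, with a hypothetical function name:

```python
def reverse_code(response, scale_points=7):
    """Reverse-code a response on a scale that runs from 1 to scale_points."""
    if not 1 <= response <= scale_points:
        raise ValueError(f"{response} is outside the 1-{scale_points} scale")
    return (scale_points + 1) - response

# On a 7-point scale: 1 becomes 7, 7 becomes 1, and the midpoint 4 stays 4.
print([reverse_code(r) for r in (1, 4, 7)])  # prints [7, 4, 1]
```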
Slide 21
Another type of data recoding entails reducing or collapsing the number of response categories.
Here’s an example from a Likert Scale. The original scale has five points: strongly agree, agree,
neither agree nor disagree, disagree, and strongly disagree. The bottom, left-hand side of the
table in the slide shows the percentage of respondents who gave each answer. It’s possible to
collapse this scale from five points to three points by combining the ‘strongly agree’ and ‘agree’
answers and combining the ‘disagree’ and ‘strongly disagree’ answers. Such collapsing of
variables may be necessary for cross-tabulation analysis.
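Collapsing the scale amounts to a lookup from old codes to new codes. The sketch below assumes the original responses are coded 1 (strongly agree) through 5 (strongly disagree); that numbering and the sample responses are assumptions, not values from the slide.

```python
from collections import Counter

# Assumed original codes: 1 = strongly agree ... 5 = strongly disagree.
# Collapsed codes: 1 = agree, 2 = neither agree nor disagree, 3 = disagree.
COLLAPSE = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

def collapse(responses):
    """Map each 5-point response onto the 3-point scale."""
    return [COLLAPSE[r] for r in responses]

original = [1, 2, 2, 3, 4, 5, 5]  # hypothetical responses
collapsed = collapse(original)
print(Counter(collapsed))
```

The original 5-point codes should be kept in the database; the collapsed variable is an addition, not a replacement.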
Slide 22 (No Audio)
Slide 23
Although the complexities of missing data are beyond the scope of this course, let me at least
alert you to a couple of issues and suggest ways to handle missing data. Missing data is often
not randomly distributed. For example, refusal to answer an income question is related to both
gender and education level.
Slide 24
Missing data is certainly related to the type of question; the more personal and potentially
threatening the question, the more likely a non-response. Here’s an example of the relative
frequencies of non-response to different items. Questions about sex and income are the most
likely to be unanswered. Questions about age, occupation, and marital status are less likely to
be unanswered.
Slide 25
How should missing responses be handled? The possibilities include:

• Leave the response blank and allow the computer to record it as a missing response.
• If data for any question is missing, then all the data for that person (or record) could be
deleted.
• Pairwise deletion also is possible. If the correlation between two variables is of interest,
and either of those variables has missing data, then the data for that person (or record)
is excluded from that calculation only; the record remains available for analyses of other
variables.
• The mean response, either for the entire sample or for the subgroup from which the
respondent was drawn, could be substituted for the missing value. This method
assumes that the overall or group mean is the best guess for that person’s response had
he or she opted to answer the question.
• An imputed response, estimated from the respondent’s answers to the questions that are most
highly related, in a statistical fashion, to the question without data, could be calculated
and substituted for the missing response.
Each of these approaches has pluses and minuses; many have statistical implications that are
beyond the scope of this course. If you are required to conduct a marketing research study, I
recommend that you leave missing data blank in your database.
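Three of these options (deleting the whole record, pairwise deletion, and mean substitution) can be sketched in Python. The records and variable names are hypothetical, and None stands for a response left blank.

```python
from statistics import mean

# Hypothetical records; None marks a missing response.
records = [
    {"age": 34, "income": 52000, "q1": 4},
    {"age": 29, "income": None, "q1": 5},
    {"age": None, "income": 61000, "q1": 2},
]

def delete_incomplete(rows):
    """Keep only records with no missing values at all."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_keep(rows, var_a, var_b):
    """Keep records that have values for both variables used in one analysis."""
    return [r for r in rows if r[var_a] is not None and r[var_b] is not None]

def mean_substitute(rows, var):
    """Replace missing values of var with the mean of the observed values."""
    fill = mean(r[var] for r in rows if r[var] is not None)
    return [{**r, var: fill if r[var] is None else r[var]} for r in rows]

print(len(delete_incomplete(records)))               # only one record is complete
print(len(pairwise_keep(records, "income", "q1")))   # two records have both values
print(mean_substitute(records, "income")[1]["income"])
```

Note how deleting incomplete records leaves one of three records, whereas pairwise deletion keeps two for an analysis of income and q1.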