Data Preparation
• Data preparation refers to the process of ensuring the accuracy of the data and their conversion from raw form into classified forms appropriate for analysis.
Validation
Editing
Coding
Data Entry
Data Cleaning
Tabulation & Analysis
A questionnaire returned from the field may be unacceptable for several reasons.
– Parts of the questionnaire may be incomplete.
– The pattern of responses may indicate that the respondent did not understand or follow the instructions.
– The responses show little variance (for example, the same rating is given to every item).
– One or more pages are missing.
– The questionnaire is received after the preestablished cutoff date.
– The questionnaire is answered by someone who does not qualify for participation.
• Validation and editing help prepare the data for data entry.
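The reasons for rejecting a returned questionnaire listed above can be turned into a simple screening routine. The following is a minimal sketch, not a standard procedure: the record layout (`answers`, `received`, `qualified`), the function name `screen_questionnaire`, and the rule that identical answers on every item count as "little variance" are all assumptions made for illustration.

```python
from datetime import date

def screen_questionnaire(q, cutoff, required_items):
    """Return the reasons (if any) a returned questionnaire is unacceptable."""
    reasons = []
    answers = q["answers"]  # hypothetical layout: item name -> rating or None
    # Incomplete: at least one required item has no answer
    if any(answers.get(item) is None for item in required_items):
        reasons.append("incomplete")
    # Little variance (assumed rule): every answered item has the same value
    given = [v for v in answers.values() if v is not None]
    if len(given) > 1 and len(set(given)) == 1:
        reasons.append("little variance")
    # Received after the preestablished cutoff date
    if q["received"] > cutoff:
        reasons.append("late")
    # Answered by someone who does not qualify
    if not q["qualified"]:
        reasons.append("unqualified respondent")
    return reasons

example = {
    "answers": {"q1": 4, "q2": 4, "q3": 4, "q4": None},
    "received": date(2024, 3, 5),
    "qualified": True,
}
problems = screen_questionnaire(
    example, cutoff=date(2024, 3, 1), required_items=["q1", "q2", "q3", "q4"]
)
print(problems)  # ['incomplete', 'little variance', 'late']
```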
• Validation is the process of ascertaining whether the interviews were conducted in compliance with the specified norms
• It helps detect fraud or any failure by the interviewer to follow the specified instructions
• The questionnaire has a separate place to record the respondent’s name, address, telephone number, other demographic details, and the date of the interview
• This record is the basis for validation, that is, for confirming that the interview was really conducted
• Editing is the process of checking for mistakes made by the interviewer or the respondent in filling in the questionnaire
• It is a manual process that is generally done twice: first by the service firm that conducted the interviews, and second by the market research firm
• The first check is generally done by the field supervisors in the field itself
• Problems to be checked in editing include:
- Whether the interviewer has followed the skip pattern
- Whether responses to open-ended questions have been properly obtained
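Although editing is described above as a manual process, a skip-pattern check is easy to illustrate in code. This is only a sketch under assumptions: the rule format, the question names (`Q1`, `Q2`, `Q3`), and the function `skip_pattern_errors` are hypothetical and not taken from any particular questionnaire.

```python
# Hypothetical skip rule: if Q1 is answered "No", Q2 and Q3 must be left blank.
SKIP_RULES = [
    {"if_question": "Q1", "equals": "No", "skip": ["Q2", "Q3"]},
]

def skip_pattern_errors(response, rules=SKIP_RULES):
    """List the questions that were answered although they should have been skipped."""
    errors = []
    for rule in rules:
        if response.get(rule["if_question"]) == rule["equals"]:
            for question in rule["skip"]:
                if response.get(question) not in (None, ""):
                    errors.append(f"{question} answered but should have been skipped")
    return errors

errors = skip_pattern_errors({"Q1": "No", "Q2": "Daily", "Q3": ""})
print(errors)  # ['Q2 answered but should have been skipped']
```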
During editing, some responses are found to be illegible, incomplete, inconsistent, or ambiguous; these are called unsatisfactory responses.
• Treatment of Unsatisfactory Responses
– Returning to the Field – The questionnaires with unsatisfactory responses may be returned to the field, where the interviewers recontact the respondents.
– Assigning Missing Values – If returning the questionnaires to the field is not feasible, the editor may assign missing values to unsatisfactory responses.
– Discarding Unsatisfactory Respondents – In this approach, the respondents with unsatisfactory responses are simply discarded.
• Coding: the process of assigning a symbol, usually a number, to each possible response to each question
• Coding is necessary for efficient data analysis
• The categories of responses used for coding should be:
- Appropriate: if income is an important variable, very wide income classes may not be appropriate
- Exhaustive: the categories should list all possible alternatives
- Mutually exclusive: a response should not fit into more than one category
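The exhaustive and mutually exclusive requirements can be checked mechanically for a numeric variable such as income. The sketch below assumes each category is a half-open interval [lower, upper) with a numeric code; the function `check_brackets` and the income limits are hypothetical illustrations.

```python
def check_brackets(brackets, low, high):
    """brackets: list of (code, lower, upper), each covering [lower, upper).

    Returns a list of problems: gaps mean the scheme is not exhaustive,
    overlaps mean it is not mutually exclusive.
    """
    ordered = sorted(brackets, key=lambda b: b[1])
    problems = []
    if ordered[0][1] > low:
        problems.append("not exhaustive: gap at bottom")
    if ordered[-1][2] < high:
        problems.append("not exhaustive: gap at top")
    # Compare each bracket with the next one in ascending order
    for (code1, _, upper1), (code2, lower2, _) in zip(ordered, ordered[1:]):
        if lower2 < upper1:
            problems.append(f"codes {code1} and {code2} overlap")
        elif lower2 > upper1:
            problems.append(f"gap between codes {code1} and {code2}")
    return problems

# Hypothetical income classes; codes 2 and 3 share the 45000-50000 range
income_brackets = [(1, 0, 20000), (2, 20000, 50000), (3, 45000, 100000)]
bracket_problems = check_brackets(income_brackets, low=0, high=100000)
print(bracket_problems)  # ['codes 2 and 3 overlap']
```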
• Coding closed-ended questions is easy since there is a definite number of predetermined responses
• Closed-ended questions are generally precoded, so the intermediate step of framing the codes prior to data entry can be avoided
• Coding the data from open-ended questions is much more difficult, as the possible responses are unlimited and vary
• Guidelines for coding unstructured questions:
- Category codes should be mutually exclusive and collectively exhaustive.
- Only a few (10% or less) of the responses should fall into the “other” category.
- Category codes should be assigned for critical issues even if no one has mentioned them.
- Data should be coded to retain as much detail as possible.
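The 10% guideline for the "other" category is a simple proportion and can be verified once responses are coded. A minimal sketch follows; the category labels and counts are invented for illustration only.

```python
from collections import Counter

def other_share(coded_responses, other_code="other"):
    """Fraction of coded responses that fell into the 'other' category."""
    counts = Counter(coded_responses)
    return counts[other_code] / len(coded_responses)

# Hypothetical coded answers to an open-ended question
codes = ["taste"] * 12 + ["price"] * 6 + ["other"] * 2
share = other_share(codes)
print(f"{share:.0%}")  # 10%

# If the share exceeds 10%, the category scheme probably needs more codes
needs_new_categories = share > 0.10
```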
Content Analysis for open ended questions
• A qualitative technique used to analyze the text provided in response to open-ended questions
• It systematically and objectively derives categories of responses that represent homogeneous thoughts and opinions
• It identifies responses particularly relevant to the survey
• It requires the researcher to name categories through a detailed examination of the data (as opposed to pre-coding)
• It is an iterative interpretation process of first reading the responses, then rereading them to establish meaningful categories
• The number and meaning of the categories are further refined so that they are most representative of the respondents’ text
• Each response is classified into as many categories as necessary to capture the full picture
• Responses that are out of context for the question are not coded
A codebook contains the coding rules and the necessary information about each variable in the survey. A codebook generally contains the following information (example values in parentheses):
• question number (3)
• variable number (4)
• variable name (Brand)
• instructions for coding (1 = Amul, 2 = Cadbury, 3 = Nestle)
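The codebook entry above maps naturally onto a small lookup structure. The sketch below stores the same example values (question 3, variable 4, Brand, codes 1 to 3); the dictionary layout and the `encode` helper are assumptions for illustration.

```python
# One codebook entry, using the example values from the slide
codebook = {
    "Brand": {
        "question_number": 3,
        "variable_number": 4,
        "codes": {1: "Amul", 2: "Cadbury", 3: "Nestle"},
    }
}

def encode(variable, answer, codebook=codebook):
    """Translate a verbatim answer into its numeric code."""
    lookup = {label: code for code, label in codebook[variable]["codes"].items()}
    return lookup[answer]  # raises KeyError if the answer is not in the codebook

coded = [encode("Brand", a) for a in ["Cadbury", "Amul", "Nestle", "Amul"]]
print(coded)  # [2, 1, 3, 1]
```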
• “Don’t know” (DK) is included among the possible answers
• Respondents choose it either because they genuinely don’t know or because they don’t want to answer
• A considerable number of DK responses may be generated for some questions
• The researcher can either ignore them or allocate their frequency to the other responses in the ratio in which those responses occur
• Example: “How many chocolates do you eat in a typical week?” yields 300 (<20) : 200 (>20) : 50 (DK)
• Allocating the 50 DK responses in the 300:200 ratio (30 and 20) gives 330 (<20) : 220 (>20)
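The proportional allocation of DK responses in the chocolate example can be reproduced in a few lines. This is a sketch of the arithmetic only; the function name `allocate_dk` is a hypothetical helper.

```python
def allocate_dk(counts, dk_key="DK"):
    """Spread the DK frequency over the other answers in proportion to their counts."""
    dk = counts.get(dk_key, 0)
    substantive = {k: v for k, v in counts.items() if k != dk_key}
    total = sum(substantive.values())
    return {k: v + dk * v / total for k, v in substantive.items()}

adjusted = allocate_dk({"<20": 300, ">20": 200, "DK": 50})
print(adjusted)  # {'<20': 330.0, '>20': 220.0}
```

The 50 DK responses are split 30 and 20 because the substantive answers occur in a 300:200 ratio, so the total of 550 respondents is preserved.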
• Data entry involves transferring coded data from questionnaires or coding sheets into computers through keypunching
• Data collected through CATI or CAPI are entered directly into the computer
• Besides keypunching, data can be transferred using optical scanning, mark sense forms, or computerised sensory analysis
• Optical scanners can read responses on questionnaires: they read darkened small circles and process the marked answers. They are used for marking answer papers in competitive exams and for transcribing UPC data at supermarket checkout counters
• Mark sense forms require responses to be recorded with a special pencil in a predesignated area coded for that response. The data can then be read by a machine
• Computerised sensory analysis automates the data collection process. Questions appear on a computerised gridpad, and responses are recorded directly into the computer using a sensing device
Data Cleaning
Consistency checks identify data that are out of range, logically inconsistent, or have extreme values.
– Computer packages like SPSS, SAS, EXCEL and MINITAB can be programmed to identify out-of-range values for each variable and print out the respondent code, variable code, variable name, record number, column number, and out-of-range value.
– Extreme values should be closely examined.
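An out-of-range check of the kind described above can be sketched without a statistics package. The valid ranges, variable names, and record layout below are hypothetical; the report fields mirror some of those mentioned (respondent code, record number, variable name, offending value).

```python
# Hypothetical valid ranges: Brand is coded 1-3, Satisfaction is a 7-point scale
VALID_RANGES = {"Brand": (1, 3), "Satisfaction": (1, 7)}

def out_of_range(records, ranges=VALID_RANGES):
    """Return (respondent, record number, variable, value) for each out-of-range value."""
    flagged = []
    for record_number, record in enumerate(records, start=1):
        for variable, (low, high) in ranges.items():
            value = record.get(variable)
            if value is not None and not (low <= value <= high):
                flagged.append((record["respondent"], record_number, variable, value))
    return flagged

records = [
    {"respondent": "R001", "Brand": 2, "Satisfaction": 5},
    {"respondent": "R002", "Brand": 9, "Satisfaction": 7},
    {"respondent": "R003", "Brand": 1, "Satisfaction": 0},
]
flagged = out_of_range(records)
for row in flagged:
    print(row)
```

Missing values (`None`) are deliberately not flagged here, since they are handled separately from out-of-range values.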
Data Cleaning
• Substitute a Neutral Value – A neutral value, typically the mean response to the variable, is substituted for the missing responses.
• Substitute an Imputed Response – The respondent's pattern of responses to other questions is used to impute or calculate a suitable response to the missing questions.
• In casewise deletion, cases (respondents) with any missing responses are discarded from the analysis.
• In pairwise deletion, instead of discarding all cases with any missing values, the researcher uses only the cases or respondents with complete responses for each calculation.
• After data cleaning, the computer data file is deemed clean and ready for analysis.
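The difference between mean substitution, casewise deletion, and pairwise deletion is easiest to see on a small data set. The sketch below uses invented ratings with `None` marking a missing response; the helper names are hypothetical.

```python
from statistics import mean

# Hypothetical ratings; None marks a missing response
data = [
    {"q1": 5, "q2": 3, "q3": 4},
    {"q1": 4, "q2": None, "q3": 2},
    {"q1": None, "q2": 2, "q3": 5},
    {"q1": 3, "q2": 4, "q3": 3},
]

def mean_substitution(rows, var):
    """Replace missing values of one variable with the mean of its observed values."""
    neutral = mean(r[var] for r in rows if r[var] is not None)
    return [r[var] if r[var] is not None else neutral for r in rows]

def casewise(rows):
    """Keep only respondents with no missing responses at all."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, var_a, var_b):
    """Keep every respondent who answered both variables used in one calculation."""
    return [(r[var_a], r[var_b]) for r in rows
            if r[var_a] is not None and r[var_b] is not None]

q2_filled = mean_substitution(data, "q2")
complete_cases = casewise(data)
q1_q3_pairs = pairwise(data, "q1", "q3")
print(q2_filled, len(complete_cases), len(q1_q3_pairs))
```

Casewise deletion leaves two of the four respondents, whereas a q1 versus q3 calculation under pairwise deletion can still use three, which is why pairwise deletion retains more data when missing values are scattered across variables.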