SADC Course in Statistics

Module I2 Session 1 and 2

Transferring data from paper form to electronic files

Objectives

By the end of this session you will know how to:

- Determine how many variables you need
- Recognise different types of question
- Distinguish between data types
- Assign unique identifiers
- Create a structured and documented dataset

Introduction

This session is about transferring your questionnaire responses into a well-structured and well-documented dataset on the computer. This includes determining how many variables you need and assigning variable names and data types. The documentation for your dataset should be complete enough that a user can fully understand the data without further explanation.

Assigning variables

Before starting to enter data into the computer you must decide on the number of variables you need. You may be tempted to think in terms of one variable per question. This may be true for some simple questions, for example:

    1. Sex of respondent    Male [ ]1    Female [ ]2

However, many questions generate several pieces of information and, as a variable can only store one piece of information, you would need several variables for such questions. For example, consider the following question:

    3a. How many TIP packs in total did you get in this household in…
        October:  ____________
        November: ____________
        December: ____________
        January:  ____________
        February: ____________

This question generates 5 pieces of information and you would thus need 5 variables to store these data.

The important point to remember is that a variable can store just one piece of information. Some questions generate several pieces of information and therefore need several variables.

Naming variables

Many packages (particularly older versions of packages) have restrictions on the length of variable names. A common restriction is no more than 8 characters.
For ease of transfer between packages you are strongly advised to keep variable names to a maximum of 8 characters. With large questionnaires it can be difficult to think up enough unique variable names with just 8 characters, but there are some naming conventions you may find useful.

One-up numbers

In this system, each variable is numbered 1 through n, the total number of variables. Since most computer packages do not handle variable names starting with a digit, the usual format is V1 (or V0001) … Vn. This approach has the advantage of simplicity and the disadvantage of lack of information. Even though almost any standard package will allow extended labels for variables, so that one can append the information that V0023 is really “Q2. Age of respondent”, the number system is prone to error.

Question numbers

It is also possible to name variables corresponding to question numbers, e.g. Q1, Q2a, Q2b, … Qn. This approach has the advantage of relating directly to the original questionnaire but, like one-up numbers, it has the disadvantage of not being easily remembered. Further, as we have already mentioned, it is not uncommon for a single question to generate several distinct variables.

Mnemonic names

As alluded to above, names chosen to represent the meaning of the actual variable have some advantages, principally in that they are recognisable and memorable. However, there are disadvantages. Firstly, what is an “obvious” abbreviation to the person who created it may not be obvious to a new user. Secondly, with only 8 characters to work with it is not all that easy to create names with immediately recognisable referents. Finally, it is difficult to maintain consistency across variables that share common content, e.g. always to use ED for education.
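The naming advice so far can be turned into a quick check before data entry starts. The following is a minimal sketch in Python, not part of the original course material: the function name `check_names` and the sample names are hypothetical, and treating names that differ only in case as duplicates is an assumption that suits many, but not all, packages.

```python
def check_names(names, max_len=8):
    """Return a dict mapping each problematic name to a list of problems."""
    problems = {}
    seen = set()
    for name in names:
        issues = []
        if len(name) > max_len:
            issues.append(f"longer than {max_len} characters")
        if name[:1].isdigit():
            issues.append("starts with a digit")
        # Assumption: names are compared without regard to case.
        key = name.upper()
        if key in seen:
            issues.append("duplicate name")
        seen.add(key)
        if issues:
            problems[name] = issues
    return problems


# Hypothetical candidate names for illustration only.
candidates = ["V0023", "Q2a", "Q2b", "HOUSEHOLDSIZE", "2NDCROP", "q2a"]
result = check_names(candidates)
for name, issues in result.items():
    print(name, "->", ", ".join(issues))
```

Running such a check on the full list of planned names, before any data are entered, is much cheaper than renaming variables after the data have been transferred to another package.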
Prefix, root, suffix systems

A more systematic version of the previous approach is to think of each variable name as containing a root, possibly a prefix, and possibly a suffix. For example, suppose all variables to do with education had the root ED. Mother’s education would then be MOED, father’s education FAED, etc. Suffixes are often used to indicate the wave of data in longitudinal studies, the form of a question, or other such information. Implementing this system requires prior planning to establish a list of standard two- or three-letter abbreviations.

It is important to remember that the variable name is the referent that researchers use most often when working with the data. At a minimum it should not convey incorrect information and ideally it should be unambiguous in terms of content.

Different types of question

Keeping in mind that one variable is needed for each piece of information, we will look at different types and formats of question you might come across in a typical questionnaire and determine the number of variables needed in each case.

Multiple Choice

Multiple choice is very common in questionnaires and we have already seen an example of this above in the “Sex of respondent” question, where the choices were “Male” or “Female”. In this type of question only one response is expected and therefore you would need only one variable to store the data. Consider the following example:

    3. Marital status
       Single (never married)                                      [ ]1
       Married with husband/wife living in household               [ ]2
       Married with husband/wife temporarily living/working away   [ ]3
       Divorced/separated                                          [ ]4
       Widow/Widower                                               [ ]5

This is another multiple choice question and we expect just a single response, so again we need just one variable for this piece of data.

One point worth mentioning here is that the choices available in multiple choice questions should be mutually exclusive.
This is really something to think about at the questionnaire design stage, but it has implications when assigning variables if it was not thought through earlier. Consider the “Marital status” question and assume you wanted to include another choice, “Cohabiting with partner [ ]6”. With this extra option a respondent who was separated from his/her first partner but living with a new partner could feasibly tick both options. You would then have to decide which response to record, or whether to add an extra variable and record both responses.

Multiple Response

Consider the following question:

    5. Which months did you not have cassava to eat from your own production?
       May 2002     [ ]    September 2002  [ ]    January 2003   [ ]
       June 2002    [ ]    October 2002    [ ]    February 2003  [ ]
       July 2002    [ ]    November 2002   [ ]    March 2003     [ ]
       August 2002  [ ]    December 2002   [ ]    April 2003     [ ]

This question might only generate a single piece of information but at the other extreme it could generate 12 pieces of information. In the data file we need to allow for the maximum possible number of responses and therefore would need 12 variables for this question.

Dates

Dates are often a source of confusion in questionnaires, mainly because of the different formats used in different parts of the world. In some countries the format is dd/mm/yyyy whereas in others the month is always put first. This means that 09/02/1959 could be interpreted as 9th February or 2nd September. There are also some countries where the year is put first. The interpretation of a date may also depend on the settings on your computer. To avoid any confusion we strongly recommend that dates are always split into three variables (fields): day, month and year.
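The ambiguity described above is easy to demonstrate. The following is a minimal sketch using Python's standard `datetime` module; the helper name `build_date` is hypothetical and is only one possible way of combining three separate fields into a single date.

```python
from datetime import date, datetime

text = "09/02/1959"

# The same string read under two conventions gives two different dates.
day_first = datetime.strptime(text, "%d/%m/%Y").date()
month_first = datetime.strptime(text, "%m/%d/%Y").date()
print(day_first.isoformat())    # 1959-02-09
print(month_first.isoformat())  # 1959-09-02


def build_date(day, month, year):
    """Combine three separate fields into one date, or return None if invalid."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


print(build_date(9, 2, 1959))   # 1959-02-09
print(build_date(30, 2, 1959))  # None, because 30 February does not exist
```

With day, month and year stored as separate variables there is nothing left to interpret: the single date is built explicitly, and impossible combinations are caught at that point.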
It should be possible to include checks on the variables so that the day is between 1 and 31, the month is between 1 and 12 and the year is within a fixed range. A single date field can be generated afterwards using the software, and you should also be able to include checks for invalid dates so that, for example, 30th February would not be allowed. It is also recommended that you always use 4 digits for the year.

In some cases you are not interested in the exact date of something but just the month and the year. For example:

    3a. In which month did the maize run out?
        Year ______    Month ______    Not applicable [ ]

For the question above we would need 3 variables: the first would contain the month and would be a value between 1 and 12, the second would be for the year, and the third would hold the response from the “Not applicable” check box.

Dealing with “Tables” in the questionnaire

In this context “table” refers to a grid of rows and columns generally provided for the answers to a single question in the questionnaire. There are two possible types of table in a questionnaire. The first is where the number of rows is fixed. For this type of table both the rows and the columns are uniquely labelled. Consider the following question:

    3. How much did you produce in the last harvest (2002) for the following crops?

       Crop name            Quantity produced    Unit of measurement    Number of Kg per unit
       Tobacco
       European potatoes
       Millet
       Sorghum

This table has 12 cells in total (4 rows x 3 columns) and the number of rows is fixed at 4. The easiest way to deal with this table is to assign 12 unique variables for the responses, one for each cell in the table.

Now consider the following question:

    8. Now, I am going to ask you about each member of the household who has a garden.
       Please specify for each member of the household who cultivates a garden

       Individual       Size of garden        Received a TIP pack this    Received a TIP pack last
       member number    cultivated (acres)    year (2002-03 main          year (2001-02 main
                                              season)?                    season)?
       _____________    __________________    Yes [ ]1  No [ ]2           Yes [ ]1  No [ ]2
       _____________    __________________    Yes [ ]1  No [ ]2           Yes [ ]1  No [ ]2
       _____________    __________________    Yes [ ]1  No [ ]2           Yes [ ]1  No [ ]2
       _____________    __________________    Yes [ ]1  No [ ]2           Yes [ ]1  No [ ]2
       _____________    __________________    Yes [ ]1  No [ ]2           Yes [ ]1  No [ ]2

In this table we may have responses from one or more household members, so the number of rows is not fixed. Each row of data is about a different household member. The data are at a different “level” from the data in the rest of the questionnaire and must therefore be stored in a separate file or a separate worksheet. A later session in this course goes into more detail about dealing with data at different levels, but for now just keep in mind that when the number of rows is not fixed you will need to store the data separately.

Other, please specify

Many questions have code boxes listing all the options. It is often difficult to cover all possible responses in the codes and in many cases a “catch-all” option of “Other” is included. For example:

    From where did you buy your fertiliser?
    <Enter code from code box 01>    [ __ ]
    If other, please specify: _________________________________________

    Code Box 01: Market Source
    1=TIP from voucher           4=Gift                             7=Purchased from other farmer
    2=TIP purchases              5=Purchased from Admarc            8=Project/NGO
    3=Own (produced on farm)     6=Purchased from private trader    9=Other, specify

For this question you would need to allow 2 variables. The first would contain the relevant code matching the response and the second would be the text response relating to cases where code 9 was used. Note that when the codes 1 to 8 are used the text variable will be empty.
This may lead you to ask why we can’t simply store either the code 1 to 8 or the text response in the same variable. The reason is that we cannot have different data types in the same variable, and this leads us on to the next topic.

Data types

Once you have determined how many variables you need, you then need to think about the data type for each of the variables. In some software packages, spreadsheets for example, it is possible to have both numbers and text in the same column. However, this is not recommended and most statistics packages and databases don’t allow it. When using spreadsheets you will need to discipline yourself not to mix data types in a column, otherwise you will encounter problems when you come to transfer the data to other packages.

Numeric data

The most common type of data in quantitative research is numeric data and this is the easiest data to analyse. Consider the following question:

    7. How much land is your household cultivating this season (2002-03)?    ______ acres in total

The response to this question could be almost any positive number, including decimals. Decimal numbers bring their own problems, particularly in the placing of the decimal point during recording and during data entry. Any outliers, that is extremely high or extremely low values, should be checked. It may be possible to assign upper and lower limits for such questions based on what is considered feasible.

Now consider the following question:

    4. Household size (Number of people living in the household now, including children and servants)

Clearly decimals are not valid for this question and, although in theory almost any positive integer could be a valid response, in practice you are likely to be able to set an upper limit for this question.

The final type of question for which we use numeric data is one where the possible responses have been coded.
We have already seen some examples of this and below we show another example.

    4. Was your income from sales of maize in 2002 … read the options … as previous year?
       More [ ]1    Less [ ]2    Same amount [ ]3    Don’t know [ ]4

Note that in this example and in the earlier ones each possible response has been assigned a code. Here the codes were assigned during the questionnaire development, but often this does not happen and you will need to assign codes yourself before transferring the data to computer.

As part of the documentation for the dataset you will need to include the labels for the codes. In some statistics packages you can assign value labels to the variables; in spreadsheets you might want to consider having a separate worksheet in your data file which contains the codes and their labels; or you might have a separate document accompanying your dataset which acts as a data dictionary. Whichever method you use, this information must be recorded both for your own benefit and for the benefit of others.

Text data

Text data may be the responses from “Other, specify” questions, or general comments. You must ensure you have allocated enough space in your data file for the entire text string, because long text strings are sometimes truncated when the data are transferred between packages. Although text is often difficult to analyse, it is useful for capturing information that cannot easily be quantified.

Occasionally, especially in longitudinal studies, it is necessary to record names and addresses of respondents. These data clearly need to be stored in text variables. However, for everyday use you are advised to assign a unique code to each respondent and to use only this code in general data files. You would then have a separate file linking the names and addresses to the codes. This approach has the double advantage of making the data file easier to handle and of helping to safeguard the personal data (names and addresses).
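The idea of a general data file plus a separate linking file can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the function name `split_personal_data`, the identifier name `RESPID` and all of the records are hypothetical and are not taken from the questionnaire.

```python
def split_personal_data(records, personal_fields=("name", "address"),
                        id_field="RESPID"):
    """Split records into a general data file and a separate linking file.

    Each respondent receives a unique code (1, 2, 3, ...). Personal fields
    go only into the linking file; everything else goes into the data file.
    """
    data_rows, link_rows = [], []
    for number, record in enumerate(records, start=1):
        data = {id_field: number}
        link = {id_field: number}
        for field, value in record.items():
            if field in personal_fields:
                link[field] = value
            else:
                data[field] = value
        data_rows.append(data)
        link_rows.append(link)
    return data_rows, link_rows


# Hypothetical records, invented for illustration only.
raw_records = [
    {"name": "Respondent A", "address": "Address A", "RESPSEX": 1, "AGEYRS": 31},
    {"name": "Respondent B", "address": "Address B", "RESPSEX": 1, "AGEYRS": 40},
    {"name": "Respondent C", "address": "Address C", "RESPSEX": 2, "AGEYRS": 45},
]

data_rows, link_rows = split_personal_data(raw_records)
print(data_rows[0])  # {'RESPID': 1, 'RESPSEX': 1, 'AGEYRS': 31}
print(link_rows[0])  # {'RESPID': 1, 'name': 'Respondent A', 'address': 'Address A'}
```

Only the linking file contains names and addresses, so it can be stored with restricted access while the general data file is used for analysis.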
All personal data should be treated confidentially and only those who need to know the details, for example the interviewers, should be told. Under no circumstances should personal data be released into the public domain.

Yes/No variables

Consider the following question:

    1. What food crops do you grow? (Tick all that apply)
       Maize   [ ]    Sorghum         [ ]    Rice     [ ]    Banana             [ ]
       Millet  [ ]    Sweet potatoes  [ ]    Cassava  [ ]    European potatoes  [ ]

For this question we would need 8 variables, one for each possible response. For each crop the respondent can either tick the box or leave it empty, so there are two possible values for each variable. In some database packages there is a data type called “Boolean”, also known in some cases as the “Yes/No” data type. This data type is ideal for this sort of question. However, consider the following variation of this question:

    1. What food crops do you grow?
       Maize:    Yes [ ]1  No [ ]2        Sorghum:            Yes [ ]1  No [ ]2
       Rice:     Yes [ ]1  No [ ]2        Banana:             Yes [ ]1  No [ ]2
       Millet:   Yes [ ]1  No [ ]2        Sweet potatoes:     Yes [ ]1  No [ ]2
       Cassava:  Yes [ ]1  No [ ]2        European potatoes:  Yes [ ]1  No [ ]2

Here again we would need 8 variables to store the responses, but this time the Boolean data type is not suitable as there are three possible responses for each crop: the “Yes” box can be ticked, the “No” box can be ticked, or both boxes can be left blank, indicating a missing value. (If both boxes were ticked this would be a mistake.) The Boolean data type does not distinguish between a “No” response and a missing value. For this question the variables would be numeric and the value would be restricted to 1 or 2.

Unique Identifiers

In every dataset there must be a way of uniquely identifying a particular record. This might be a single variable or a combination of variables. In databases this is often referred to as the “Primary key”.
Earlier we talked about assigning a unique code to individual respondents in lieu of including names and addresses. In the same way you might have a unique code for each household, each plot of land, etc.

As we have said, the unique identifier might be a single variable or a combination of variables. Consider the following data set:

    VILLCODE  HOUSENO  RESPSEX  AGEYRS  HHHEAD  ACTHEAD
       1         1        1       31      1
       1         2        1       40      2        1
       1         3        2       45      1
       1         4        1       60      1
       2         5        1       29      1
       2         6        2       25      2        1
       2         7        2       20      2        2

For this set of data households are numbered uniquely, so HOUSENO can be defined as the unique identifier. Now consider the following data set:

    VILLCODE  HOUSENO  RESPSEX  AGEYRS  HHHEAD  ACTHEAD
       1         1        1       31      1
       1         2        1       40      2        1
       1         3        2       45      1
       1         4        1       60      1
       2         1        1       29      1
       2         2        2       25      2        1
       2         3        2       20      2        2

Here households are numbered within villages and HOUSENO is not unique by itself. To identify a single record (household) you need both the village code and the household number. Thus the combination of VILLCODE and HOUSENO forms the unique identifier.

Exercise

On the following pages you will find six completed copies of Section B of the questionnaire, each containing typical responses. Your task is as follows:

- Determine how many data files you would need for these data.
- Determine how many variables you would need to store the responses for this section.
- Create a table with the following columns: variable name, variable label, data type, value labels (codes).
- Indicate which variable or combination of variables would make the unique identifier. Do you need to add any extra variables?
- Create an Excel workbook for these data. Include a worksheet for each data file you identified in the first point above. Give appropriate names to the worksheets. Set appropriate validation rules and formats for the columns.
- Include in your workbook a worksheet containing a list of the codes to be used in the data.
  In a later session you will be shown how to link these codes with the data.

[The six completed copies of Section B of the questionnaire appear at this point in the original handout.]

Solution

Data file for Households

    Name      Label                                                      Data type          Value labels
    HHID      Household ID                                               Numeric (Integer)  N/A
    RESPSEX   Sex of respondent                                          Numeric (Byte)     1=Male, 2=Female
    RESPAGE   Age of respondent                                          Numeric (Byte)     N/A
    RESPAGGP  Age group of respondent                                    Numeric (Byte)     1=Young, 2=Middle aged, 3=Old
    MARITAL   Marital status                                             Numeric (Byte)     Codes 1 to 5 (listed below)
    HHSIZE    Household size                                             Numeric (Byte)     N/A
    HHHEAD    Are you the head of the household?                         Numeric (Byte)     1=Yes, 2=No
    DEFACTO   Are you the acting head?                                   Numeric (Byte)     1=Yes, 2=No
    HEADSEX   What is the sex of the head of household?                  Numeric (Byte)     1=Male, 2=Female
    CULTLAND  How much land is your household cultivating this season?   Numeric (Decimal)  N/A
    DAMBOLND  How much dambo/irrigated land does your household have?    Numeric (Decimal)  N/A
    TOTLAND   How much land does your household have in total?           Numeric (Decimal)  N/A

Value labels for MARITAL: 1=Single (never married), 2=Married with husband/wife living in household, 3=Married with husband/wife temporarily living/working away, 4=Divorced/separated, 5=Widow/widower.

The unique identifier for the first table is HHID.

Data file for Household members

    Name      Label                               Data type          Value labels
    HHID      Household ID                        Numeric (Integer)  N/A
    INDIVID   Individual member number            Numeric (Byte)     N/A
    GARDSIZE  Size of garden cultivated (acres)   Numeric (Decimal)  N/A
    TPTHISYR  Received a TIP pack this year?      Numeric (Byte)     1=Yes, 2=No
    TPLASTYR  Received a TIP pack last year?      Numeric (Byte)     1=Yes, 2=No

The unique identifier for the second table is the combination of HHID and INDIVID. This assumes individuals will be counted within households.
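Whether a chosen variable, or combination of variables, really is a unique identifier can be checked once the data have been entered. The following is a minimal sketch in Python; the function name `duplicate_keys` and the sample rows are hypothetical, with only the variable names HHID, INDIVID and GARDSIZE taken from the solution above.

```python
from collections import Counter


def duplicate_keys(rows, key_fields):
    """Return the key combinations that occur more than once in rows."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical household-member records: individuals are numbered
# within households, so INDIVID restarts at 1 in each household.
members = [
    {"HHID": 1, "INDIVID": 1, "GARDSIZE": 0.5},
    {"HHID": 1, "INDIVID": 2, "GARDSIZE": 1.0},
    {"HHID": 2, "INDIVID": 1, "GARDSIZE": 0.25},
]

print(duplicate_keys(members, ["INDIVID"]))          # [(1,)]
print(duplicate_keys(members, ["HHID", "INDIVID"]))  # []
```

An empty list means every record has a distinct key. In this sketch INDIVID alone is repeated, while the combination of HHID and INDIVID identifies each record uniquely.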