Transferring data from paper form to electronic files

SADC Course in Statistics
Module I2 Session 1&2
Objectives
By the end of this session you will know how to:
- Determine how many variables you need
- Recognise different types of question
- Distinguish between data types
- Assign unique identifiers
- Create a structured and documented dataset
Introduction
This session is about transferring your questionnaire responses into a well-structured and
well-documented dataset on the computer. This includes determining how many variables
you need and assigning variable names and data types. The documentation for your dataset
should be such that a user needs no further explanation for them to fully understand what
the data are about.
Assigning variables
Before starting to enter data into the computer you must decide on the number of variables
you need. You may be tempted to think in terms of one variable per question. This may
be true for some simple questions, for example:
1. Sex of respondent
Male [ ]1 Female [ ]2
However, many questions generate several pieces of information and, as a variable can only
store one piece of information, you would need several variables for such questions. For
example consider the following question:
3a. How many TIP packs in total did you get in this household in…
October:  ____________    January:  ____________
November: ____________    February: ____________
December: ____________
This question generates 5 pieces of information and you would thus need 5 variables to
store these data.
The important point to remember is that a variable can store just one piece of information.
Some questions generate several pieces of information and therefore need several variables.
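The counting rule can be sketched in a few lines of Python. This is only an illustration: the variable names (RESPSEX, TIPOCT and so on) are assumptions made for this example and are not prescribed by the questionnaire.

```python
# Map each question to the variables needed to hold its answers.
# The variable names are hypothetical; only the counts come from the text.
QUESTION_VARIABLES = {
    "1. Sex of respondent": ["RESPSEX"],
    "3a. TIP packs received per month": [
        "TIPOCT", "TIPNOV", "TIPDEC", "TIPJAN", "TIPFEB",
    ],
}


def count_variables(question_variables):
    """Return the number of variables needed for each question."""
    return {q: len(names) for q, names in question_variables.items()}


counts = count_variables(QUESTION_VARIABLES)
total_variables = sum(counts.values())
```

Listing the variables question by question in this way, before any data entry starts, makes it easy to see the total number of columns the data file will need.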
Naming variables
Many packages (particularly older versions of packages) have restrictions on the length of
variable names. A common restriction is no more than 8 characters. For ease of transfer
between packages you are strongly advised to keep variable names to a maximum of 8
characters. With large questionnaires it can be difficult to think up enough unique variable
names with just 8 characters, but there are some naming conventions you may find useful.
One-up numbers
In this system, each variable is numbered 1 through n, the total number of variables. Since
most computer packages do not handle variable names starting with a digit, the usual
format is V1 (or V0001) … Vn. This approach has the advantage of simplicity and the
disadvantage of lack of information. Even though almost any standard package will allow
extended labels for variables so that one can append information that V0023 is really “Q2.
Age of respondent”, the number system is prone to error.
Question numbers
It is also possible to name variables corresponding to question numbers, e.g. Q1, Q2a,
Q2b, … Qn. This approach has the advantage of relating directly to the original
questionnaire, but, like one-up numbers, it has the disadvantage of not being easily
remembered. Further, as we have already mentioned, it is not uncommon for a single
question to generate several distinct variables.
Mnemonic names
As alluded to above, names chosen to represent the meaning of the actual variable have
some advantages, principally in that they are recognisable and memorable. However, there
are disadvantages. Firstly, what is an “obvious” abbreviation to the person who created it
may not be obvious to a new user. Secondly, with only 8 characters to work with it is not
all that easy to create names with immediately recognisable referents. Finally, it is difficult
to maintain consistency across variables that share common content, e.g. always to use ED
for education.
Prefix, root, suffix systems
A more systematic version of the previous point is to think of each variable name as
containing a root, possibly a prefix, and possibly a suffix. For example, suppose all
variables to do with education had the root ED. Mother’s education would then be
MOED, father’s education FAED etc. Suffixes are often used to indicate the wave of
data in longitudinal studies, the form of a question, or other such information.
Implementing this system requires prior planning to establish a list of standard two- or
three-letter abbreviations.
It is important to remember that the variable name is the referent that researchers use most
often when working with the data. At a minimum it should not convey incorrect
information and ideally it should be unambiguous in terms of content.
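A prefix, root, suffix system is easy to apply consistently if the names are built and checked by a small routine. The sketch below is one possible approach in Python. The function names and the exact validity rule (letters and digits only, starting with a letter, at most 8 characters) are assumptions for illustration; the rules of your own package may differ.

```python
import re

MAX_NAME_LENGTH = 8  # the common restriction described above


def make_name(root, prefix="", suffix=""):
    """Build an upper-case variable name as prefix + root + suffix."""
    return f"{prefix}{root}{suffix}".upper()


def is_valid_name(name):
    """Check the name starts with a letter, uses only letters and digits,
    and is no longer than MAX_NAME_LENGTH characters."""
    pattern_ok = re.fullmatch(r"[A-Z][A-Z0-9]*", name) is not None
    return pattern_ok and len(name) <= MAX_NAME_LENGTH


mother_education = make_name("ED", prefix="MO")          # MOED, as in the text
father_education = make_name("ED", prefix="FA")          # FAED, as in the text
wave2_mother = make_name("ED", prefix="MO", suffix="2")  # hypothetical wave suffix
```

Keeping the list of standard roots and prefixes in one place, and generating every name from it, helps to maintain consistency across variables that share common content.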
Different types of question
Keeping in mind that one variable is needed for each piece of information, we will look at
different types and formats of question you might come across in a typical questionnaire
and determine the number of variables needed in each case.
Multiple Choice
Multiple choice is very common in questionnaires and we have already seen an example of
this above in the “Sex of respondent” question where the choices were “Male” or “Female”. In
this type of question only one response is expected and therefore you would need only one
variable to store the data. Consider the following example:
3. Marital status
Single (never married) [ ]1
Married with husband/wife living in household [ ]2
Married with husband/wife temporarily living/working away [ ]3
Divorced/separated [ ]4
Widow/Widower [ ]5
This is another multiple choice question and we expect just a single response so again we
need just one variable for this piece of data. One point that is perhaps worth mentioning
here is to make sure that the choices available in multiple choice questions are mutually
exclusive. This is really something to think about at the questionnaire design stage but
does have implications at the stage of assigning variables if it has not been thought through
enough earlier on. Consider the “Marital status” question and let’s assume you wanted to
include another choice which is “Cohabiting with partner [ ]6”. With this extra option a
respondent who was separated from his/her first partner but living with a new partner
could feasibly tick both options. You would then be faced with the decision about which
response to record or whether to add an extra variable and record both responses.
Multiple Response
Consider the following question:
5. Which months did you not have cassava to eat from your own production?
May 2002 [ ]       September 2002 [ ]    January 2003 [ ]
June 2002 [ ]      October 2002 [ ]      February 2003 [ ]
July 2002 [ ]      November 2002 [ ]     March 2003 [ ]
August 2002 [ ]    December 2002 [ ]     April 2003 [ ]
This question might only generate a single piece of information but at the other extreme it
could generate 12 pieces of information. In the data file we need to allow for the
maximum possible number of responses and therefore would need 12 variables for this
question.
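As an illustration, the twelve tick boxes can be turned into twelve variables as in the Python sketch below. The variable names NOCAS01 to NOCAS12 and the coding of 1 for a ticked box and 0 for an unticked box are assumptions made for this example.

```python
# The twelve months listed in question 5, in calendar order.
MONTHS = [
    "May 2002", "June 2002", "July 2002", "August 2002",
    "September 2002", "October 2002", "November 2002", "December 2002",
    "January 2003", "February 2003", "March 2003", "April 2003",
]

# Hypothetical variable names NOCAS01 ... NOCAS12, one per month.
VARIABLE_NAMES = [f"NOCAS{i:02d}" for i in range(1, len(MONTHS) + 1)]


def encode_ticks(ticked_months):
    """Return one variable per month: 1 if the box was ticked, 0 if not."""
    unknown = set(ticked_months) - set(MONTHS)
    if unknown:
        raise ValueError(f"Unrecognised months: {sorted(unknown)}")
    return {
        name: int(month in ticked_months)
        for name, month in zip(VARIABLE_NAMES, MONTHS)
    }


record = encode_ticks({"January 2003", "February 2003"})
```

Whatever the respondent ticks, the record always contains all twelve variables, so every household has the same columns in the data file.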
Dates
Dates are often a source of confusion in questionnaires mainly because of the different
formats used in different parts of the world. In some countries the format is dd/mm/yyyy
whereas in others the month is always put first. This means that 09/02/1959 could be
interpreted as 9th February or 2nd September. There are also some countries where the year
is put first. The interpretation of a date may also depend on the settings on your computer.
To avoid any confusion we strongly recommend that dates are always split into three
variables (fields) – day, month and year. It should be possible to include checks on the
variables so that the day is between 1 and 31, the month is between 1 and 12 and the year is
within a fixed range. A single date field can be generated afterwards using the software and
you should also be able to include checks for invalid dates so for example dates such as 30th
February would not be allowed. It is also recommended that you always use 4 digits for the
year.
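The checks described above might look like the following in Python. This is a minimal sketch: the function names and the year limits are assumptions, and the limits should be set to suit your own survey.

```python
from datetime import date

# Assumed limits for illustration; adjust to the study in question.
MIN_YEAR, MAX_YEAR = 1900, 2003


def check_date_parts(day, month, year):
    """Return a list of problems found in the three date variables."""
    problems = []
    if not 1 <= day <= 31:
        problems.append("day must be between 1 and 31")
    if not 1 <= month <= 12:
        problems.append("month must be between 1 and 12")
    if not MIN_YEAR <= year <= MAX_YEAR:
        problems.append(f"year must be between {MIN_YEAR} and {MAX_YEAR}")
    return problems


def combine_date(day, month, year):
    """Build a single date, or return None if the parts fail the checks
    or do not form a real calendar date (for example 30 February)."""
    if check_date_parts(day, month, year):
        return None
    try:
        return date(year, month, day)
    except ValueError:
        return None
```

Because the day, month and year are passed separately, there is no ambiguity about which number is which. A two-digit year such as 59 is rejected by the range check rather than silently accepted.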
In some cases you are not interested in the exact date of something but just the month and
the year. For example:
3a. In which month did the maize run out?
Year ________    Month ________    Not applicable [ ]
For the question above we would need 3 variables – the first would contain the month and
would be a value between 1 and 12, the second would be for the year and we would need a
third variable to hold the response from the “Not applicable” check box.
Dealing with “Tables” in the questionnaire
In this context table refers to a grid of rows and columns generally provided for the answers
to a single question in the questionnaire. There are two possible types of table in a
questionnaire. The first is where the number of rows is fixed. For this type of table both
the rows and the columns are uniquely labelled. Consider the following question:
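3. How much did you produce in the last harvest (2002) for the following crops?
Crop name            Quantity produced    Unit of measurement    Number of Kg per unit
Tobacco              ________             ________               ________
European potatoes    ________             ________               ________
Millet               ________             ________               ________
Sorghum              ________             ________               ________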
This table has 12 cells in total (4 rows x 3 columns) and the number of rows is fixed at 4.
The easiest way to deal with this table is to assign 12 unique variables for the responses –
one for each cell in the table.
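One way to generate the 12 variable names systematically is to combine a short code for each crop with a short code for each column. The codes used in this Python sketch (TOB, QTY and so on) are hypothetical and chosen only to keep each name within 8 characters.

```python
from itertools import product

# Hypothetical short codes for the four crops and three answer columns.
CROPS = {"TOB": "Tobacco", "POT": "European potatoes",
         "MIL": "Millet", "SOR": "Sorghum"}
COLUMNS = {"QTY": "Quantity produced", "UNIT": "Unit of measurement",
           "KGPU": "Number of Kg per unit"}

# One variable per cell: crop code followed by column code,
# with a descriptive label to go into the documentation.
cell_variables = {
    crop + col: f"{CROPS[crop]}: {COLUMNS[col]}"
    for crop, col in product(CROPS, COLUMNS)
}
```

The resulting dictionary holds a name and a label for each cell, for example MILUNIT for the unit of measurement used for millet.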
Now consider the following question:
8. Now, I am going to ask you about each member of the household who has a garden.
Please specify for each member of the household who cultivates a garden
Individual       Size of garden        Received a TIP pack this       Received a TIP pack last
member number    cultivated (acres)    year (2002-03 main season)?    year (2001-02 main season)?
________         ________              Yes [ ]1 No [ ]2               Yes [ ]1 No [ ]2
________         ________              Yes [ ]1 No [ ]2               Yes [ ]1 No [ ]2
________         ________              Yes [ ]1 No [ ]2               Yes [ ]1 No [ ]2
________         ________              Yes [ ]1 No [ ]2               Yes [ ]1 No [ ]2
________         ________              Yes [ ]1 No [ ]2               Yes [ ]1 No [ ]2
In this table we may have responses from one or more household members – the number
of rows is not fixed. Each row of data in this table is about a different household member
and the number of household members is not fixed. The data are at a different “level” to
the data in the rest of the questionnaire and must therefore be stored in a separate file or a
separate worksheet. A later session in this course goes into more detail about dealing with
data at different levels but for now just keep in mind that when the number of rows is not
fixed you will need to store the data separately.
Other, please specify
Many questions have code boxes listing all the options. It is often difficult to cover all
possible responses in the codes and in many cases a “catch-all” option of “Other” is
included. For example:
From where did you buy your fertiliser? <Enter code from code box 01> [ __ ]
If other, please specify: _________________________________________
Code Box 01 – Market Source
1=TIP from voucher          4=Gift                             7=Purchased from other farmer
2=TIP purchases             5=Purchased from Admarc            8=Project/NGO
3=Own (produced on farm)    6=Purchased from private trader    9=Other, specify
For this question you would need to allow 2 variables. The first would contain the relevant
code matching the response and the second would be the text response relating to cases
where code 9 was used. Note that when the codes 1 to 8 are used the text variable will be
empty. This may lead you to ask why we can’t simply store either the code 1 to 8 or the
text response in the same variable. This is because we cannot have different data types in
the same variable and this leads us onto the next topic.
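The relationship between the two variables can be checked at data entry. The Python sketch below shows one possible check; the function name and the wording of the messages are assumptions for illustration.

```python
from typing import Optional

OTHER_CODE = 9                   # "Other, specify" in Code Box 01
VALID_CODES = set(range(1, 10))  # codes 1 to 9


def check_source(code: int, other_text: Optional[str]) -> list:
    """Check the coded variable and its companion text variable together."""
    problems = []
    text = (other_text or "").strip()
    if code not in VALID_CODES:
        problems.append(f"code {code} is not in Code Box 01")
    elif code == OTHER_CODE and not text:
        problems.append("code 9 needs a text description")
    elif code != OTHER_CODE and text:
        problems.append("text should be empty unless code 9 is used")
    return problems
```

The code is always stored as a number and the description always as text, so each variable keeps a single data type.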
Data types
Once you have determined how many variables you need, you then need to think about the
data types for each of the variables. In some software packages – spreadsheets for example
– it is possible to have both number and text in the same column. However, this is not
recommended and most statistics packages and databases don’t allow this. When using
spreadsheets you will need to discipline yourself not to mix data types in a column
otherwise you will encounter problems when you come to transfer the data to other
packages.
Numeric data
The most common type of data in quantitative research is numeric data and this is the
easiest data to analyse. Consider the following question:
7. How much land is your household cultivating this season (2002-03) in total? ________ Acres
The response to this question could be almost any positive number including decimals.
Decimal numbers bring their own problems with them particularly in the placing of the
decimal point during recording and during data entry. Any outliers – that is extremely high
or extremely low values – should be checked. It may be possible to assign upper and lower
limits for such questions based on what is considered feasible.
Now consider the following question:
4. Household size (Number of people living in the household now,
including children and servants)
Clearly decimals are not valid for this question and although in theory almost any positive
integer could be a valid response, in practice you are likely to be able to set an upper limit
for this question.
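Range checks of this kind are simple to express in code. In the Python sketch below the upper limits are assumed values chosen only for illustration; realistic limits depend on the survey and the population.

```python
# Assumed limits for illustration; realistic bounds depend on the survey.
MAX_ACRES = 50.0
MAX_HOUSEHOLD_SIZE = 30


def is_number(value):
    """True for int or float values, but not for True/False."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_land(acres):
    """Land area may be a decimal but must be positive and feasible."""
    return is_number(acres) and 0 < acres <= MAX_ACRES


def check_household_size(size):
    """Household size must be a whole number of at least one person."""
    return is_number(size) and isinstance(size, int) and 1 <= size <= MAX_HOUSEHOLD_SIZE
```

A value that fails a check is not necessarily wrong, but it should be compared with the paper form before it is accepted. A misplaced decimal point, such as 250 entered for 2.5 acres, is a typical error caught in this way.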
The final type of question for which we use numeric data is one where the possible
responses have been coded. We have already seen some examples of this and below we
show another example.
4. Was your income from sales of maize in 2002 … read the options …as previous year?
More [ ]1    Less [ ]2    Same amount [ ]3    Don’t know [ ]4
Note in this example and in the earlier ones each possible response has been assigned a
code. Here the codes were assigned during the questionnaire development but often this
does not happen and you will need to assign codes yourself before transferring the data to
computer. As part of the documentation for the dataset you will need to include the labels
for the codes. In some statistics packages you can assign value labels to the variables; in
spreadsheets you might want to consider having a separate worksheet in your data file
which contains the codes and their labels; or you might have a separate document
accompanying your dataset which acts as a data dictionary. Whichever method you use, this information
must be recorded both for your own benefit and the benefit of others.
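A data dictionary can be as simple as a lookup table that holds the label, the data type and the value labels for each variable. The Python sketch below shows one possible layout for the income question; the variable name MZINCOME and the structure of the dictionary are assumptions for illustration.

```python
# Value labels for the income question, taken from the codes printed on the form.
# The variable name MZINCOME is hypothetical.
DATA_DICTIONARY = {
    "MZINCOME": {
        "label": "Income from sales of maize in 2002 compared with previous year",
        "type": "numeric",
        "values": {1: "More", 2: "Less", 3: "Same amount", 4: "Don't know"},
    },
}


def describe(variable, code):
    """Translate a stored code into its label using the data dictionary."""
    return DATA_DICTIONARY[variable]["values"].get(code, "Invalid code")
```

The data file itself then holds only the codes, and anyone reading it can look up their meaning.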
Text data
Text data may be the responses from “Other, specify” questions, or general comments.
You must ensure you have allocated enough space in your data file for the entire text string
– sometimes long text strings are truncated when the data are transferred between
packages. Although text is often difficult to analyse it is useful for capturing information
that cannot easily be quantified.
Occasionally, especially in longitudinal studies, it is necessary to record names and
addresses of respondents. These data clearly need to be stored in text variables. However,
for everyday use you are advised to assign a unique code to each respondent and to use
only this code in general data files. You would then have a separate file linking the names
and addresses to the codes. This approach has the double advantage of making the data
file easier to handle and of helping to safeguard the personal data (names and addresses).
All personal data should be treated confidentially and only those who need to know the
details – for example the interviewers – should be told. Under no circumstances should
personal data be released into the public domain.
Yes/No variables
Consider the following question:
1. What food crops do you grow? (Tick all that apply)
Maize [ ]
Sorghum [ ]
Rice [ ]
Banana [ ]
Millet [ ]
Sweet potatoes [ ]
Cassava [ ]
European potatoes [ ]
For this question we would need 8 variables – one for each possible response. For each
crop the respondent can either tick the box or leave it empty – thus there are two possible
values for each variable. In some database packages there is a data type called “Boolean”,
also known in some cases as “Yes/No” data type. This data type is ideal for this sort of
question. However, consider the following variation of this question:
1. What food crops do you grow?
Maize: Yes [ ]1 No [ ]2
Sorghum: Yes [ ]1 No [ ]2
Rice: Yes [ ]1 No [ ]2
Banana: Yes [ ]1 No [ ]2
Millet: Yes [ ]1 No [ ]2
Sweet potatoes: Yes [ ]1 No [ ]2
Cassava: Yes [ ]1 No [ ]2
European potatoes: Yes [ ]1 No [ ]2
Here again we would need 8 variables to store the responses but this time the Boolean data
type is not suitable as there are three possible responses for each crop; the “Yes” box can
be ticked, the “No” box can be ticked, or both boxes can be left blank indicating a missing
value. (If both boxes were ticked this would be a mistake.) The Boolean data type does
not distinguish between a “No” response and a missing value. For this question the
variables would be numeric and the value would be restricted to 1 or 2.
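The three possible outcomes for each crop can be sketched in Python as follows. The function name is an assumption, and None is used here to stand for a missing value; your own software will have its own way of representing missing data.

```python
YES, NO = 1, 2  # codes printed next to the boxes


def code_yes_no(yes_ticked, no_ticked):
    """Return 1 for Yes, 2 for No, or None when both boxes are blank."""
    if yes_ticked and no_ticked:
        raise ValueError("both boxes ticked: check the paper form")
    if yes_ticked:
        return YES
    if no_ticked:
        return NO
    return None  # missing value, kept distinct from "No"
```

A Boolean variable could hold only two of these three outcomes, which is why a numeric variable is used instead.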
Unique Identifiers
In every dataset there must be a way of uniquely identifying a particular record. This might
be a single variable or a combination of variables. In databases this is often referred to as
the “Primary key”. Earlier we talked about assigning a unique code to individual
respondents in lieu of including names and addresses. In the same way you might have a
unique code for each household, each plot of land, etc.
As we have said the unique identifier might be a single variable or a combination of
variables. Consider the following data set:
VILLCODE  HOUSENO  RESPSEX  AGEYRS  HHHEAD  ACTHEAD
1         1        1        31      1
1         2        1        40      2       1
1         3        2        45      1
1         4        1        60      1
2         5        1        29      1
2         6        2        25      2       1
2         7        2        20      2       2
For this set of data households are numbered uniquely so HOUSENO can be defined as
the unique identifier. Now consider the following data set:
VILLCODE  HOUSENO  RESPSEX  AGEYRS  HHHEAD  ACTHEAD
1         1        1        31      1
1         2        1        40      2       1
1         3        2        45      1
1         4        1        60      1
2         1        1        29      1
2         2        2        25      2       1
2         3        2        20      2       2
Here households are numbered within villages and HOUSENO is not unique by itself. To
identify a single record (household) you need both the village code and the household
number. Thus the combination of VILLCODE and HOUSENO forms the unique
identifier.
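Whether a variable, or a combination of variables, really is unique can be checked by counting how often each value occurs. The Python sketch below does this for the second data set above, using only its two identifying variables; the function name is an assumption for illustration.

```python
from collections import Counter

# The second data set above, reduced to its two identifying variables.
households = [
    {"VILLCODE": 1, "HOUSENO": 1}, {"VILLCODE": 1, "HOUSENO": 2},
    {"VILLCODE": 1, "HOUSENO": 3}, {"VILLCODE": 1, "HOUSENO": 4},
    {"VILLCODE": 2, "HOUSENO": 1}, {"VILLCODE": 2, "HOUSENO": 2},
    {"VILLCODE": 2, "HOUSENO": 3},
]


def duplicate_keys(records, key_fields):
    """Return the key values that occur in more than one record."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return sorted(key for key, n in counts.items() if n > 1)


houseno_only = duplicate_keys(households, ["HOUSENO"])
village_and_house = duplicate_keys(households, ["VILLCODE", "HOUSENO"])
```

Here HOUSENO on its own gives duplicates for households 1, 2 and 3, whereas the combination of VILLCODE and HOUSENO gives none. A check like this is worth running whenever new records are added to a data file.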
Exercise
On the following pages you will find six completed copies of Section B of the
questionnaire each containing typical responses. Your task is as follows:
- Determine how many data files you would need for these data.
- Determine how many variables you would need to store the responses for this
  section.
- Create a table with the following columns: variable name, variable label, data type,
  value labels (codes).
- Indicate which variable or combination of variables would make the unique
  identifier – do you need to add any extra variables?
- Create an Excel workbook for these data – include a worksheet for each data file
  you identified in the first point above.
- Give appropriate names to the worksheets.
- Set appropriate validation rules and formats for the columns.
- Include in your workbook a worksheet containing a list of the codes to be used in
  the data. In a later session you will be shown how to link these codes with the data.
Solution
Data file for Households
Variable Name  Label                                                     Data type          Value labels
HHID           Household ID                                              Numeric (Integer)  N/A
RESPSEX        Sex of respondent                                         Numeric (Byte)     1=Male, 2=Female
RESPAGE        Age of respondent                                         Numeric (Byte)     N/A
RESPAGGP       Age group of respondent                                   Numeric (Byte)     1=Young, 2=Middle aged, 3=Old
MARITAL        Marital status                                            Numeric (Byte)     1=Single (never married), 2=Married with husband/wife living in household, 3=Married with husband/wife temporarily living/working away, 4=Divorced/separated, 5=Widow/widower
HHSIZE         Household size                                            Numeric (Byte)     N/A
HHHEAD         Are you the head of the household?                        Numeric (Byte)     1=Yes, 2=No
DEFACTO        Are you the acting head?                                  Numeric (Byte)     1=Yes, 2=No
HEADSEX        What is the sex of the head of household?                 Numeric (Byte)     1=Male, 2=Female
CULTLAND       How much land is your household cultivating this season?  Numeric (Decimal)  N/A
DAMBOLND       How much dambo/irrigated land does your household have?   Numeric (Decimal)  N/A
TOTLAND        How much land does your household have in total?          Numeric (Decimal)  N/A
The unique identifier for the first table is HHID.
Data file for Household members
Variable Name  Label                              Data type          Value labels
HHID           Household ID                       Numeric (Integer)  N/A
INDIVID        Individual member number           Numeric (Byte)     N/A
GARDSIZE       Size of garden cultivated (acres)  Numeric (Decimal)  N/A
TPTHISYR       Received a TIP pack this year?     Numeric (Byte)     1=Yes, 2=No
TPLASTYR       Received a TIP pack last year?     Numeric (Byte)     1=Yes, 2=No
The unique identifier for the second table is the combination of HHID and INDIVID.
This assumes individuals will be counted within households.