Why use databases for data entry

Module I2 Session 6
Why use databases for data entry?
Learning Objectives
At the end of this session students will be able to:
Explain the advantages and disadvantages of using a database for data entry.
Compare these advantages and disadvantages with those of the data entry tools explored in
the previous sessions.
Activities
Activity 1
The first activity of this session is to read the interview transcribed below and extract the
information needed to answer the following question:
What are the advantages of using a database for data entry?
Activity 2
Compile a table that presents a comparison of advantages and disadvantages of doing data
entry using a Spreadsheet, CS-Pro and a database such as Access.
SADC Course in Statistics
An interview with a professional data manager with
experience in managing survey data
Can you give some examples of data entry software you might consider
using?
It would really depend on how much data I had and how complex the data structures were.
For example for very small, flat files i.e. files in which all the data are at the same level, e.g.
household level, or plot level, then it is possible to use spreadsheets such as MS-Excel for
data entry and many researchers do this. Excel has the advantage of being in widespread
use and easy to learn at a basic level. However its ease of use can also be its downfall as it
can be used without much thought going into the structure of the data and it is far too easy
to make mistakes.
Some sort of database package is better for any data beyond a few rows and columns.
With packages such as Epi-Info and CS-Pro, both of which have data entry facilities, you
need to first create a “data dictionary”. This is a good thing as it makes you think about
exactly how many variables you will need and what they should be called. You can also
create data entry forms and customise these to resemble your data collection sheets and
questionnaires to some extent.
MS-Access is the application I generally use particularly for data from large surveys where
the data are at two or three different levels, e.g. household data and individual data. As
with Epi-Info and CS-Pro you need to set up the “tables” before you can enter data. The
data entry forms that you can create are incredibly flexible.
In what way is Access better than Epi-Info or CS-Pro?
Access is basically much more flexible. In the other packages although you can customise
the data entry forms to some extent and set some consistency and range checks, there is
much more you can do with Access. Behind the data entry forms in Access you can create
numerous “event procedures”. These are sections of VBA code that run in response to
various “events” happening on the form. The “event” may be entering some data into a
field or clicking on a command button. You can use these procedures to check for
consistency or for setting up an automatic skip and fill system, etc.
In an Access database everything is stored in a single file. With CS-Pro and I think this is
also the case with Epi-Info, you end up with different files for the data dictionaries (one
for each level of data from what I can work out), different files for the forms, etc. Just
trying out a very simple example in CS-Pro I ended up with six different files all with
different extensions some of which clash with other applications on my PC. This makes it
very messy when you want to pass your application to someone else to use and it is very
easy to forget or lose one or more of the files. With Access it’s just a case of dealing with a
single file which is much easier. There are different “objects” within the Access database
file – e.g. tables (which store the data), queries (which ask questions of the data), forms
(used for data entry and editing) and reports (for summarising data) – but there is just one
single file (with the extension .mdb) so only one file to back up and/or pass to colleagues.
It is often the case, particularly in surveys, that you have “sub-tables” on the
questionnaire or data collection sheet. In Section B of the TIP questionnaire for example
question 8 asks about gardens owned by members of the household. Each household can
have several gardens so effectively the “garden” data is another level. In Access you can
easily include “sub-forms” on the data entry forms to accommodate these sub-tables. I feel
this should be possible in Epi-Info and CS-Pro and it may be but so far I haven’t managed
to do this. In CS-Pro I got the impression I would have to enter these data using a
separate form which is not very convenient and could result in this data being missed out.
Of course it may be that I just haven’t worked out how to do this in CS-Pro but I know it
is relatively straightforward in Access.
Another feature that Access has but which I haven’t yet found in the other packages is
“option groups”. For example question 1 asks for the sex of the respondent and on the
questionnaire has Male 1 Female 2. In Access you can easily put an option group on the
form which is linked to the SEX variable and has a check box for Male and a check box for
Female.
What is the “most important” reason for using a relational database
package?
Well in my view it is the ability to create relationships or links between different levels of
the data. Of course when we talk about relationships here we do not mean in the statistical
sense – at this stage we are not interested in things such as whether the level of income is
related to the health of family members for instance. Here we are talking about being able
to link for example data on individuals with data about the household. This is done by
having the unique identifier for the household level appearing in the individual level as well.
Of course we can do this in Excel – we can have data at the household level on one
worksheet and data at the individual level on another worksheet and the household ID
column in both worksheets. However, in Excel there is no way to ensure that the
household ID is unique at the household level or to make sure that every household ID
mentioned at the individual level actually exists. In Access on the other hand, you can set
Household ID to be the “primary key” at the household level. The primary key must be
unique for each record in the table. You can also set up a relationship between the
household table and the individuals table and enforce the relationship so that you cannot
enter an individual into a household that does not exist on the database. In Access this is
referred to as “referential integrity” and it basically helps to ensure the integrity of the data.
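The two rules described above (a unique household ID, and no individual without an existing household) can be sketched outside Access as well. The following is a minimal illustration using Python's built-in sqlite3 module rather than Access itself; the table names, field names and example values are hypothetical and are not taken from the TIP questionnaire.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Household level: hhid is the primary key, so it must be unique.
conn.execute("CREATE TABLE household (hhid INTEGER PRIMARY KEY, village TEXT)")
# Individual level: hhid must match an existing household (referential integrity).
conn.execute(
    "CREATE TABLE individual ("
    " indid INTEGER PRIMARY KEY,"
    " hhid INTEGER NOT NULL REFERENCES household(hhid),"
    " name TEXT)"
)


def try_insert(sql, params):
    """Run an insert; return True if accepted, False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


ok_household = try_insert("INSERT INTO household VALUES (?, ?)", (1, "A"))
dup_household = try_insert("INSERT INTO household VALUES (?, ?)", (1, "B"))
ok_member = try_insert("INSERT INTO individual VALUES (?, ?, ?)", (1, 1, "Member 1"))
orphan_member = try_insert("INSERT INTO individual VALUES (?, ?, ?)", (2, 99, "Member 2"))
print(ok_household, dup_household, ok_member, orphan_member)  # True False True False
```

The duplicate household and the individual pointing at household 99 are both refused by the database itself, which is the behaviour a plain worksheet does not give you by default.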
What steps would you follow to create a data entry system in Access?
1. The first step should always be to work out exactly how many levels of data you have
and what variables you need at each level. Give names to your variables and I always
suggest using a maximum of 8 characters for variable or field names. This is so that the
data can later be extracted to almost any statistical package for analysis. Make sure you
don’t have any duplicate variable names even at different levels – in one database I
worked on we had a variable called SEX at two different levels and this led to a lot of
confusion later on when we came to merge and aggregate the data.
2. At the same time work out the data types for each variable – e.g. text, numeric, date,
etc. With numeric variables, do you need to allow for decimals, or is it a numeric code,
in which case you can restrict entry to a limited set of values? For these variables I tend
to use “lookup” fields in which the field is linked to a small table of codes and values.
That way you have all the information in the database to interpret the codes.
3. Then you create the tables – there tends to be one table for each level of data although
for very large surveys there can be lots of tables at the same level – there is a limit of
255 variables in a single Access table. For large surveys I’ve often created a different
table for each section of the questionnaire and these tables are linked with a “one-to-one”
relationship – i.e. one record in one table links to exactly one record in the other
table. The link is generally via the primary key field.
4. After you create the tables you next create the relationships and thus the structure of
the database itself. Where possible you should set referential integrity on the
relationships.
5. The next step is to create the data entry forms. As I’ve already said these are very
flexible in Access. This is probably one of the more “creative” aspects of Access
although as much as possible it is best to make your form resemble your data collection
sheet or questionnaire. Designing the layout of the form is done first and you can add
extra text and dividing lines where necessary. In the past I have created multi-lingual
databases with sets of forms in different languages. The forms are linked to the tables
in the database and that’s where the data are stored, but the labels on the forms can be
in any language. With large questionnaires you often reach the size limit of data entry
forms but using command buttons and code you can link from one form to the next
easily enough ensuring that you are keeping to the same record as you move between
forms. This would be like turning the page on the paper questionnaire.
6. Once the layout of the forms is complete the next stage is to check that the flow
through the form is correct. In Access the “tab order” which is the order in which you
would normally go through the variables on the form, would by default match with the
order that they were added to the form. This might result in you jumping all over the
place so this needs to be checked and if necessary the tab order can be adjusted.
7. I would then go through and put in any automatic skips. For example in Section B of
the TIP questionnaire question 5a asks Are you the head of the household? Question 5b
then asks If not are you the acting head? Of course if the answer to 5a is “Yes” we would
want to skip over 5b. This we can do in Access using an event procedure (a piece of
VBA code) that checks the value that was entered in 5a and if necessary skips over
question 5b. If you have assigned a code for “Not applicable” this code could
automatically be inserted into 5b. This saves time during data entry and helps maintain
completeness and integrity of the data.
8. We can also include some consistency checks on the values entered. For example
question 7 asks How much land is your household cultivating this season? – question 10 asks
How much land does your household have in total? – you would expect the value given in
question 10 to be at least as large as that given in question 7 so you can include an event
procedure to check this and to give a customised message if there is a discrepancy.
9. Finally I would tend to build a “front-end” in the form of a “Main menu” form with
command buttons leading to the data entry forms. It is easy to get the main menu to
open as soon as you open the database.
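The logic behind steps 7 and 8 can be sketched independently of Access. The example below is written in Python rather than VBA and is only an illustration of the idea, not the event procedures used in the actual database; the field names (q5a, q5b, q7, q10), the answer codes and the “Not applicable” code are assumptions made for the example.

```python
NOT_APPLICABLE = 9  # hypothetical code for "Not applicable"
YES, NO = 1, 2      # hypothetical answer codes for question 5a


def apply_skip(record):
    """Skip and fill: if 5a is Yes, fill 5b with the not-applicable code."""
    filled = dict(record)  # work on a copy so the original entry is untouched
    if filled.get("q5a") == YES:
        filled["q5b"] = NOT_APPLICABLE
    return filled


def check_land(record):
    """Return a message if cultivated land (q7) exceeds total land (q10), else None."""
    q7, q10 = record.get("q7"), record.get("q10")
    if q7 is None or q10 is None:
        return None  # nothing to compare while a value is still missing
    if q7 > q10:
        return f"Q7 ({q7}) is larger than Q10 ({q10}): please check the questionnaire."
    return None


entry = apply_skip({"q5a": YES, "q7": 2.5, "q10": 1.0})
message = check_land(entry)
print(entry["q5b"], message)
```

In Access the same two checks would be attached to events on the form, for example when the value of 5a or question 10 is entered, so the data entry clerk sees the result immediately.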
What other useful features does Access have?
The reports can be very useful not only as a way of summarising the data but they can be
used to pre-print additional data entry questionnaires. I’ve recently been working on a
database for a study in which there is an initial household interview where data are
collected about the household including a roster of household members. Then there will
be regular visits to the household throughout the study when questions will be asked
about the household members. To save having to write out the names of the household
members each time – which can lead to lots of errors – I have included a set of reports to
print out the visit forms including the names of the household members. This is working
really well.
Are there any disadvantages to using Access?
Yes there are several. I guess an important one is the cost – both Epi-Info and CS-Pro
are free so cost is not a concern with them.
Access also has a very steep learning curve. Having said that though I believe the most
important thing to understand about data is its structure, i.e. the levels of data and how
the different levels are related. In my view this is an important concept to understand
even if you are just using a spreadsheet package. Access (and other database packages)
tend to make you think more about the structure of your data before you start and this
in my view is a good thing. I do accept though that Access is not the most straightforward package to learn to use – it’s taken me 12 years to get to my current level of
understanding and I’m still learning new things!
Access database files tend to expand exponentially in size! For some very large surveys
I have created databases which are 40 or 50MB just in their design – that’s before any
data are entered! Access tends to eat disk space in much the same way as I would eat
chocolate! This does cause problems if you need to send the database to others as the
files are often much too large to attach to an email for example.
There is no inbuilt double-data entry checking procedure in Access. This is where
Epi-Info comes in particularly handy as that includes a “Data Compare” facility which will
compare tables from Access databases – it does tend to be a bit slow on very large
tables but it does work if you give it long enough so is very useful.
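The idea behind double data entry checking is simple: the same questionnaires are keyed in twice and the two versions are compared field by field. The sketch below shows that idea in Python; it is not how the Epi-Info “Data Compare” facility is implemented, and the record IDs, field names and values are made up for the example.

```python
def compare_entries(first, second):
    """Compare two data entry passes and return a list of discrepancies.

    Each pass maps a record ID to a dict of field values. Each discrepancy is
    a tuple (record ID, field name, value in first pass, value in second pass);
    the field name is None when the whole record is missing from one pass.
    """
    diffs = []
    for rec_id in sorted(set(first) | set(second)):
        a, b = first.get(rec_id), second.get(rec_id)
        if a is None or b is None:
            diffs.append((rec_id, None, a, b))  # record keyed in only once
            continue
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                diffs.append((rec_id, field, a.get(field), b.get(field)))
    return diffs


# Made-up example: record 1 has a transposed age, records 2 and 3 were each keyed once.
pass1 = {1: {"sex": 1, "age": 34}, 2: {"sex": 2, "age": 27}}
pass2 = {1: {"sex": 1, "age": 43}, 3: {"sex": 2, "age": 27}}
diffs = compare_entries(pass1, pass2)
for diff in diffs:
    print(diff)
```

Each discrepancy then has to be resolved by going back to the paper questionnaire, since the comparison alone cannot tell which of the two entries is correct.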
Some people might find it a disadvantage that Access does not have facilities for
analysis. I know Epi-Info and CS-Pro include some statistical functions and some
people must find these useful. In Access you can do some simple summaries in queries
and reports but nothing beyond that. However, in my view this is not a disadvantage
as I’ve a strong belief in using the correct package for each task and a suspicion of
packages that claim to do everything – I’m not trying to downplay Epi-Info and CS-Pro
with this comment as I believe these are good packages in their own way especially
given that they are freely available – but Access is primarily a database management
system and therefore managing data is what it does best.
I was a little concerned when I heard that in Access 2007 you can create fields that hold
more than one value. To me this goes against basic data management principles of
“one item per cell”. I haven’t fully investigated this yet but I think it may be something
that researchers should be careful of. On the other hand it does look as though
Microsoft have finally removed the default value of zero on numeric fields – I always
used to find this frustrating. If you missed removing the default values you could end
up with lots of false zeros in your datasets and as all researchers know a missing value
should never be treated as zero!
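A small worked example shows why false zeros matter. The values below are invented purely for illustration, and the variable is a hypothetical land area with one missing answer; the sketch is in Python, with None standing for a missing value.

```python
from statistics import mean


def mean_ignoring_missing(values):
    """Mean of the values that are present; None if every value is missing."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


land = [2.0, None, 4.0]  # made-up values; None marks a missing answer
with_false_zeros = [0 if v is None else v for v in land]  # what a default of zero does

correct_mean = mean_ignoring_missing(land)  # 3.0, based on the two real answers
biased_mean = mean(with_false_zeros)        # 2.0, pulled down by the false zero
print(correct_mean, biased_mean)
```

With the missing answer stored as zero the average is understated, and nothing in the dataset shows that one household never answered the question.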
In conclusion…
I thoroughly enjoy using Access but if you have only a small dataset and not a great deal of
time then you might be better off using Epi-Info or CS-Pro (especially if you have a limited
budget). If you do opt for Access be prepared for a steep learning curve – when you
manage to do something clever with it, it is very rewarding.