Module I2 Session 08

Module I2 Session 8
Managing Data: Data Transfer
Learning objectives
At the end of this session students will be able to:

Explain the differences between the following formats in which data can be
stored: csv, txt with fixed columns, txt separated by spaces, Excel, Word
tables and Access.

Transfer data stored in csv, txt with fixed columns, txt separated by spaces,
Excel, Word tables or Access files into any of the other formats.

Prepare a data file with a structure suitable for statistical analysis.

Explain the reasons why specialised data transfer software is sometimes
used.


Activity 1: Assessing the structure of data
Ideally, manipulating data to get it into the required format should take very little time. In
practice, it can be a source of great frustration for researchers who, having obtained and
entered their data, are desperate to get results out but find they need to alter its format.
These two sessions seek to highlight potential problems and equip the user with techniques
and ideas on how to transform data in different situations.
The following data were presented in the report Energy Statistics 2004 published by the
Central Bureau of Statistics of Botswana. The tables show the percentage distribution of
households by principal source of energy used for lighting in 1981, 1991 and 2001.
You need to use these data to construct a dataset of the number of households that used
different energy sources for lighting in 1991. This dataset should be suitable for use in a
statistical package. You may want to review the notes from Module B2 about how to
organise data in a spreadsheet.
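Because the published tables give percentages rather than counts, each number of households has to be derived from a percentage and the corresponding total. A minimal Python sketch of that arithmetic is given here. The function name is hypothetical, and the total and percentage are made-up illustrative figures, not values taken from the Botswana report; it also assumes that a total number of households is available for each group.

```python
def households_from_percentage(percentage, total_households):
    """Convert a percentage share into a whole number of households."""
    return round(percentage / 100 * total_households)


# Illustrative figures only (NOT taken from Energy Statistics 2004).
example_total = 2000
example_percentage = 12.5

print(households_from_percentage(example_percentage, example_total))
```

Because each count is rounded separately, the counts for all energy sources may not add up exactly to the total, so it is worth checking the sum after conversion.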
Look at the tables below and name three problems that make the current arrangement
unsuitable for this purpose.
SADC Course in Statistics
The following picture shows the structure of a file that would be suitable for statistical
work, where EnergySource is the source of energy, Year is the year of the data, Strata
distinguishes urban from rural and NoHhls is the variable that contains the
number of households.
Describe this structure and compare it with the table format in which the data are presented.
a) How many columns will there be in the dataset?
b) How many rows of data would you expect in the new dataset?
This dataset follows some basic rules for structuring data for statistical analysis:

The data should be organised in rows and columns.
The topmost row of the dataset should contain the names of the variables.
Each column contains data from one variable, and all the data within that column are of the same type.
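The third rule can be checked mechanically. The following Python sketch lists the types found in each column of a small dataset laid out as described above. The rows, the category names and the function name are hypothetical and only illustrate the layout; they are not the Botswana figures.

```python
def column_types(rows):
    """Return, for each column, the set of type names found in the data rows.

    `rows` is a list of lists whose first row holds the variable names.
    """
    header, data = rows[0], rows[1:]
    types = {name: set() for name in header}
    for row in data:
        for name, value in zip(header, row):
            types[name].add(type(value).__name__)
    return types


# Hypothetical rows; the values are illustrative only.
dataset = [
    ["EnergySource", "Year", "Strata", "NoHhls"],
    ["Electricity", 1991, "Urban", 100],
    ["Electricity", 1991, "Rural", 50],
    ["Gas_LPG", 1991, "Urban", "-"],  # a stray symbol in a numeric column
]

for name, found in column_types(dataset).items():
    status = "ok" if len(found) == 1 else "mixed types"
    print(name, sorted(found), status)
```

A column reported as "mixed types" breaks the rule that all the data within a column are of the same type, and would need cleaning before analysis.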
Activity 2: Reading the data into Excel
Data can be found in a variety of formats, usually referred to by the extension of the file.
Some of the most common formats are:
Format and description

CSV: Comma Separated Values
    Data are stored as clean text [1] and each piece of data is
    separated by a comma.

TXT: Text
    Data are stored as clean text in which variables are
    separated by spaces or tabs. When using this format it is
    important to determine whether multiple spaces/tabs
    should be treated as a single separator or not.

TXT: fixed width
    Data are stored as clean text in which variables are arranged
    in columns of fixed width, usually aligned to the right.

XLS
    The standard format in which Excel stores its workbooks. A
    good format for transferring data between applications.

MDB
    The standard format for Access database files.
Statistical packages have developed their own data formats too. They are specific to a
package and cannot always be read directly from other applications.
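The differences between the three text formats can be seen by reading one line of each in Python. This is a sketch only: the sample lines, the values 10 and 20, and the column positions for the fixed-width case are assumptions made for illustration.

```python
import csv
import io

csv_text = "Gas (LPG),10,20\n"
space_text = "Gas_LPG   10  20\n"
fixed_text = "Gas (LPG)    10    20\n"

# CSV: the csv module splits on commas, so spaces inside a value are kept.
csv_row = next(csv.reader(io.StringIO(csv_text)))

# Space separated: split() with no argument treats any run of spaces or
# tabs as ONE separator; split(" ") treats every single space as a separator.
merged = space_text.split()
strict = space_text.rstrip("\n").split(" ")

# Fixed width: slice each line at known character positions (assumed here).
positions = [(0, 9), (9, 15), (15, 21)]
fixed_row = [fixed_text[start:end].strip() for start, end in positions]

print(csv_row)
print(merged)
print(strict)
print(fixed_row)
```

The empty strings produced by `split(" ")` show why it matters whether multiple spaces count as a single separator: read strictly, they look like missing values.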
Exercise 1
The file “lighting1991.txt” is a space-separated version of the 1991 data in table 1.11. Open
the file and try to foresee the problems you would have if you copied and pasted the contents
of the file into Excel.
Now try copying the lighting data into Excel using copy and paste. What problems do
you see now?
[1] Clean text includes Unicode characters without any formatting. Unicode is an industry standard designed to
allow text and symbols from all of the writing systems of the world to be consistently represented and
manipulated by computers.
You will notice that the data appear only in the first column, because Excel does not recognise the
existence of columns of data. One way to solve this problem is to import the file
into Excel using the menu File, Open and selecting lighting1991.txt.
Then choose Delimited (our columns are separated by spaces). After clicking Next,
specify Space as a delimiter in addition to Tab. Click Next and then Finish, and the
data should appear as below:
What has gone wrong?



How many columns did you expect to have in the Excel spreadsheet?
How many rows did you expect to have?
Does every column contain only the data that you expected or have there
been unwanted shifts?
Propose ways of solving the problem in the source data file (lighting1991.txt). Re-arranging
the data manually in Excel is not recommended: it would work for this
dataset, but it would not be practical for a large one.
Go back to Notepad, solve the problem in the file, and then follow the steps above to read it in
again. When it has been read in correctly, save it as lighting1991.xls.
Did it work?
It may be useful to use an underscore (_) instead of a space so that Excel recognises the text as the
contents of one cell; for example, “Gas (LPG)” can be written “Gas_LPG”.
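For a large file, the same repair could be scripted rather than typed in Notepad. The sketch below is one possible approach, not the method used in this session: it assumes every line ends with a known number of data values and that everything before them is the label. The sample lines, their values and the function name are hypothetical.

```python
def join_label(line, n_values):
    """Join the leading label words with underscores.

    The last `n_values` fields are kept as separate data values.
    """
    parts = line.split()
    label = "_".join(parts[:-n_values])
    return [label] + parts[-n_values:]


# Hypothetical lines with two data values each.
lines = [
    "Gas (LPG) 10 20",
    "Electricity 30 40",
]

for line in lines:
    # A label containing a space produces one field too many.
    print(len(line.split()), join_label(line, 2))
```

Note that this yields “Gas_(LPG)” rather than “Gas_LPG”; either works, as long as the label contains no spaces.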
Exercise 2
We have created the file lighting1991.csv, which shows one way of solving these problems. Open it
using a text editor (for example Notepad, not Excel) and compare it with your solution.

What happened to the dashes where the percentage was zero?
What happened to the commas used to indicate thousands in the original table?
How are spaces treated now?

Open the data in Excel using the menu “File, Open”. Did it work this time?
However, the dataset does not yet have the desired structure. Run the demonstration “Data
Manipulation I2 S8” and follow each step with your dataset to get it into the right shape.
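The reshaping can also be expressed in code. The Python sketch below is not the demonstration itself; it only illustrates the idea of turning column headings into values of a Strata variable. The input table, its values and the function name are hypothetical.

```python
def wide_to_long(table, year):
    """Turn one row per energy source into one row per source and stratum."""
    header, body = table[0], table[1:]
    long_rows = [["EnergySource", "Year", "Strata", "NoHhls"]]
    for row in body:
        source = row[0]
        # Pair each column heading (Urban, Rural) with its value.
        for strata, value in zip(header[1:], row[1:]):
            long_rows.append([source, year, strata, value])
    return long_rows


# Hypothetical wide table; the values are illustrative only.
wide = [
    ["EnergySource", "Urban", "Rural"],
    ["Electricity", 100, 50],
    ["Gas_LPG", 20, 10],
]

for row in wide_to_long(wide, 1991):
    print(row)
```

Each row of the wide table produces one row per stratum in the long table, which is the structure described in Activity 1.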
Exercise 3
Now that you have seen the demonstration, try doing it yourself with the lighting1991.csv
dataset.
When you have finished, try doing it with the lighting dataset stored in lighting2001.csv and
with the cooking.csv dataset.
Transferring data to other statistical packages
Excel is a very useful format to have data in, as it is easily transferred to other packages.
Here we consider other ways of saving data so that you can transfer files to and from
statistical packages, since it is in these that you will do the statistical analysis you require.
Well-structured CSV files are readable by most statistical packages, as are
space-delimited text files. However, you must check the structure of the
data before attempting to read it in. A statistical package does not offer the flexibility that
Excel provides for reading badly structured data, so you will soon realise if your data have
been incorrectly imported. For the same reason, re-arranging badly
structured data can be much more difficult in a statistical package than in Excel.
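One simple structural check before importing a CSV file is to confirm that every line has the same number of fields as the header. The Python sketch below is a minimal version of that check; the sample text, its values and the function name are hypothetical.

```python
import csv
import io


def rows_with_wrong_width(text):
    """Return the header and a list of (line number, field count) for every
    data line whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    bad = []
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            bad.append((line_number, len(row)))
    return header, bad


# Hypothetical file contents: the third line holds an unquoted thousands comma.
sample = (
    "EnergySource,Urban,Rural\n"
    "Electricity,100,50\n"
    "Gas_LPG,1,234,10\n"
)

header, bad = rows_with_wrong_width(sample)
print(len(header), bad)
```

A line with too many fields often points to a comma inside a value, such as a thousands separator, which needs to be removed or quoted in the source file.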
Some commercial software packages, such as Stat/Transfer, transfer data quickly and
easily between different statistical packages. They have the advantage
that they carry over not only the variables but also the variable labels and any labels that
have been created for the values of a specific variable.
Conclusion
This session has sought to highlight that data manipulation presents a number of problems and
that things can get complicated. It has also sought to show that, although the process can be
frustrating and time-consuming, with a few techniques at your disposal large problems
can be broken down into small, manageable ones and the data manipulated into shape.
These examples were designed to show you how awkward real data can be. They illustrate
some of the common problems that you may come across whilst trying to read your data
in. Here are some tips that you may find useful in the future:




Be aware that spaces in the data can be problematic. If the data are separated by
spaces, check whether two consecutive spaces indicate a missing value or whether
multiple spaces can simply occur between values. A trick for text that is sometimes
useful is to use an underscore (_) instead of a space so that Excel recognises it as
the contents of one cell.

Check for non-numerical symbols or characters in the data. Excel will accept them
without complaint, but your calculations may be affected. Statistical packages don’t
always recognise words.

Are there any unusual symbols? What do they mean? Here we have ‘-’, which
means 0, yet often it means a missing value.

Have the data been appropriately structured? In our example the first two
rows needed to be converted into columns.
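The tips about symbols can be turned into a small cleaning step. The Python sketch below shows one way to do it; the function name and its `dash_means_zero` parameter are hypothetical, and it assumes that commas are thousands separators, not decimal marks.

```python
def to_number(text, dash_means_zero=True):
    """Convert a text cell to a number, handling '-' and thousands commas.

    Returns None for an empty cell, and for '-' when it marks a missing value.
    """
    cleaned = text.strip().replace(",", "")
    if cleaned == "":
        return None
    if cleaned == "-":
        return 0 if dash_means_zero else None
    try:
        return float(cleaned)
    except ValueError:
        # Any other symbol or word is reported, not silently accepted.
        raise ValueError(f"unexpected non-numeric value: {text!r}")


print(to_number("1,234"))
print(to_number("-"))
print(to_number("-", dash_means_zero=False))
```

Making the meaning of ‘-’ an explicit choice forces you to decide, for each dataset, whether it stands for zero or for a missing value.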
Distinctions have been drawn between useful formats (especially .txt, .csv, .xls) in the hope
that you will now be able to move your data around to and from various packages and
sources. Patience and attention to detail are key virtues here!
Finally, we hope you now realise how important it is to design datasets with a good
structure. A lot of time and effort can be saved, and errors avoided, if good data
structures are prepared right from the start of a data management project.