Module I2 Session 8

Managing Data: Data Transfer

Learning objectives

At the end of this session students will be able to:

- Explain the differences between the following formats in which data can be stored: csv, txt with fixed columns, txt separated by spaces, Excel, Word tables, Access.
- Transfer data stored in any of these formats (csv, txt with fixed columns, txt separated by spaces, Excel, Word tables, Access) into any of the others.
- Prepare a data file with a structure suitable for statistical analysis.
- Explain the reasons why specialised data transfer software is sometimes used.

Activity 1: Assessing the structure of data

Ideally, manipulating data to get it into the required format should take very little time. In practice, it can be a source of great frustration for researchers who have collected and entered their data and are desperate to get results out, but find they need to alter its format. These two sessions seek to highlight potential problems and equip the user with techniques and ideas on how to transform data in different situations.

The following data were presented in the report Energy Statistics 2004, published by the Central Bureau of Statistics of Botswana. The tables show the percentage distribution of households by principal source of energy used for lighting in 1981, 1991 and 2001. You need to use these data to construct a dataset of the number of households that used different energy sources for lighting in 1991. This dataset should be suitable for use in a statistical package. You may want to review the notes from Module B2 about how to organise data in a spreadsheet.

Look at the tables below and name three problems that make the current arrangement unsuitable for this purpose.
SADC Course in Statistics Module I2 Session 8 – Page 1

The following picture shows the structure of a file that would be good for use in statistical work, where EnergySource is the source of energy, Year is the year of the data, Strata distinguishes urban from rural, and NoHhls is the variable that contains the number of households. Describe this structure and compare it with the table format in which the data are presented.

a) How many columns will there be in the dataset?
b) How many rows of data would you expect in the new dataset?

This dataset follows some basic rules for structuring data for statistical analysis:

- The data should be organised in rows and columns.
- The topmost row of the dataset should contain the names of the variables.
- Each column contains data from one variable, and all the data within that column are of the same type.

Activity 2: Reading the data into Excel

Data can be found in a variety of formats, usually referred to by the extension of the file. Some of the most common formats are:

CSV (Comma Separated Variables): Data are stored as clean text [1] and each piece of data is separated by a comma.
TXT (text): Data are stored as clean text in which variables are separated by spaces or tabs. When using this format it is important to determine whether multiple spaces or tabs should be treated as a single separator or not.
TXT (fixed width): Data are stored as clean text in which variables are arranged in columns, usually aligned to the right.
XLS: The standard format in which Excel stores its workbooks. A good format for transferring data between applications.
MDB: The standard format for Access database files.

Statistical packages have developed their own data formats too. These are specific to a package and cannot always be read directly by other applications.
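The difference between the three text formats described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are made up (they are not figures from the Botswana report), the helper names read_csv, read_space_separated and read_fixed_width are hypothetical, and the column widths (10, 6, 6) were chosen for this example. The last line shows why it matters whether repeated spaces count as a single separator.

```python
import csv
import io

# Made-up example rows (NOT figures from the Botswana report).
records = [
    ("Source", "Urban", "Rural"),
    ("Gas_LPG", "10", "20"),
    ("Candle", "1", "2"),
]

# The same rows written in the three text formats.
csv_text = "\n".join(",".join(r) for r in records)
space_text = "\n".join("  ".join(r) for r in records)  # two spaces between values
fixed_text = "\n".join(f"{a:<10}{b:>6}{c:>6}" for a, b, c in records)


def read_csv(text):
    """Split each line on commas using the standard csv module."""
    return [tuple(row) for row in csv.reader(io.StringIO(text))]


def read_space_separated(text, merge_runs=True):
    """Split on spaces; merge_runs decides how repeated spaces are treated."""
    if merge_runs:
        # split() with no argument treats any run of spaces or tabs as one separator
        return [tuple(line.split()) for line in text.splitlines()]
    # split(" ") treats every single space as a separator, producing empty fields
    return [tuple(line.split(" ")) for line in text.splitlines()]


def read_fixed_width(text, widths):
    """Cut each line at fixed character positions, then trim the padding."""
    rows = []
    for line in text.splitlines():
        fields, start = [], 0
        for width in widths:
            fields.append(line[start:start + width].strip())
            start += width
        rows.append(tuple(fields))
    return rows


print(read_csv(csv_text) == records)
print(read_space_separated(space_text) == records)
print(read_fixed_width(fixed_text, (10, 6, 6)) == records)
# With every space treated as a separator, empty fields appear between values.
print(read_space_separated(space_text, merge_runs=False)[1])
```

All three readers recover the same rows here only because the labels contain no spaces and the widths are known in advance; neither can be taken for granted with real files.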
Exercise 1

The file “lighting1991.txt” is a space-separated version of the 1991 data in table 1.11. Open the file and try to foresee the problems you would have if you copied and pasted its contents into Excel.

Now try copying the lighting data into Excel using copy and paste. What problems do you see now?

[1] Clean text includes Unicode characters without any formatting. Unicode is an industry standard designed to allow text and symbols from all of the writing systems of the world to be consistently represented and manipulated by computers.

You will notice that the data appear only in the first column, as Excel does not recognise the existence of columns of data. One option to solve this problem is to import the file into Excel using the menu File, Open and selecting lighting1991.txt. Then choose Delimited (our columns are separated by spaces). After clicking Next, specify ‘Space’ as a delimiter as well as Tab. Click Next and then Finish, and the data should appear as below:

What has gone wrong? How many columns did you expect to have in the Excel spreadsheet? How many rows did you expect to have? Does every column contain only the data that you expected, or have there been unwanted shifts?

Propose ways of solving the problem in the source data file (lighting1991.txt). Re-arranging the data manually in Excel is not recommended: it would work for this dataset, but it would not work for a large dataset.

Go back to Notepad, solve the problem, and then follow the steps above to read the file in correctly, this time saving it as lighting1991.xls when you finish. Did it work? It may be useful to use an underscore (_) instead of a space so that Excel recognises the text as the contents of one cell; for example, “Gas (LPG)” can be written “Gas_LPG”.

Exercise 2

We have created the file lighting1991.csv.
This is one way of solving the problems. Open it using a text editor (for example Notepad, not Excel) and compare it with your solution. What happened to the dashes where the percentage was zero? What happened to the commas used to indicate thousands in the original table? How are spaces treated now?

Open the data in Excel using the menu “File, Open”. Did it work this time?

However, the dataset still does not have the desired structure. Run the demonstration “Data Manipulation I2 S8” and follow each step with your dataset to get it into the right shape.

Exercise 3

Now that you have seen the demonstration, try doing it yourself with the lighting1991.csv dataset. When you have finished, try doing it with the lighting dataset stored in lighting2001.csv and with the cooking.csv dataset.

Transferring data to other statistical packages

Excel is a very useful format in which to hold data, as it is easily transferred to other packages. Here we consider other ways of saving data so that you can transfer files to and from statistical packages, for it is in these that you will do the statistical analysis you require.

Well-structured CSV files are readable by most statistical packages, as are space-delimited text files. However, you must check the structure of the data before attempting to read it in. A statistical package does not allow the flexibility that Excel provides for reading badly structured data, so you will soon realise if your data have been incorrectly imported. On the other hand, for the same reason, re-arranging badly structured data can be much more difficult in a statistical package than in Excel.

There are some software packages you can buy that transfer data quickly and easily between different statistical packages, such as Stat/Transfer.
They have the advantage that they carry over not only the variables but also the variable labels and any labels that have been created for the values of a specific variable.

Conclusion

This session has sought to highlight that data manipulation presents a number of problems and that things can get complicated. It has also sought to establish that, whilst it can be frustrating and time-consuming, with a few techniques at your disposal large problems can be broken down into small, manageable ones and the data manipulated successfully.

These examples were designed to show you how awkward real data can be. They illustrate some of the common problems that you may come across whilst trying to read your data in. Here are some tips that you may find useful in the future:

- Be aware that spaces in the data can be problematic. If the data are separated by spaces, check whether two spaces mean a missing value or whether there can be multiple spaces between values. A trick for text that is sometimes useful is to use an underscore (_) instead of a space so that Excel recognises it as the contents of one cell.
- Check for non-numerical symbols or characters in the data. Excel will accept them without problem, but your calculations may be affected. Statistical packages don’t always recognise words.
- Are there any unusual symbols? What do they mean? Here we have ‘-’, which means 0, yet often it means a missing value.
- Have the data been appropriately structured? In our example the first two rows needed to be converted into columns.

Distinctions have been drawn between useful formats (especially .txt, .csv, .xls) in the hope that you will now be able to move your data to and from various packages and sources. Patience and attention to detail are key virtues here!

Finally, we hope you now realise how important it is to design datasets with a good structure.
A lot of time and effort can be saved, and errors avoided, if good data structures are prepared right from the start of a data management project.
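Several of the tips in the conclusion (a dash that stands for zero, commas marking thousands, and a table that has to be reshaped into the EnergySource, Year, Strata, NoHhls structure) can be sketched in Python. This is a minimal illustration, not the method used in the session's Excel demonstration. The input values are invented placeholders rather than figures from the report, the layout of the wide table is an assumption, and the helper names to_number and wide_to_long are hypothetical.

```python
import csv
import io

# Hypothetical wide-format CSV. All values are invented placeholders,
# NOT figures from the Energy Statistics report.
raw = '''Source,Urban,Rural
Gas (LPG),"1,200",-
Candle,300,"4,500"
'''


def to_number(cell, dash_means_zero=True):
    """Convert a text cell to an integer, handling '-' and thousands commas."""
    cell = cell.strip()
    if cell == "-":
        # In this session's tables '-' means 0; elsewhere it often means missing.
        return 0 if dash_means_zero else None
    return int(cell.replace(",", ""))


def wide_to_long(text, year):
    """Reshape one row per source into one row per source and stratum."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    strata = header[1:]  # assumed layout: first column is the source name
    long_rows = [["EnergySource", "Year", "Strata", "NoHhls"]]
    for row in reader:
        source = row[0]
        for stratum, cell in zip(strata, row[1:]):
            long_rows.append([source, year, stratum, to_number(cell)])
    return long_rows


rows = wide_to_long(raw, 1991)
for row in rows:
    print(row)
```

The dash_means_zero argument makes the meaning of ‘-’ an explicit decision rather than a silent default, which is the point of the tip about unusual symbols.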