DATA PROCESSING IN FSD - FORSDATA

DATA PROCESSING IN FSD Hannele Keckman-Koivuniemi, Research Officer, Finnish Social Science Data Archive Jouni Sivonen, Systems Analyst, Finnish Social Science Data Archive 1. Purpose of data processing ______________________________________________________ 1 2. Depositing data _______________________________________________________________ 2 3. User groups and security permissions _____________________________________________ 2 4. Data folders and files __________________________________________________________ 2 5. Processing quantitative data _____________________________________________________ 3 5.1. Data are processed intensively - levels of processing in preparation _______________________ 3 5.2.Checking the data _________________________________________________________________ 3 5.3. Data protection __________________________________________________________________ 4 5.4. Version Control __________________________________________________________________ 4 5.5. FSD Syntax Example ______________________________________________________________ 5 6. Processing qualitative data ______________________________________________________ 6 7. Data backup __________________________________________________________________ 7 8. TIIPII database _______________________________________________________________ 7 9. Data conversions and dissemination ______________________________________________ 8 10. Data publishing ______________________________________________________________ 8 1. Purpose of data processing FSD's goal is to make researchers and research communities understand the significance of a data archive, and to promote the archive as a valuable resource for research and teaching. FSD provides a wide range of services from data archiving to information services. Its primary goal is to increase the use of existing social science research data by disseminating it throughout Finland and also internationally The purpose of data processing is to enhance the usefulness of the datasets for secondary analysis. By processing the data intensively, we aim to make the content of the archived data correspond as closely as possible to that of the original questionnaire. 2. Depositing data The archiving process begins when the depositor delivers machine readable data to FSD. Virus checks will be carried out automatically on all received material. The data are most often received in SPSS or Excel format and occasionally in SAS or ASCII format. Depositors are asked to give detailed information about the collection procedure, resulting publications and the research project in general. FSD prefers to obtain the supplementary documentation both in electronic and paper formats. The completed and signed material description (http://www.fsd.uta.fi/english/forms/description.pdf) and the material deposit agreement forms (http://www.fsd.uta.fi/english/forms/deposit.pdf) are necessary as well. 3. User groups and security permissions FSD has a method for controlling who can do what to documents and data folders. The archive staff has been divided into different user groups with appropriate user rights, depending on the work they do. The current user groups are administrators, technical staff, data administrators, data processors, data documentators, data readers and others. Each group has only the security permissions (read, write, modify permissions) they need in their daily work. 4. Data folders and files All electronic information about a dataset is preserved in a dataset folder named FSDXXXX, where XXXX is the FSD's dataset identification number. The main folder of a dataset contains the following files: - codebook (cbFXXXX.pdf) - DDI documentation (study and document description = meFXXXX.xml and variable description = vaFXXXX.xml) - questionnaire (quFXXXX_fin.pdf) - and other document files. Main folder has a subfolder called Original, which contains all the original material received from the depositor. Another subfolder, called Data, has all the datafiles attached to this dataset: - daFXXXX.por - syFXXXX.sps - and possibly a source code for programs needed to convert the original data into a SPSS readable data format. 5. Processing quantitative data 5.1. Data are processed intensively - levels of processing in preparation FSD uses SPSS in data processing and preserves data in portable format. We have chosen SPSS because it provides - at least at present - the best option for preserving data and identification information (that is labels) in the same file. Since SPSS is widely used, converting material to future formats should be manageable. All national studies acquired by FSD are processed intensively - a very time-consuming undertaking. There are two reasons for doing this. First, the number of archived studies so far is manageable. Second, researchers do not usually offer their data. Rather, FSD staff identify studies from, for example, scientific journals and then request the data from the researchers. This way all archived studies are by presumption "important" and deserve intensive processing. However, three-level categorisation is under construction. The process described below is a “ level one” process. 5.2. Checking the data The original data file is preserved without modification as long as necessary for the archiving process. The original datasets are preserved so that data protection rules are not violated. A copy of the original file is used to produce a version suitable for secondary use. We focus on the content, not the "look and feel" of the studies. During the checking process mistakes are corrected and verifications, amendments and additions are made in the syntax file. We confirm that the number of variables and cases match the documentation supplied. We check frequencies to verify variables, valid and invalid values (e.g. missing data, not applicable responses etc.). Variables are renamed corresponding to question numbers. Background variables without their own question numbers are renamed bv1, bv2 etc. FSD has standardised labels of some background variables (e.g. original question: How old are you? --> variable label: Respondent's age). Variable and value labels are constructed based on the questionnaire. Long question texts are shortened but still variable labels often become quite long. We try to keep technical naming conventions or variable names and labels consistent within studies of the same series. Questionnaires often include questions that are directed only at respondents who meet certain requirements. The archive checks these filter conditions. If the data include answers from people who do not belong to the specified target group, the responses are classified as missing data. In order to make sure that the content of the archived data corresponds as closely as possible to that of the questionnaire some variables may have to be dropped or added. A variable is dropped if it is undefined or data security aspects so require. Constructed variables, such as combined variables and sum variables, are usually dropped. However, those constructed variables that are integral to the usability of the data, especially weight variables, are kept - providing that the documentation provided is explicit enough. Nowadays we add three variables to every dataset processed in FSD: fsd_no, fsd_vr and fsd_id. Other new variables are added only if usability so requires. 5.3. Data protection Confidentiality aspects require that personal data are deleted. It is recommended that depositors remove these types of data (names, addresses, birthdays etc.) before delivering the material to the archive. Under certain circumstances, FSD stores materials which contain personal data. In these cases, the archive anonymizes the data before dissemination according to its own guidelines and depositors' instructions. Variables indicating place of residence and business are also problematic. On one hand there is always a risk that a single respondent might be identified, on the other hand the deletion of these types of variables prevents secondary users from conducting regional comparisons, especially if no other regional variables are used. Usually these types of variables are dropped. If necessary, they can be restored. Variables of larger regional units (provinces, districts) are kept. 5.4. Version Control We aim to process the datasets only once. Still, some of them have to be reprocessed. Sometimes additional information about variables is given by depositors, errors are detected or processing procedures updated. For example, datasets processed in the early days of FSD already need repairing. The first final version of a dataset is called version 1.0. If subsequent changes are more or less cosmetic (typing errors etc.) the new version will be called 1.1. In the case of significant changes (e.g. a variable added) the new version will be named 2.0. We add these alterations to the end of the same SPSS syntax file used in processing the dataset in the first place. This is not a long-term solution but has worked so far. 5.5. FSD Syntax Example * FSD Finnish Social Science Data Archive. * FSDXXXX Study Title Year. * SPSS-syntax. * Version 1.0: month year (research officer). * Version 2.0: date month year (research officer). * Warnings:. * Check the saving paths before running the syntax! . *. * The original data file is g-zipped. * Unzip it by double clicking the filename. IMPORT FILE='X:\Data\FsdXXXX\original\nnnn.por'. EXECUTE. * or. * GET FILE= 'X:\Data\FsdXXXX\original\nnnn.sav'. * EXECUTE. SAVE OUTFILE='C:\Temp\daFXXXX.sav'. GET FILE='C:\Temp\daFXXXX.sav'. EXECUTE. COMPUTE fsd_no = XXXX. FORMATS fsd_no (F5.0). EXE. COMPUTE fsd_vr = 1.0. FORMATS fsd_vr (F5.1). EXECUTE. COMPUTE fsd_id = $CASENUM. FORMATS fsd_id (F9.0). EXECUTE. RENAME VARIABLES (asuin1=q1) (sukup2=q2) (ika3=q3) (perhes4=q4). EXECUTE. VARIABLE LABELS fsd_no "[fsd_no] FSD study number" /fsd_vr "[fsd_vr] FSD edition number" /fsd_id "[fsd_id] FSD case id" /q1 "[q1] Respondent's municipality of residence" /q2 "[q2] Respondent's sex" /q3 "[q3] Respondent's age" /q4 "[q4] Respondent's marital status". EXECUTE. VALUE LABELS q1 1 "Kajaani" 2 "Laihia" 3 "Loviisa" 4 "Muonio" 5 "Puumala" 6 "Valkeakoski" /q2 1 "male" 2 "female" /q4 1 "single" 2 "married or cohabiting" 3 "widowed or separated". EXECUTE. RECODE q8_1 (8=SYSMIS). EXECUTE. RECODE q11 ('?'=' '). EXECUTE. MISSING VALUES q8_1 (0). EXECUTE. SAVE OUTFILE='C:\Temp\daFXXXX.sav'/ KEEP= q1 TO q58. EXECUTE. GET FILE='C:\Temp\daFXXXX.sav'. EXECUTE. *. * ## Syntax of version 1.0 ends here. *. * Syntax of version 2.0 starts here. * Commands and comments needed here. *. * ## Syntax of version 2.0 ends here. *. EXPORT OUTFILE='X:\Data\FsdXXXX\data\daFXXXX.por'. EXECUTE. 6. Processing qualitative data FSD started to archive qualitative data in spring 2003. The received data are mainly in electronic format, usually transliterated speech or interaction. We usually preserve the anonymized data in rich text (.rtf), unformatted text (.txt) or HTML format. Hard copy material will be scanned and preserved as TIFF images. 7. Data backup Data are stored on a network drive, and the size of a hard disk partition available for datasets is currently 40 gigabytes. To ensure preservation in case of a hard disk crash, the whole disk is mirrored to a another hard disk with software RAID (Redundant Array of Independent/Inexpensive Disks, Level 2). In addition, modified data are copied every night (working day) to a 3rd hard disk and backupped to DDS-3 (Digital Data Storage) tapes. Backup tapes are rotated after 12 days. We also store 12 monthly backup tapes (taken on the first working day of each month). Backup tapes are stored in a locked fireproof safe. Twice a year a data backup tape is sent to the Finnish National Archives. 8. TIIPII database FSD has used a relational database called Tiipii for storing operational information from the very beginning. First prototype was built with MS Access 97, and new functions have been added about every 6 months. A project aiming to create a new user interface and change the database started in fall 2002. The new user interface was created with Java and the database was changed to PostgreSQL. Use of the new version started this summer. Tiipii2 database has information on all important data archive functions: 1. Archived dataset: name, number, type, level of processing, version, history, publishing date, depositor(s), conditions for use, series information, stage of data processing and documentation, files connected to the dataset 2. Access agreements: research project, financier, purpose of use, delivery format, information on data user(s) 3. Dissemination material: date, type, list of files, cd-rom covers, covering letter 4. Access conditions: permissions from depositors 5. Customer database: contact information 6. Contacts to FSD and made by FSD: phone calls, e-mails, date, contact person, contact chain 7. Visits to FSD and visits made by FSD: name of the event, dates and times, number of people attending, contact information, related visits 8. Folders: paper materials related to a dataset More information and screenshots at: http://www.iassistdata.org/conferences/2004/presentations/C3_sivonen.ppt 9. Data conversions and dissemination Most of our quantitative data customers are familiar with SPSS. About 90 % of researchers wish to have data in this format. The rest wish to have data delivered in Excel or SAS formats. Conversions between different statistical packages are done with Stattransfer software. In addition, we have one Macintosh computer for conversions from the Mac file systems. Qualitative datafiles are disseminated in RTF, TXT and HTML formats. Software programs for qualitative analysis require text formats, but HTML files (with style sheet documents) are more illustrative for the researcher. We deliver hard copy materials as PDF files. 10. Data publishing Using tools created in the archive, FSD’s data documentation is published on the web both in Finnish and in English. DDI format XML files are converted to HTML (www) and PDF files (codebook), and copied to the www server. Our data documentation is also published in the NESSTAR system. Two DDI xml files containing Document and Study Descriptions (meFXXXX.xml) plus File & Variable Descriptions (vaFXXXX.xml) are combined into one XML file and copied to the Nesstar server. We use NSD's command line tools NSDImpNesstar and XMLGenerator for extracting variable information from SPSS portable files. In the near future Nesstar Publisher will probably replace the command line tools and text editor based DDI documenting. See also http://www.fsd.uta.fi/~matti/prujut/Koln2001.pdf http://www.fsd.uta.fi/~matti/prujut/Bukarest2002.pdf

DATA PROCESSING IN FSD - FORSDATA

Related documents

Products

Support

DATA PROCESSING IN FSD - FORSDATA

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib