SOP: How I match genotype and phenotype data from Excel and SPSS files to analyze covariance (ANCOVA)

Stefan Vormfelde

In this SOP I describe how I match genotype and phenotype data from Excel files in an SPSS data file, so that I can analyze covariance (ANCOVA) in SPSS.

Why a routine?

For genotype-phenotype association analysis, SPSS needs the genotype data and the phenotype data in a common sav-file. I combine the respective data from the Excel files into a common sav-file using an SPSS routine. Advantages of using a routine include:

1. fewer mismatches, especially when the data arrive sorted in different orders
2. speed (sometimes)
3. traceability as the data sheets develop further

Why Excel?

I prefer Excel because it gives me the best control while working up genotype and phenotype data. SPSS can also read data from other formats. However, there is usually a way to import those into Excel as well.

Samples

The procedure I describe in this SOP can be reproduced using these files:

gSample_strata.xls (genotype samples in an Excel file)
pSample_strata.xls (phenotype samples in an Excel file)
match_Excel_data_in_SPSS_genotypes_and_phenotypes.xls (to prepare the routine’s syntax)
match_Excel_data_in_SPSS_genotypes_and_phenotypes.sps (the final routine)
repCovariance_dataSheet.spv (the documentation of the run)
repCovariance_dataSheet.sav (the output file I desired)

I prepare the Excel files

I restrict both Excel files to a single header line. However, this may not be necessary.

The SPSS routine can sort the data by more than one criterion, e.g. by subject and also by study center. To make use of this, I have to place the columns containing the sort criteria first. I arrange these columns in the order in which I want the sort criteria applied: first column, first criterion; second column, second criterion; ...

I adjust the sort criteria: The headers of the columns must match between the files, e.g.
“Pat_ID”, “study_center”, … The values in the cells must also match between the files. The matching command will not match e.g. B_1 to “1”, but only B_1 to B_1 and “1” to “1”.

I prepare the sps-file (SPSS routine)

To prepare the routine file, I prepare the syntax in Excel files and afterwards copy and paste it into sps-files (match_Excel_data_in_SPSS_genotypes_and_phenotypes.sps). I follow the instructions in the first column of the xls-file. I may save the file. Finally, I mark and copy (Ctrl+C) the boxed area.

I open SPSS. Then I select the pull-down menu “file”, then “new” and “syntax”. I insert the text (right-click, then “insert”). I may save the file. Ready to go.

I execute the SPSS routine

To execute the routine, I open match_Excel_data_in_SPSS_genotypes_and_phenotypes.sps. I select the pull-down menu “execute” (“Ausführen”) and then “all” (“Alle”). This opens an output file in which I can follow the process and in which warnings and errors are documented.

Warnings: When I run my sample routine on my sample files, I get warnings in the spv-file which correctly point to more than one subject with data but without a subject ID in the sample files. These warnings do not preclude use of the resulting sav-file. I get no other warnings.

The last command stores the output as an spv-file, e.g. “repCovariance_dataSheet.spv”. I keep these spv-files for traceability.

The routine stores the matched data file as a sav-file according to the last command line, e.g. “repCovariance_dataSheet.sav”. This is the result I desired. I keep this sav-file for traceability.

I can now proceed with genotype-phenotype association analysis in the sav-file.
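
The matching rules described in this SOP (identical column headers, exactly identical key values, and sorting by more than one criterion) can be illustrated with a minimal sketch in Python. This is not SPSS syntax and not the routine from the sps-file; it only mimics the idea of matching on key columns. The function name match_rows, the phenotype column “clearance” and all example rows are hypothetical and are not taken from the sample files.

```python
def match_rows(genotypes, phenotypes, keys):
    """Combine two lists of dict rows on the given key columns.

    Key values are compared as exact strings, so "B_1" and "1" never match.
    Returns (matched_rows, warnings), with matched_rows sorted by the keys.
    """
    warnings = []
    combined = {}
    for label, rows in (("genotype", genotypes), ("phenotype", phenotypes)):
        seen = set()
        for row in rows:
            # Raises KeyError if a key header is missing or spelled differently
            key = tuple(row[k] for k in keys)
            if key in seen:
                # Assumption of this sketch: a later row overwrites an earlier
                # one with the same key, and a warning is recorded
                warnings.append(f"duplicate key {key} in {label} data")
            seen.add(key)
            combined.setdefault(key, {}).update(row)
    # Sort by first criterion, then second criterion, and so on
    matched = [combined[key] for key in sorted(combined)]
    return matched, warnings


# Hypothetical example rows (not taken from the sample files)
genotypes = [
    {"Pat_ID": "B_2", "study_center": "1", "genotype": "AG"},
    {"Pat_ID": "B_1", "study_center": "1", "genotype": "AA"},
    {"Pat_ID": "", "study_center": "2", "genotype": "GG"},
    {"Pat_ID": "", "study_center": "2", "genotype": "AG"},
]
phenotypes = [
    {"Pat_ID": "B_1", "study_center": "1", "clearance": "4.2"},
    {"Pat_ID": "2", "study_center": "1", "clearance": "3.9"},
]

matched, warnings = match_rows(genotypes, phenotypes, ["Pat_ID", "study_center"])
for row in matched:
    print(row)
for message in warnings:
    print("Warning:", message)
```

In this sketch, only subject B_1 ends up with both a genotype and a phenotype value. The phenotype row with ID “2” stays separate from genotype row B_2 because the key values differ, and the two rows without a subject ID trigger a duplicate-key warning, similar in spirit to the warnings described for the sample files.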