90-776 Manipulation of Large Data Sets
Lab 4
April 7, 1999

Major skills covered in today’s lab:
    Bringing in comma-delimited data
    Bringing in column data

Today’s hints:
A) Check out the new additional office hours, coming this Sunday and every Sunday until the end of the month: 2-4 in A100.
B) If you haven’t turned in your project proposal, do so right away!

I. Bringing in Data

Today we’ll get some practice bringing in data that isn’t as nice and neat as the data we have used in the past. The data set l:\academic\90776\data\text\country.txt contains information for 93 countries. The file contains the following variables:

    country     name of the country
    totalpop    total population in the country
    capital     capital of the country
    cappop      population of the capital city
    region      region country is in
    yr_ind      year of country’s independence
    yr_un       year the country entered the United Nations
    religion    religion (if reported)
    lang1       major language
    lang2       other major language
    lang3       other major language

1) Before you can bring the data in, you need to get a sense of what the data look like and how they are delimited. So go ahead and look at the data (just double-click on the file name to open it in Wordpad). How is the data delimited? How are missing values represented? Do you see any potential problems?

2) Write the code to bring the data into a temporary SAS data set called C1. Remember to tell SAS what the delimiter is in the INFILE statement. (Did you remember to tell SAS which variables are character variables?)

3) Let’s make sure that we did everything properly. Look at your log file, and add PRINT and CONTENTS procedures to your program. (The data set is small enough that we can do a PROC PRINT.) Are there 93 observations? What went wrong?

4) We have a problem. Some countries have one language listed, some have two, and some have three. However, for countries with fewer than three languages, the file has no “place holders” telling SAS that a value is missing.
For example, Afghanistan has two languages listed. To find a third language, SAS moves on to the next line of the file and reads its first value: Algeria. Unfortunately, Algeria is the next country, not the third language. If you can fix this problem with SAS code, please do so and show me how you did it. Otherwise, let’s go back and fix the data so that we can use it.

5) Open l:\academic\90776\data\text\country.txt in Excel. You will need to tell Excel that the file is delimited with commas. Highlight the last two columns (the last two language variables) and select Edit, Replace. Leave Find what: blank and enter a dash (-) in the Replace with: field. This will convert all of the missing values in those two variables to dashes (the missing value indicator used in the rest of the data set). Now save this Excel file as a CSV (comma delimited) file on your own disk.

6) Now add a new data step to your program that reads the new comma-delimited file into a temporary SAS data set called C2.

7) Again, perform the PRINT and CONTENTS procedures. What has SAS done to the character variables? We can see from the CONTENTS output that all of the variables are at the default length of 8.

8) We know how to fix that problem: we need to add informats to the character variables that exceed 8 bytes. Let’s create a permanent SAS data set that again reads in our new comma-delimited ASCII file, but this time include informats. With comma-delimited data, we can either include an INFORMAT statement for the variables that need one, or put the informat in the INPUT statement by following the variable name with a colon and the informat.

9) Again do the PRINT and CONTENTS procedures. Note that since some of our variables are now longer, SAS has wrapped our data onto additional lines. To avoid this wrapping, we can tell SAS to make the page wider.
The page width is set with the LS option in an OPTIONS statement; 150 should be wide enough:

    OPTIONS LS=150;

Also, notice that SAS is putting page breaks in the middle of our data set. We can fix this by changing the page length. The option for that is PS.

10) Look at your log file. SAS is giving you an “invalid data” note for some of the numeric variables. This is because the missing values are coded as “?” and “-”. These are characters, and SAS gives you the note because it cannot read characters into numeric variables. This isn’t really a problem: SAS just converts the ?s and -s to periods (how missing values are stored for numeric variables). However, all of those messages in the log file are annoying. To suppress them, add “??” after the variable name (see pp. 288-89 in the text). So go ahead and add ?? after each of the numeric variables in your INPUT statement from part 8 and check your log file again. The annoying messages should be gone.

We should now have read the data in properly, but we still need to do some cleaning. We’ll save that for homework 4.

II. Column data

Try problem 12-5 on page 295 of your text. The “solution” is at the back of the book.
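
As a recap of the list-input ideas from Part I, here is a minimal sketch that puts the pieces together: a delimiter on the INFILE statement, colon informats for long character variables, ?? after numeric variables, and the LS and PS options. It is an illustration only, not the lab solution. The data set name c_demo, the two data rows, the informat widths, and the PS value are all assumptions made up for this example, and the rows are typed in with DATALINES so that the sketch does not depend on country.txt.

```sas
* Sketch only. The rows below are invented and do not come from country.txt;
options ls=150 ps=60;            /* wider and longer output pages (PS value assumed) */

data c_demo;                     /* hypothetical data set name */
    infile datalines dlm=',';    /* tell SAS the delimiter is a comma */
    input country  : $20.        /* colon + informat: up to 20 characters */
          totalpop ??            /* ?? suppresses the invalid data notes */
          capital  : $20.
          cappop   ??
          lang1    : $15.
          lang2    : $15.
          lang3    : $15.;
    datalines;
Alphaland,1200000,Alpha City,?,Alphan,Betan,-
Betastan,-,Port Beta,300000,Betan,-,-
;
run;

proc print data=c_demo;          /* check the values */
run;

proc contents data=c_demo;       /* check the variable types and lengths */
run;
```

In this sketch the “?” and “-” in the numeric fields should come through as missing (.) without notes in the log, and CONTENTS should report the character lengths set by the informats rather than the default of 8. To read a file instead of in-line data, replace DATALINES in the INFILE statement with the quoted path to your own CSV file and drop the DATALINES block.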