90-776 Manipulation of Large Data Sets Lab 4 April 7, 1999

Major skills covered in today’s lab:
• Bringing in comma delimited data
• Bringing in column data
Today’s hints:
A) Additional office hours are now available this Sunday and every Sunday through the
end of the month, 2-4 in A100.
B) If you haven’t turned in your project proposal, do so right away!
I. Bringing in Data
Today we’ll get some practice bringing in some data that isn’t as nice and neat as the data
we have been using in the past. The data set l:\academic\90776\data\text\country.txt
contains information for 93 countries.
The file contains the following variables:
country     name of the country
totalpop    total population in the country
capital     capital of the country
cappop      population of the capital city
region      region country is in
yr_ind      year of country’s independence
yr_un       year the country entered the United Nations
religion    religion (if reported)
lang1       major language
lang2       other major language
lang3       other major language
1) Before you can bring the data in, you need to get a sense of what the data looks like
and how it is delimited. So go ahead and look at the data (just double-click on the file
name to open it in Wordpad). How is the data delimited? How are missing values
represented? Do you see any potential problems?
2) Write the code to bring the data into a temporary SAS data set called C1. Remember
to tell SAS what the delimiter is in the INFILE statement. (Did you remember to tell
SAS which variables are character variables?)
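A first attempt at step 2 might look like the sketch below. It assumes the fields appear in the file in the order of the variable list above, and that country, capital, region, religion, and the three language variables are character. Check both assumptions against what you saw when you opened the file.

```sas
/* Sketch only: variable order and types are assumptions taken from the
   variable list above, not confirmed against the raw file. */
data c1;
  /* DLM= names the delimiter that separates the fields */
  infile 'l:\academic\90776\data\text\country.txt' dlm=',';
  /* A $ after a variable name marks it as character */
  input country $ totalpop capital $ cappop region $
        yr_ind yr_un religion $ lang1 $ lang2 $ lang3 $;
run;
```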
3) Let’s make sure that we did everything properly. Look at your log file and include
PRINT and CONTENTS procedures. (The data set is small enough that we can do a
PROC PRINT.) Are there 93 observations? What went wrong?
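The two checking procedures for step 3 can be sketched as follows, assuming the data set is named C1 as in step 2.

```sas
/* List every observation in the temporary data set C1 */
proc print data=c1;
run;

/* Show the number of observations and each variable's type and length */
proc contents data=c1;
run;
```

The observation count also appears in the log, in the note SAS writes after the DATA step.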
4) We have a problem. Some countries have one language listed, some have two listed,
and some have three listed. However, for countries with fewer than three languages,
the data has no “place holders” telling SAS that there is a missing value. For
example, Afghanistan has two languages listed. For the third language, SAS moves to
the next line of the file and reads its first value: Algeria. Unfortunately, Algeria
is the next country, not the third language. If you can fix this problem with SAS
code, please do so and show me how you did it. Otherwise, let’s go back and fix the
data so that we can use it.
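One possible code-only fix, offered as a sketch rather than the expected answer, is the MISSOVER option on the INFILE statement. By default SAS moves to the next line of the file when it runs out of values on the current line. MISSOVER tells it to set the remaining variables to missing instead. The data set name C1B is made up for illustration, and the variable order and types are the same assumptions as before.

```sas
/* Sketch only: C1B is a hypothetical name; variable order and types are
   assumed from the variable list in this handout. */
data c1b;
  /* MISSOVER: if a line ends before all variables are read, set the
     rest to missing rather than reading from the next line */
  infile 'l:\academic\90776\data\text\country.txt' dlm=',' missover;
  input country $ totalpop capital $ cappop region $
        yr_ind yr_un religion $ lang1 $ lang2 $ lang3 $;
run;
```

Note that this leaves lang2 and lang3 blank for the short lines, while the rest of the file uses a dash for missing values, so the Excel route in step 5 gives more consistent data.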
5) Open l:\academic\90776\data\text\country.txt in Excel - you will need to tell Excel
that the file is delimited with commas. Highlight the last two columns (the last 2
language variables) and select Edit, Replace. Leave Find what: blank and enter a
dash (-) in the Replace with: field. This will convert all of the missing values to
dashes (the missing value indicator used in the rest of the data set) for those two
variables. Now save this Excel file as a CSV (comma delimited) file on your own
disk.
6) Now include a new data step in your program to read in the new comma delimited
data set into a temporary SAS data set called C2.
7) Again, perform the PRINT and CONTENTS procedures. What has SAS done to the
character variables? We can see from the CONTENTS output that all of the variables
have the default length of 8, so any character value longer than 8 characters has
been cut off.
8) We know how to fix that problem: we need to add informats to the character
variables that exceed 8 bytes. Let’s create a permanent SAS data set that again reads
in our new comma delimited ASCII data set, but this time include informats. With
this comma delimited data, we can either include an INFORMAT statement for the
variables we want to lengthen, or put the informat in the INPUT statement by
following the variable name with a colon and the informat.
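The colon form described in step 8 might look like the sketch below. The library name LAB4, the paths on a:\, the file name country.csv, and all of the informat widths are assumptions for illustration; use your own disk location and pick widths that fit the longest values you see in the data.

```sas
/* Sketch only: libref, paths, file name, and widths are hypothetical. */
libname lab4 'a:\';

data lab4.country;                 /* two-level name = permanent data set */
  infile 'a:\country.csv' dlm=',';
  /* name :informat reads up to the next delimiter using that informat,
     so the character variables are no longer limited to 8 bytes */
  input country :$30. totalpop capital :$30. cappop region :$20.
        yr_ind yr_un religion :$20. lang1 :$20. lang2 :$20. lang3 :$20.;
run;
```

The alternative is a separate statement such as `informat country $30.;` placed before the INPUT statement, with the variable then listed by name alone in INPUT.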
9) Again do the PRINT and CONTENTS procedures. Note that since some of our
variables are now longer, SAS has wrapped our data onto additional lines. To avoid
this wrapping, we can tell SAS to make the page wider. The page width is set with
the LS (line size) option in an OPTIONS statement, and 150 should be wide enough:
OPTIONS LS=150;
Also, notice that SAS is putting page breaks in the middle of our data set.
We can rectify this by changing the page length. The option for that is PS (page size).
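Both options can go in a single OPTIONS statement placed before the procedures, as in this sketch. The value 200 for PS is an assumption, chosen only because it is longer than our 93 observations plus headings.

```sas
/* LS = characters per output line, PS = lines per output page.
   PS=200 is an illustrative value, not one given in the handout. */
options ls=150 ps=200;
```

An OPTIONS statement stays in effect for the rest of the SAS session, so it only needs to appear once near the top of the program.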
10) Notice your log file: SAS is giving you an “invalid data” note for some of the
numeric variables. This is because the missing values are coded as “?” and “-”.
These are characters, and SAS writes the note because it cannot read them as numeric
values. This isn’t really a problem: SAS just converts the ?s and -s to periods (how
missing values are shown for numeric variables). However, all of those messages in
the log file are annoying. To suppress them, add a “??” after the variable name
(see pp. 288-89 in the text). So go ahead and add ?? after all of the numeric
variables in your INPUT statement from part 8 and check your log file again. The
annoying messages should be gone.
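With the ?? modifier added, the INPUT statement might look like the sketch below. As before, the library name LAB4, the paths, the file name, and the informat widths are assumptions for illustration only.

```sas
/* Sketch only: libref, paths, file name, and widths are hypothetical. */
libname lab4 'a:\';

data lab4.country;
  infile 'a:\country.csv' dlm=',';
  /* ?? after a numeric variable suppresses the invalid-data messages;
     a "?" or "-" in that field is still stored as missing (.) */
  input country :$30. totalpop ?? capital :$30. cappop ?? region :$20.
        yr_ind ?? yr_un ?? religion :$20.
        lang1 :$20. lang2 :$20. lang3 :$20.;
run;
```

Because ?? hides every invalid-data message for those variables, it is worth checking the log once without it to confirm that the only bad values are the missing-value codes.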
We now should have read the data in properly, but we still need to do some cleaning.
We’ll save that for homework 4.
II. Column Data
Try problem 12-5 on page 295 of your text. The “solution” is at the back of the book.