90-776 Manipulation of Large Data Sets
Lab 2
March 17, 1999

Major skills covered in today’s lab:
   Describing data statistically
   Describing your data using tables and graphs
   Outputting data sets from procedures
   Formatting your data

I. Bring in the data

For this lab, we will use the SAS Gallup data set you created in the first homework problem of assignment 1. You may want to refresh your memory of the variable definitions, which are listed on the data web page:

   www.andrew.cmu.edu/course/90-776/data/gallup.doc

Please write a short program to do the following:

1) Bring the SAS Gallup data set into a new temporary SAS data set. (Note: if you want to create a permanent SAS data set containing formats, you must also make the format library permanent. To see how to do this, see pp. 309-310 in the text.)

2) Create a new variable called ed that equals 'gs' if the person has less than a high school education, 'hs' if the person is a high school graduate, and 'cl' if the person has more than a high school education.

3) Create variable labels for the following variables: salary, hours, rate.

4) Create value labels for the following variables: location, gend, ed, and rate. (Note: you will have to put your PROC FORMAT procedure before your DATA step. In your DATA step, you will then need to assign the formats to the variables using the FORMAT statement.)

5) Run a contents procedure on your data (notice what SAS has done with your formatting).

6) Produce a frequency distribution of location, gend, ed, and rate. (Note how the value formats changed the output.)

7) Find the number of observations, mean, and standard deviation for all of the variables (use the N, MEAN, and STD options in your PROC MEANS statement). (Note the variables with defined labels.)

II. Using the data

For this part of the lab, you may add your SAS commands to your program above, or just enter the commands interactively.
1) Use PROC MEANS to find the mean salary and a 95% confidence interval around that estimate. (To do this, you need to use the VAR subcommand. You also need to use the MEAN and CLM options in your procedure statement.)

2) Use PROC UNIVARIATE to find the location of the five highest and lowest earning (salary variable) people. (The five largest and smallest values are standard output from PROC UNIVARIATE. To identify where those people are from, you need to include the statement id location; in your procedure.)

3) Re-run the PROC UNIVARIATE from part 2 including a normality test and plots (include the options plot and normal in your procedure statement). Is the data normally distributed (the null hypothesis of the normality test is that the data IS normally distributed)?

4) Re-run the PROC UNIVARIATE from part 3 only for people whose income falls between $1,000 and $100,000. (To limit a procedure to only certain observations of the data set, you use the WHERE command. So, all you need to do is include the following statement in your procedure:
      WHERE 1000 < salary < 100000;
   or you can write it as:
      WHERE salary > 1000 and salary < 100000;
   ) Now is your data normal?

5) Find the mean salary, age, and months unemployed by location using the BY method. (First sort the data by location, then include BY location; in your MEANS procedure.)

6) Find the mean salary, age, and months unemployed by location using the CLASS method. (Include CLASS location; in your MEANS procedure.)

7) Now we want to create a new output (temporary) data set containing the mean salary, age, and months unemployed by location. (To do this, repeat problem 6, but use PROC SUMMARY instead of MEANS and include an output statement:
      OUTPUT OUT=data_set_name mean=m_sal;
   ) Now, look at your new data set using PROC PRINT.
      PROC PRINT data=data_set_name;

8) Create a cross tabulation (two-way table) of the GEND and ED variables. (Use PROC FREQ and the TABLES GEND*ED; subcommand.)
9) Practice making some graphs using PROC CHART with VBAR and HBAR. Throw in some grouping, sumvars and types, too! Try
      proc chart data=data_set_name;
         vbar salary;
      run;
   Now try
      proc chart data=data_set_name;
         vbar salary;
         where 0 < salary < 50000;
      run;

10) Practice using PROC PLOT. For example, plot wage and age:
      PROC PLOT data=data_set_name;
         PLOT wage*age;
      run;
   What do the A’s, B’s, etc. mean?
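
For reference, the BY, CLASS, and OUTPUT approaches from problems 5 through 7 of Part II can be compared side by side. The following is only a rough sketch under stated assumptions: the data set name work.gallup and the variable names age and unemp (months unemployed) are placeholders that do not come from this handout, so check gallup.doc for the actual names. The output names m_age and m_unemp and the NWAY option are also additions for illustration.

```sas
/* Placeholder names (assumptions, not from the handout):            */
/*   work.gallup  = the temporary data set created in Part I         */
/*   age, unemp   = age and months unemployed; see gallup.doc        */

/* BY method: the data must first be sorted by the BY variable. */
proc sort data=work.gallup out=work.gallup_sorted;
   by location;
run;

proc means data=work.gallup_sorted mean;
   var salary age unemp;
   by location;
run;

/* CLASS method: no sort is needed. */
proc means data=work.gallup mean;
   class location;
   var salary age unemp;
run;

/* PROC SUMMARY prints nothing by default; it writes the statistics  */
/* to the OUT= data set. One output name is listed per VAR variable. */
/* NWAY keeps only the per-location rows and drops the overall row.  */
proc summary data=work.gallup nway;
   class location;
   var salary age unemp;
   output out=work.loc_means mean=m_sal m_age m_unemp;
run;

proc print data=work.loc_means;
run;
```

The BY version prints a separate table for each location, while the CLASS version prints one combined table. Compare the two listings to see which layout you find easier to read.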