90-776 Manipulation of Large Data Sets Lab 2 March 17, 1999

Major skills covered in today’s lab:
• Describing data statistically
• Describing your data using tables and graphs
• Outputting data sets from procedures
• Formatting your data
I. Bring in the data
For this lab, we will use the SAS Gallup data set you created in the first homework
problem of assignment 1. You may want to refresh your memory with regard to the
variable definitions that are listed on the data web page:
www.andrew.cmu.edu/course/90-776/data/gallup.doc
Please write a short program to do the following:
1) Read the SAS Gallup data set into a new temporary SAS data set. (Note: if you
want to create a permanent SAS data set containing formats, you must also make the
format library permanent. To see how to do this, see pp. 309-310 in the text.)
2) Create a new variable called ed that equals 'gs' if the person has less than a high
school education, 'hs' if the person is a high school graduate, and 'cl' if the person
has more than a high school education.
3) Create variable labels for the following variables: salary, hours, rate.
4) Create value labels for the following variables: location, gend, ed, and rate. (Note:
you will have to put your PROC FORMAT step before your DATA step. In your DATA
step, you will then need to assign the formats to the variables using the FORMAT
statement.)
5) Run PROC CONTENTS on your data (notice what SAS has done with your
formatting).
6) Produce a frequency distribution of location, gend, ed, and rate. (Note how the value
formats change the output.)
7) Find the number of observations, mean, and standard deviation for all of the variables
(use the N, MEAN, and STD options in your PROC MEANS statement). (Note the
variables with defined labels.)
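
One possible shape for such a program is sketched below. It is only a starting point: the library path, the permanent data set name mylib.gallup, the source variable educ (assumed here to hold years of schooling), the coding of gend, and the label wording are all assumptions made for illustration. Take the real names, codes, and definitions from gallup.doc.

```sas
/* Sketch only. Path, data set name, educ, code values, and label text are assumptions. */
libname mylib 'path-to-your-SAS-data';   /* hypothetical location of the Gallup data set */

/* Value labels must exist before the DATA step that assigns them */
proc format;
  value $edfmt  'gs' = 'Less than high school'
                'hs' = 'High school graduate'
                'cl' = 'More than high school';
  value gendfmt 1 = 'Male'               /* assumed numeric coding of gend */
                2 = 'Female';
  /* location and rate need VALUE statements of their own, built the same way */
run;

data gallup;                             /* one-level name, so a temporary data set */
  set mylib.gallup;
  length ed $ 2;
  if educ = . then ed = ' ';             /* a missing value would otherwise pass educ < 12 */
  else if educ < 12 then ed = 'gs';
  else if educ = 12 then ed = 'hs';
  else ed = 'cl';
  label salary = 'Salary (placeholder label)'
        hours  = 'Hours (placeholder label)'
        rate   = 'Rate (placeholder label)';
  format ed $edfmt. gend gendfmt.;
run;

proc contents data=gallup;
run;

proc freq data=gallup;
  tables location gend ed rate;
run;

/* With no VAR statement, PROC MEANS summarizes every numeric variable */
proc means data=gallup n mean std;
run;
```

Testing for a missing educ first matters because SAS treats a missing numeric value as smaller than any number, so it would otherwise be classified as 'gs'.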
II. Using the data
For this part of the lab, you may add your SAS commands to your above program, or just
enter the commands interactively.
1) Use PROC MEANS to find the mean salary and a 95% confidence interval around
that estimate. (To do this, you need to use the VAR statement. You also need to
use the MEAN and CLM options in your procedure statement.)
2) Use PROC UNIVARIATE to find the location of the five highest and lowest earning
(salary variable) people. (The five largest and smallest values are standard output
from PROC UNIVARIATE. To identify where those people are from, you need to
include the statement id location; in your procedure.)
3) Re-run the PROC UNIVARIATE from part 2 including a normality test and plots
(include the options plot and normal in your procedure statement). Is the data
normally distributed (the null hypothesis of the normality test is that the data IS
normally distributed)?
4) Re-run the PROC UNIVARIATE from part 3 only for people whose income falls
between $1,000 and $100,000. (To limit a procedure to only certain observations of
the data set, you use the WHERE statement. So, all you need to do is include the
following statement in your procedure: WHERE 1000 < salary < 100000; or you
can write it as: WHERE salary > 1000 and salary < 100000;) Now is your data
normal?
5) Find the mean salary, age, and months unemployed by location using the BY method.
(First sort the data by location, then include BY location; in your MEANS
procedure.)
6) Find the mean salary, age, and months unemployed by location using the CLASS
method. (Include CLASS location; in your MEANS procedure.)
7) Now we want to create a new temporary output data set containing the mean salary,
age, and months unemployed by location. (To do this, repeat problem 6, but use
PROC SUMMARY instead of MEANS and include an OUTPUT statement:
OUTPUT OUT=data_set_name MEAN=m_sal m_age m_unemp; giving one output name
for each variable in your VAR statement, in the same order.) Now, look at your new
data set using PROC PRINT. (PROC PRINT data=data_set_name; run;)
8) Create a cross tabulation (two-way table) of the GEND and ED variables. (Use
PROC FREQ with the statement TABLES GEND*ED;)
9) Practice making some graphs using PROC CHART with VBAR and HBAR. Throw
in the GROUP=, SUMVAR=, and TYPE= options, too!
Try
  proc chart data=data_set_name;
    vbar salary;
  run;
Now try
  proc chart data=data_set_name;
    vbar salary;
    where 0 < salary < 50000;
  run;
10) Practice using PROC PLOT. For example, plot wage and age:
  proc plot data=data_set_name;
    plot wage*age;
  run;
What do the A’s, B’s, etc. mean?
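
For reference, the hints for the PROC MEANS, UNIVARIATE, SUMMARY, and FREQ exercises in Part II could be assembled roughly as follows. This is a sketch under stated assumptions, not the one right answer: it assumes the temporary data set is named gallup, and age and unemp are hypothetical variable names for age and months unemployed (use the names given in gallup.doc). The output names locmeans, m_sal, m_age, and m_unemp are arbitrary choices.

```sas
/* Sketch only. gallup, age, and unemp are assumed names; check gallup.doc. */

/* Mean salary with a 95% confidence interval (95% is the default level) */
proc means data=gallup mean clm;
  var salary;
run;

/* Extreme values identified by location, with a normality test and plots,
   restricted to salaries between $1,000 and $100,000 */
proc univariate data=gallup normal plot;
  var salary;
  id location;
  where 1000 < salary < 100000;
run;

/* BY method: the data must first be sorted by the BY variable */
proc sort data=gallup;
  by location;
run;

proc means data=gallup mean;
  var salary age unemp;
  by location;
run;

/* CLASS method: no sorting required */
proc means data=gallup mean;
  class location;
  var salary age unemp;
run;

/* Same summary written to a temporary data set instead of the output window */
proc summary data=gallup;
  class location;
  var salary age unemp;
  output out=locmeans mean=m_sal m_age m_unemp;
run;

proc print data=locmeans;
run;

/* Two-way table of gender by education */
proc freq data=gallup;
  tables gend*ed;
run;
```

When you print the PROC SUMMARY output, expect one extra row for the whole sample (_TYPE_ = 0) in addition to one row per location, along with the automatic _TYPE_ and _FREQ_ variables. Adding the NWAY option to the PROC SUMMARY statement keeps only the per-location rows.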