90-776 Manipulation of Large Data Sets
Lab 6
April 21, 1999
Major Skills covered in today’s lab:
 Using SAS to compute statistics
Today’s hint:
“Man without statistics is like a fish without a bicycle.”
Yet another data set! For this lab, we will use the data set CEN80.SD2 that is on the
course LAN account: l:\academic\90776\data\. This data set has some 1980 decennial
census data at the ZIP code level for approximately 10 states. All of the variables in the
data set are labeled, so the first thing to do is a contents procedure.
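A contents listing might be requested as in the minimal sketch below. The libref name COURSE is an assumption made for illustration; the path is the one given above. A SAS release newer than the one this lab was written for may need an explicit engine on the LIBNAME statement to read a .SD2 file.

```sas
/* Assumption: COURSE is a made-up libref; any valid libref name works. */
libname course 'l:\academic\90776\data\';

/* List the variables in CEN80 along with their types and labels. */
proc contents data=course.cen80;
run;
```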
I. Correlations
1) Find the correlation among the total number of people in poverty (BPOV80), the total
value of owner-occupied housing (VLOOC80), and the total gross rent of renter-occupied
housing (VLRN80). What do you expect the relationship to be among the
three variables? What do you find? Are the correlations significant (P-values small)?
2) Why do you think more people in poverty is positively correlated with higher housing
values and rents?
3) Next, create new variables (in a new temporary data set) that measure
A) The percent of people in poverty: POVPCT = BPOV80/BSPOV80
B) Average housing value: VLOOCAV = VLOOC80/HUOOC80
C) Average rental price: VLRNAV = VLRN80/RNTOT80
You may want to label these variables.
These variables are the values divided by the totals, and they give an average value
for that ZIP code. This solves the problem from above (more people means more
people in poverty; more houses means greater total housing value, etc.). When
working with aggregate data, it is very important to create the proper averages!
4) Check your log file. Notice all of the error messages. For some of the ZIPs, BSPOV80,
HUOOC80, or RNTOT80 is zero, and SAS cannot divide by zero. Change your program to
keep an observation from the permanent data set only if all three of those variables
are greater than zero. (Do this with an IF statement right after the SET statement.)
5) Next, calculate the correlations among the three variables you created in (3). Do the
correlations now appear to be more believable?
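One way the steps in this section could fit together is sketched below. It is an illustration under stated assumptions, not the required solution: the libref COURSE and the data set name RATES are made up, and the denominators use the variable names as written in step 3.

```sas
/* Assumption: COURSE is a made-up libref pointing at the course data. */
libname course 'l:\academic\90776\data\';

/* Step 1: correlations among the raw totals. By default PROC CORR   */
/* prints each Pearson correlation with its p-value.                 */
proc corr data=course.cen80;
   var bpov80 vlooc80 vlrn80;
run;

/* Steps 3 and 4: build the ratio variables in a temporary data set. */
/* RATES is a hypothetical name. The subsetting IF right after SET   */
/* drops ZIPs whose denominators are zero, so no division by zero.   */
data work.rates;
   set course.cen80;
   if bspov80 > 0 and huooc80 > 0 and rntot80 > 0;
   povpct  = bpov80  / bspov80;   /* share of people in poverty      */
   vloocav = vlooc80 / huooc80;   /* average owner-occupied value    */
   vlrnav  = vlrn80  / rntot80;   /* average gross rent              */
   label povpct  = 'Share of people in poverty'
         vloocav = 'Average value of owner-occupied housing'
         vlrnav  = 'Average gross rent';
run;

/* Step 5: correlations among the ratios instead of the totals. */
proc corr data=work.rates;
   var povpct vloocav vlrnav;
run;
```

Comparing the two PROC CORR outputs side by side shows how much of the first set of correlations came from ZIP code size alone.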
II. Testing Means
Let’s see if average poverty rates, housing values, and rents in Pennsylvania are different
from those in the rest of the country.
1) Create a new temporary data set that sets the data you created in part I. In this data
set, create a dummy variable called PA that equals 1 if the observation is from
Pennsylvania, and equals 0 otherwise. If the observation is from Pennsylvania, its
ZIP code will be between 15000 and 19699.
2) What percent of the observations are from PA? Do a means procedure of the PA
dummy to find out.
3) Use PROC TTEST to test whether POVPCT, VLOOCAV, and VLRNAV are different in
PA than in the rest of the states. (Use CLASS PA to tell SAS to test the means from
PA against the means from the rest of the country.) Are the means significantly
different?
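The three steps of this section might look like the sketch below. Several names are assumptions: WORK.RATES stands for whatever data set you created in part I, PASET is a made-up name for the new data set, and ZIP is a hypothetical name for a numeric ZIP code variable (check your contents listing for the real name and type). The lower bound is taken to be 15000, where the Pennsylvania ZIP code range begins.

```sas
/* Step 1: add the PA dummy. A comparison in SAS evaluates to 1 when */
/* true and 0 when false, so no IF/ELSE is needed.                   */
/* Assumptions: WORK.RATES is the part I data set; ZIP is numeric.   */
data work.paset;
   set work.rates;
   pa = (15000 <= zip <= 19699);
run;

/* Step 2: the mean of a 0/1 dummy is the share of observations      */
/* that have the value 1, here the share of ZIPs in Pennsylvania.    */
proc means data=work.paset n mean;
   var pa;
run;

/* Step 3: two-sample t tests comparing PA = 1 against PA = 0. */
proc ttest data=work.paset;
   class pa;
   var povpct vloocav vlrnav;
run;
```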
III. Estimating Regressions
Let’s see if we can predict the price of housing based on the income in a ZIP.
1) Create another temporary data set that sets the data set you created in part II. In this
data set, create an average or per-capita income variable: INCAV = TINC80/TOT_P80;
2) Regress VLOOCAV on INCAV (in PROC REG, the syntax is
MODEL VLOOCAV = INCAV;). Also, produce a plot of housing values against income
(in PROC REG, the syntax is PLOT VLOOCAV*INCAV;). Is
the coefficient on INCAV significant?
3) Re-run your program, this time including the predicted regression line in your plot.
(The variable name is PREDICTED., with the trailing period, and the /OVERLAY option
lets you plot both the actual data points and the regression line on the same plot.) Also,
add a second plot to the same regression step that plots the residuals against income:
PLOT RESIDUAL.*INCAV = "o";
4) Next, include the percent of people in poverty (POVPCT) as a second explanatory
variable in your regression. Is the coefficient on this variable significantly
different from zero? Also, include a test of the null hypothesis that both coefficients
are jointly equal to zero (in your PROC REG step, include the statement
TEST INCAV = 0, POVPCT = 0;). Do you reject the null hypothesis? Look at the value of the
F-statistic that is calculated as a result of the test. Compare it to the F VALUE
that is reported in the SAS regression output. What test do you think this F value is
reported for? (Hint: it is a test of the overall significance of the regression.)
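The regression steps could be laid out as in the sketch below, using PROC REG, the SAS linear regression procedure. The data set names are assumptions: WORK.PASET stands for the part II data set and REGDATA is made up. The joint hypothesis is written as two equations separated by a comma, which is the form the TEST statement accepts.

```sas
/* Step 1: add per-capita income.                                */
/* Assumption: WORK.PASET is the data set created in part II.    */
data work.regdata;
   set work.paset;
   incav = tinc80 / tot_p80;
run;

/* Steps 2 and 3: simple regression with two plots.              */
proc reg data=work.regdata;
   model vloocav = incav;
   /* Actual values and the fitted line on the same axes.        */
   plot vloocav*incav predicted.*incav / overlay;
   /* Residuals against income, plotted with the symbol o.       */
   plot residual.*incav = 'o';
run;
quit;

/* Step 4: add POVPCT and test both slopes jointly.              */
proc reg data=work.regdata;
   model vloocav = incav povpct;
   test incav = 0, povpct = 0;
run;
quit;
```

Running the two models as separate PROC REG steps keeps each set of output easy to find in the listing.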