Shakeria Cohen
08/27/25
Predictive Analytics
Homework #1
How much does education affect wages? The data file cps small.csv contains 1000 observations
on hourly wage rates, education, and other variables from the 1997 Current Population Survey
(CPS).
(a) Obtain summary statistics and histograms for the variables WAGE and EDUC. Discuss the
characteristics.
The wage ranges from $2.03 to $60.19, with an average of $10.21/hr. and a median of $8.79.
Most workers earn between $5.53 (1st quartile) and $12.78 (3rd quartile). The distribution is
skewed to the right because most observations are between about $5 and $15/hr, with a long right
tail up to $60/hr.
Education ranges from 1 to 18 years, with an average of 13.29 and a median of 13. Most workers fall
between 12 and 16 years of education (high school through a four-year college degree). The
histogram shows a more symmetric distribution with clear peaks at 12 (high school) and 16 (college).
(b) Estimate the linear regression WAGE = β1 + β2EDUC + e and interpret the results.
The estimated regression equation is WAGE = -4.91 + 1.14*EDUC.
This means the intercept is -4.91 and the slope is 1.14. So, on average, each additional year of
education is associated with an hourly wage that is about $1.14 higher. The intercept of -4.91 is
not meaningful: it is the predicted wage at zero years of education, but nobody in the sample has
zero years of education and a wage cannot be negative. The model has an R squared of about 0.20,
so education explains around 20% of the variation in wages.
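The slope, intercept, and R squared reported above come from the standard least-squares formulas. The sketch below shows those formulas in plain Python. The function name `ols` and the toy numbers are made up for illustration only; they are not the CPS data and will not reproduce the estimates in this homework.

```python
def ols(x, y):
    """Simple linear regression of y on x; returns (intercept, slope, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sxy / sxx
    # The fitted line passes through the point of sample means.
    b1 = y_bar - b2 * x_bar
    # R squared = 1 - (sum of squared residuals) / (total sum of squares).
    sse = sum((yi - (b1 + b2 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - sse / sst
    return b1, b2, r_squared


# Hypothetical toy data (years of education, hourly wage), invented for illustration.
toy_educ = [10, 12, 12, 14, 16, 18]
toy_wage = [6.0, 8.0, 9.5, 10.0, 14.0, 17.5]

b1, b2, r2 = ols(toy_educ, toy_wage)
print(f"intercept = {b1:.2f}, slope = {b2:.2f}, R squared = {r2:.2f}")
```

With the real data, the same formulas would be applied to the EDUC and WAGE columns of the data file.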
(c) What is the estimated return to education in each case? That is, for a 1% increase in
education, what is the percentage increase in wages for the average worker?
From my regression, the slope b2 = 1.14. This means each additional year of education raises
hourly wages by about $1.14/hr. To put this into percentage terms, I calculate the elasticity at
the sample means: b2 * (mean EDUC / mean WAGE) = 1.14 * 13.29 / 10.21 = 1.48.
This means that at the average worker's wage and education level, a 1% increase in education is
associated with about a 1.5% increase in wages.
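The elasticity arithmetic can be checked in a few lines. This is a minimal sketch that uses only the rounded values reported in this write-up; the variable names are illustrative.

```python
# Elasticity of wage with respect to education, evaluated at the sample means.
# All inputs are the rounded figures reported in the text.
b2 = 1.14          # estimated slope (dollars per extra year of education)
mean_educ = 13.29  # sample mean of EDUC (years)
mean_wage = 10.21  # sample mean of WAGE (dollars per hour)

elasticity = b2 * mean_educ / mean_wage
print(f"Elasticity at the means: {elasticity:.2f}")
```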
(d) Calculate the residuals and plot against EDUC. Are there any patterns evident?
Small residuals mean the model is close, and the big ones mean it missed. When I plotted the
residuals against education, most of the dots stayed pretty close to zero, which is good. But I
could see the spread get bigger at higher levels of education, meaning the model’s errors aren’t
perfectly even (possible heteroskedasticity). I didn’t notice a clear curve or pattern, so using a
straight-line model still makes sense. A few big positive residuals showed up because some
people earn way more than the line predicts.
(e) Construct a histogram of the residuals and perform the Jarque-Bera test for normality. Are the
residuals compatible with the assumption of normality?
A histogram of the residuals lets me check if the errors look roughly normal (bell shaped). When
I plotted mine, the bars were not perfectly symmetric; the shape was skewed to the right with a
long tail. That already hinted that the residuals might not be normal.
To be sure, I ran the Jarque-Bera test. The null hypothesis says the residuals are normal, and the
alternative says they are not. My test gave a really tiny p-value (0.00022), so I rejected the null.
This means the residuals are not normally distributed.
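For reference, the Jarque-Bera statistic combines the sample skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4), and under the null hypothesis it is approximately chi-square with 2 degrees of freedom. The sketch below implements that formula in plain Python. The function name and the toy residuals are hypothetical, invented only to exercise the code; they are not my regression residuals and do not reproduce the p-value above.

```python
import math


def jarque_bera(values):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(values)
    mean = sum(values) / n
    # Central moments, dividing by n as in the usual JB definition.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # For a chi-square variable with 2 degrees of freedom, the upper-tail
    # probability has the closed form exp(-x / 2).
    p_value = math.exp(-jb / 2)
    return jb, p_value


# Hypothetical right-skewed residuals, invented purely for illustration.
toy_residuals = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 9.0]
jb_stat, p_val = jarque_bera(toy_residuals)
print(f"JB = {jb_stat:.2f}, p-value = {p_val:.4f}")
```

A small p-value leads to rejecting normality, which is the decision rule used above. The chi-square approximation is only reliable in large samples, so the toy output should not be read as a meaningful test.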
(f) Interpret R2.
From my regression, the R squared was about 0.20 (20%). R2 tells me how much of the variation
in wages is explained by education. So in this case, about 20% of the differences in people's
wages can be explained by their years of education. The other 80% is due to other factors that
aren't in my model, like work experience, occupation, gender, location, etc.
(g) Predict the wage for a worker with 16 years of education
When I plug in 16 years of education, the prediction is about $13.30/hr. This means the model
expects someone with around a college degree (16 years of schooling) to earn about $13.30 per
hour on average.
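The prediction is just the fitted line evaluated at EDUC = 16. The sketch below uses the rounded coefficients from part (b), so it gives about 13.33 rather than 13.30; the small gap is presumably because the reported prediction was computed from unrounded coefficients.

```python
# Predicted hourly wage at 16 years of education, using the rounded
# coefficients reported in part (b).
b1 = -4.91  # intercept
b2 = 1.14   # slope (dollars per extra year of education)
educ = 16

predicted_wage = b1 + b2 * educ
print(f"Predicted wage at {educ} years of education: ${predicted_wage:.2f}/hr")
```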
1. One would suspect that new home construction and sales depend on mortgage interest rates. If
interest rates are high, fewer people will be able to afford to borrow the funds necessary to
finance the purchase of a new home. Builders are aware of this fact, therefore when interest rates
are high, they will be less inclined to build new homes. A question of interest is “If interest rates
go up by 1% by how much does home construction fall?”. Data on the 30-year fixed mortgage
rate, housing starts (thousands), and houses sold (thousands) are contained in the file house
starts.csv. There are 184 monthly observations from January 1990 to April 2005.
1 (a) Estimate a linear relationship of STARTS on the FIXED RATE. Interpret the intercept and
slope.
The estimated regression equation is STARTS = 2992.74 - 194.23*FIXED RATE.
The intercept is 2992.74. This is the predicted number of housing starts (in thousands) if the
mortgage interest rate were 0%. Like in the food example, it's just the baseline of the line; it
doesn't really make sense in practice, but it's needed for the equation.
The slope is -194.23. This means that if the 30-year fixed mortgage rate goes up by 1 percentage
point, housing starts decrease by about 194,000 units on average (since the variable is in thousands).
To conclude, the regression shows a negative relationship because higher mortgage rates are
linked to fewer housing starts.
(b) Obtain a scatter plot of STARTS against the FIXED RATE. Plot the fitted regression line
along with the scatter plot.
The scatterplot shows that as the 30-year mortgage rate increases, the number of housing starts
goes down. The points clearly slope downward, showing a negative relationship. When I add the
regression line, it fits the overall trend well. Higher interest rates are associated with fewer
housing starts.
(c) Construct a 95% interval estimate for the slope. Interpret the CI for the slope. What does it
mean that we are “95% confident”?
The 95% confidence interval for the slope runs from -214.37 to -174.10. This means I am 95%
confident that if the mortgage rate goes up by 1 percentage point, housing starts decrease by
somewhere between about 174,000 and 214,000 units. When I say 95% confident, I mean that if I
repeated this process many times with different samples, about 95% of the intervals built this
way would include the true slope.
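The interval is the slope estimate plus or minus the critical t value times the standard error. The sketch below uses the rounded standard error (10.21, reported in part (e)) and the rounded critical value (about 1.97, from part (d)), so its endpoints differ from the reported interval in the second decimal place.

```python
# 95% confidence interval for the slope: b2 +/- t_crit * se(b2).
# Inputs are rounded values taken from this write-up.
b2 = -194.23   # slope estimate
se_b2 = 10.21  # standard error of the slope
t_crit = 1.97  # approximate two-sided 5% critical value

margin = t_crit * se_b2
lower = b2 - margin
upper = b2 + margin
print(f"95% CI for the slope: [{lower:.2f}, {upper:.2f}]")
```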
(d) Is there evidence to suggest that there is a significant relationship between the 30 year fixed
rate (FIXED RATE) and the house starts (STARTS)? Use a level of significance of 5%.
i. set up the null and alternative hypothesis
The null hypothesis states that the slope is equal to zero, which means the 30-year fixed
mortgage rate had no effect on housing starts. The alternative hypothesis states that the slope is
not equal to zero, which means mortgage rates do affect housing starts.
ii. show a sketch of the rejection region
At the 5% significance level the critical values are about ±1.97. I drew the t distribution with
both tails shaded in red to represent the rejection regions. Any test statistic that falls beyond
the cutoff points would lead me to reject the null hypothesis.
iii. state your conclusion
My test statistic was -19.03, which is far less than -1.97 and falls well into the left rejection
region. Because the test statistic is far past the critical value, I reject the null hypothesis. I
conclude there is a statistically significant negative relationship between mortgage interest rates
and housing starts.
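The test statistic for a zero slope is the slope estimate divided by its standard error. The sketch below recomputes it from the rounded values in this write-up (the standard error of 10.21 is the one reported in part (e)), so the result matches the reported statistic only up to rounding.

```python
# t test of H0: slope = 0 against H1: slope != 0 at the 5% level.
b2 = -194.23   # slope estimate
se_b2 = 10.21  # standard error of the slope
t_crit = 1.97  # approximate two-sided 5% critical value

t_stat = b2 / se_b2
reject_null = abs(t_stat) > t_crit
print(f"t = {t_stat:.2f}, reject H0: {reject_null}")
```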
iv. calculate the p-value for this test and perform the test using the p-value approach.
Using the p-value approach, I look at the p-value for the slope on fixed rate in my regression
output. It is essentially zero (reported as p-value < 2.2e-16), which is far less
than 0.05. Because the p-value is below the 5% significance level, I reject the null hypothesis.
This confirms that the relationship between the mortgage rate and housing starts is statistically
significant and negative.
(e) Test that if the interest rate increases by 1%, then house starts will fall by 150,000. Use a
level of significance of 5%.
Here the null hypothesis is that the slope is -150, meaning that if the interest rate increases by
1%, housing starts fall by 150,000. The alternative hypothesis is that the slope is not equal to
-150. I calculated the test statistic as t = (b2 - (-150)) / se(b2). With my slope estimate of
-194.23 and standard error of 10.21, the test statistic came out to about -4.33. Its absolute
value is larger than the critical value of about 1.97, so the p-value is below the 5%
significance level and I reject the null hypothesis. This means the data do not support the claim
that housing starts only fall by 150,000 when rates increase by 1%; instead, the fall is
significantly larger, closer to 194,000 units.
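The same t formula works for any hypothesized slope: subtract the hypothesized value before dividing by the standard error. A minimal sketch with the rounded values from this write-up:

```python
# t test of H0: slope = -150 against H1: slope != -150 at the 5% level.
b2 = -194.23         # slope estimate
se_b2 = 10.21        # standard error of the slope
slope_null = -150.0  # hypothesized slope (starts are measured in thousands)
t_crit = 1.97        # approximate two-sided 5% critical value

t_stat = (b2 - slope_null) / se_b2
reject_null = abs(t_stat) > t_crit
print(f"t = {t_stat:.2f}, reject H0: {reject_null}")
```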
(f) Comment on the goodness of fit of your model. Make sure to discuss both R2 and adjusted
R2
For goodness of fit, the R squared is about 0.666 and the adjusted R squared is about 0.664. That
means around 67% of the variation in housing starts can be explained by the mortgage interest
rate. The adjusted R squared is basically the same since I only have one explanatory variable and
184 observations. This shows the model fits pretty well, but there is still about 33% of the
variation in housing starts that comes from other factors not in my regression.
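The adjusted R squared can be recovered from the ordinary R squared with the usual formula 1 - (1 - R2) * (n - 1) / (n - k), where n is the number of observations and k is the number of estimated coefficients. The sketch below assumes k = 2 (intercept and slope) and uses the rounded R squared from this write-up.

```python
# Adjusted R squared from R squared, sample size, and number of coefficients.
r_squared = 0.666  # rounded R squared from the regression output
n = 184            # monthly observations, January 1990 to April 2005
k = 2              # estimated coefficients: intercept and slope (assumed)

adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k)
print(f"Adjusted R squared: {adj_r_squared:.3f}")
```

With a single regressor and 184 observations the penalty factor (n - 1) / (n - k) is very close to 1, which is why the two measures nearly coincide.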