December 13, 2004

Violent Crime in America

Introduction

Violent crime in the United States is an important subject, particularly in New York City, where people perceive the risk of being victimized by crime to be relatively high. As residents of New York City, we find that the risk of violent crime affects the way we live our lives, whether or not we actually become victims of a crime. We have to think twice about traveling alone on the subway late at night, or jogging in Central Park after dark. Therefore, in thinking about the quality of our lives here, we wonder what societal factors must be in place in order to live in a more peaceful world, where the risk of being a violent crime victim would be lower (or maybe we should just move out of the city).

Our data analysis project analyzes violent crime in America. We will determine the most important statistical drivers of violent crime over the period 1970-2002. We are interested in other environmental and societal factors that fluctuate from year to year and that may be correlated with the rate of violent crime. Are there factors that we assume are correlated but really are not? Are there factors that we assume to have no association with violent crime but that really do? We aim to draw conclusions about what factors must be in place in order for violent crime to be reduced over the next 30 years.

The Data

To analyze the violent crime rate and its drivers, we have collected the following data:

Data                                  Source                        Frequency  Timeframe
Violent crime rate (target variable)  Bureau of Justice Statistics  Annual     1960-2002
Unemployment rate                     Census Bureau                 Monthly    1960-2002
Federal Prison population             Federal Bureau of Prisons     Annual     1970-2002
Poverty rate                          Census Bureau                 Annual     1960-2002
Economic growth - GDP                 Bureau of Economic Analysis   Annual     1960-2002

In collecting the data, we have already faced several issues.
First, we had expected to analyze data for 1964-2003; however, several of the data series are not available as far back as 1964, so we will limit the analysis to 1970-2002. Specifically, we were unable to find data on the total prison population going back to the 1960s; therefore, we have chosen to use the Federal prison population instead, since this data series extends back to 1970. Although it is less ideal than the total U.S. prison population, we believe the Federal data series may add to our understanding.

The second issue is that we have chosen to analyze annual observations; however, the unemployment data seems to be available only on a monthly basis, so we had to transform it to annual data. Since we don't have weightings, we annualized it by calculating an unweighted mean of the monthly data. This transformation could potentially have a negative effect on the validity of our conclusions.

We also expected an issue with the data for 2001 if victims of 9/11 were counted as victims of violent crime, but upon further analysis, they were not. Had they been counted, our other variables would not have been as relevant to the 2001 crime rate as they are in other years.

Finally, our data is in different units: some series are rates (crime rate, unemployment rate) that fluctuate over time, while some are absolute numbers (prison population) that tend to grow over time. We may need to transform our data to make regression analysis more meaningful.

Expected Outcome

Through an analysis of the data we expect to find that the unemployment rate is correlated with the violent crime rate and that higher unemployment produces higher violent crime. This is because unemployment produces lower income, which may drive crime related to robbery. We expect that a higher poverty rate will be associated with higher crime for the same reason. We expect that when GDP is lower or falling, violent crime will rise.
We expect higher prison population to be associated with lower violent crime because those most likely to commit violent crime are incarcerated. We believe that by statistically analyzing the violent crime rate and its potential drivers, we can increase our understanding of crime and of the factors associated with a lower incidence of it.

General Observation of the Variables

We will begin our analysis with an examination of the descriptive statistics as well as a histogram for each of our variables. This will enable us to determine whether or not the data is normally distributed and to see if there are any variables that may cause problems when we go deeper into our statistical analysis of the data. The descriptive statistics are as follows:

Descriptive Statistics

Variable             Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Violent Crime ra    565.9     18.8   107.9    363.5    491.2    556.6    636.9    758.1
Total Prison Pop    48598     5937   34105    19023    21654    30104    75453   128090
Avg Annual Unemp    6.285    0.244   1.404    4.000    5.350    6.000    7.200    9.700
Poverty for Fami   10.361    0.190   1.091    8.700    9.300   10.300   11.300   12.300
GDP in billions      4883      511    2937     1039     2163     4463     7235

As is apparent in the data above, some of the variables seem to be fairly normally distributed, as the mean and median for those variables are similar to each other. This is supported by each of the histograms we looked at as well. The exceptions are the variables Total Prison Population and GDP, which both have a higher mean relative to the median. This lack of normality is apparent in the histograms of each of these variables, as seen below.
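As a quick consistency check on the descriptive statistics, the standard error of the mean should equal the standard deviation divided by the square root of the number of observations. The sketch below assumes n = 33 annual observations (1970 through 2002 inclusive) and uses the StDev and SE Mean values reported above; the helper name `se_mean` is ours and is not part of the Minitab output.

```python
import math

# Assumption: 33 annual observations, 1970 through 2002 inclusive.
N_OBS = 33

# (StDev, reported SE Mean) pairs copied from the descriptive statistics.
REPORTED = {
    "Violent Crime rate": (107.9, 18.8),
    "Total Prison Pop": (34105, 5937),
    "Avg Annual Unemployment Rate": (1.404, 0.244),
    "Poverty for Families": (1.091, 0.190),
    "GDP in billions": (2937, 511),
}


def se_mean(stdev, n_obs):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return stdev / math.sqrt(n_obs)


for name, (stdev, reported_se) in REPORTED.items():
    computed = se_mean(stdev, N_OBS)
    print(f"{name}: computed {computed:.3f}, reported {reported_se}")
```

If the assumed sample size is right, each computed value should agree with the reported SE Mean up to rounding.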
[Figures: frequency histograms of Violent Crime rate, Total Prison Pop, Avg Annual Unemployment Rate, Poverty for Families, and GDP in billions of current dollars]

Because Total Prison Population has a long right tail, we decided to perform a transformation by taking a log base 10 of the data in order to see if that would help create a more normal distribution. We also logged the GDP data, since it is money data. As is apparent from the histograms of the logged data, this transformation did not seem to sufficiently affect the distribution of the data.

[Figures: frequency histograms of LogT Prison Pop and LogT GDP]

This may have to do with the fact that these are time series data, fixing which is beyond the scope of this project. While taking the logs for Prison Population and GDP did not make them normally distributed, we decided to continue using this logged data in the rest of our analysis.

We also examined correlations among our variables, substituting our two transformed variables for their original variables. The best regressions arise when the predictor variables are highly correlated with the target variable but not with each other. In our data, the poverty rate and log of GDP are highly correlated with the violent crime rate; however, several pairs of predictor variables are highly correlated with one another.
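Returning to the log transformation of the prison population, a rough way to see why it helped only partly is to compare the length of the upper tail with the length of the lower tail before and after logging. The sketch below uses the minimum, median, and maximum reported in the descriptive statistics. The `tail_ratio` measure is our own crude illustration, not something computed in the original analysis; a value of 1.0 would indicate a symmetric spread around the median.

```python
import math

# Minimum, median, and maximum of Total Prison Pop, from the descriptive statistics.
PRISON_MIN, PRISON_MEDIAN, PRISON_MAX = 19023, 30104, 128090


def tail_ratio(low, mid, high):
    """Upper-tail length divided by lower-tail length (1.0 means symmetric)."""
    return (high - mid) / (mid - low)


raw_ratio = tail_ratio(PRISON_MIN, PRISON_MEDIAN, PRISON_MAX)
log_ratio = tail_ratio(
    math.log10(PRISON_MIN), math.log10(PRISON_MEDIAN), math.log10(PRISON_MAX)
)

print(f"tail ratio, raw data:   {raw_ratio:.2f}")
print(f"tail ratio, log10 data: {log_ratio:.2f}")
```

The logged ratio is smaller than the raw ratio but still above 1, which is consistent with the observation that the transformation shortened the right tail without removing it.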
Correlations

              Violent Crim  Avg Annual U  Poverty for  LogT Prison
Avg Annual U         0.129
Poverty for          0.656         0.596
LogT Prison          0.395        -0.516        0.017
LogT GDP             0.647        -0.277        0.284        0.888

Single Variable Regressions

While we are ultimately concerned with how all the variables together predict violent crime, we first examine how each one, on its own, relates to our target. To do this, we created a scatter plot with a fitted regression line for each of the predictor variables against the target of violent crime rate, as summarized below.

[Figures: four fitted line plots of Violent Crime rate against each predictor]

Violent Crime rate = 503.5 + 9.94 Avg Annual Unemployment Rate   (S = 108.676, R-Sq = 1.7%, R-Sq(adj) = 0.0%)
Violent Crime rate = - 106.2 + 64.87 Poverty for Families         (S = 82.6857, R-Sq = 43.1%, R-Sq(adj) = 41.2%)
Violent Crime rate = - 130.9 + 151.7 LogT Prison Pop              (S = 100.662, R-Sq = 15.6%, R-Sq(adj) = 12.9%)
Violent Crime rate = - 249.2 + 226.7 LogT GDP                     (S = 83.5545, R-Sq = 41.9%, R-Sq(adj) = 40.0%)

Looking at the slope of each fitted line, all of the variables appear to have a positive relationship with the target, indicating that as each variable increases, the violent crime rate increases as well. That being said, no one variable alone has a very strong correlation with the violent crime rate. For instance, the spread of the violent crime rate around the fitted line for the log of GDP increases over time. We therefore conclude at this point that each variable on its own is not a good predictor of violent crime.
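The single-variable results above are tied together by two standard identities: in a simple regression the slope equals the correlation times the ratio of standard deviations, and R-Sq equals the squared correlation. The sketch below checks this for the poverty rate using the correlation (0.656), means, and standard deviations reported earlier. The function names are ours, and small differences from the Minitab output are expected because the inputs are rounded.

```python
def slope_from_correlation(r, sd_y, sd_x):
    """Simple-regression slope implied by a correlation and two standard deviations."""
    return r * sd_y / sd_x


def intercept_from_means(mean_y, mean_x, slope):
    """The fitted line passes through the point of means."""
    return mean_y - slope * mean_x


# Reported summary values: violent crime rate (y) and poverty for families (x).
R_POVERTY = 0.656
MEAN_CRIME, SD_CRIME = 565.9, 107.9
MEAN_POVERTY, SD_POVERTY = 10.361, 1.091

slope = slope_from_correlation(R_POVERTY, SD_CRIME, SD_POVERTY)
intercept = intercept_from_means(MEAN_CRIME, MEAN_POVERTY, slope)
r_squared = R_POVERTY ** 2

print(f"slope     = {slope:.2f}")
print(f"intercept = {intercept:.1f}")
print(f"R-Sq      = {100 * r_squared:.1f}%")
```

The computed values land close to the fitted equation for Poverty for Families (slope 64.87, intercept -106.2) and its R-Sq of about 43%.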
It is our hope that when these variables act together, the relationship will be stronger and that, as a group, they will be better predictors of the violent crime rate. To determine this, we move on to the next step in analyzing the data: a multiple regression model.

Initial Multiple Regression

Next we ran a multiple regression of the violent crime rate on our four predictor variables (Avg Annual Unemployment Rate, Poverty for Families, log of GDP Current Dollars, and log of Federal Prison Population). The regression output is given below.

Regression Analysis

The regression equation is
Violent Crime rate = - 96 - 16.0 Avg Annual Unemployment Rate + 52.8 Poverty for Families - 200 LogT Prison Pop + 316 LogT GDP

Predictor                        Coef  SE Coef      T      P
Constant                        -96.2    302.4  -0.32  0.753
Avg Annual Unemployment Rate   -16.05    13.16  -1.22  0.233
Poverty for Families            52.77    15.87   3.32  0.002
LogT Prison Pop                -199.9    110.7  -1.81  0.082
LogT GDP                       315.51    97.71   3.23  0.003

S = 63.1435   R-Sq = 70.0%   R-Sq(adj) = 65.7%

Looking at the coefficients of this regression equation, we learn, for example, that holding all else fixed, a one point increase in the poverty rate is associated with a 52.77 point increase in the violent crime rate. Similarly, the coefficient of the log of the prison population tells us that every one point increase in the log of the prison population is associated with a 199.9 point decrease in the violent crime rate. Interestingly, an increase in the unemployment rate is associated with a decrease in the violent crime rate, and an increase in the logged GDP is associated with an increase in the violent crime rate.

Next, the regression model succeeded in reducing the noise in the violent crime rate from a standard deviation of 107.9 before the regression to a standard error of regression of 63.1. This means that, roughly 95% of the time, we expect our regression model to predict the crime rate to within 2*63.1, or about 126 points.
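The T column and the "within 2*63.1" statement can both be reproduced from the reported output. The sketch below recomputes each t statistic as the coefficient divided by its standard error and computes the rough 95% prediction band as two standard errors of regression. The cutoff of 2 for |t| is a rule of thumb that we assume here for illustration; the exact critical value depends on the residual degrees of freedom.

```python
# (Coef, SE Coef) pairs copied from the regression output.
COEFFICIENTS = {
    "Constant": (-96.2, 302.4),
    "Avg Annual Unemployment Rate": (-16.05, 13.16),
    "Poverty for Families": (52.77, 15.87),
    "LogT Prison Pop": (-199.9, 110.7),
    "LogT GDP": (315.51, 97.71),
}
S_REGRESSION = 63.1435  # standard error of regression


def t_statistic(coef, se_coef):
    """t statistic for the hypothesis that the true coefficient is zero."""
    return coef / se_coef


t_values = {name: t_statistic(c, se) for name, (c, se) in COEFFICIENTS.items()}
half_width = 2 * S_REGRESSION  # rough 95% band around a prediction

for name, t in t_values.items():
    # Rule-of-thumb assumption: |t| below about 2 is not significant at the 5% level.
    flag = "" if abs(t) >= 2 else "  (|t| < 2)"
    print(f"{name}: t = {t:.2f}{flag}")
print(f"approximate 95% band: +/- {half_width:.1f}")
```

Under this rule of thumb the constant, the unemployment rate, and the log of the prison population fall below the cutoff, which matches the p-values above 0.05 in the output.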
The lower standard error indicates that a prediction of violent crime using this regression equation would be much more accurate than an estimate based solely on its historical mean and variance.

In addition to looking at the standard error, it is also important to examine the degree to which these four variables explain the variance in the violent crime rate. To do this we looked at the adjusted R-Sq, which indicates that the four predictor variables account for 65.7% of the variance in the violent crime rate. It is difficult for us to tell at this time whether this R-Sq is better or worse than that of other models that attempt to explain crime.

Finally, we considered the T and P values of the predictor variables to determine whether each is significant to the regression equation. There are two variables for which the p-value is above 0.05 (the log of the prison population and the unemployment rate); therefore, these variables appear statistically insignificant to the model. This indicates that perhaps these variables could be removed without much reduction in model power.

Assumptions

Linear regression involves four major assumptions, and this regression violates two of the four. The first assumption is that the expected value of the error terms for all observations is equal to zero. Judging by the Residuals Versus the Fitted Values plot below, the expected value of the error terms appears approximately equal to zero. Also, there are no known subgroups whose fitted values are systematically above or below the regression line. We believe this first assumption holds.

The second assumption is homoscedasticity: that the variance of the error terms is the same for all observations. That assumption does not hold in this regression. The Residuals Versus the Fitted Values plot shows that the variance is not constant; it is larger for larger fitted values.

The third assumption is that the residual of one term tells us nothing about the residual of another term.
This assumption is violated in this regression, as it is in many regressions of time series data. The Residuals Versus the Order of the Data plot shows that each residual is related to the residual of the prior observation.

The fourth assumption of linear regression is that the residuals are normally distributed. The plots Normal Probability Plot of the Residuals and Histogram of the Residuals show that the residuals are approximately normal; therefore this assumption holds for this regression.

[Figure: Residual Plots for Violent Crime rate: Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data]

[Figures: Residuals Versus LogT GDP; Residuals Versus LogT Prison Pop; Residuals Versus Poverty for Families; Residuals Versus Avg Annual Unemployment Rate (response is Violent Crime rate)]

In addition to considering the four assumptions, we also looked for any outliers in the data by more closely examining the Normal Probability Plot of the Residuals. We noticed a couple of outliers toward the very top of the graph. Upon analysis of these outliers, we believe they occurred due to the relative increase in the crime rate during the early 1990s and do not feel it necessary to remove the data points from our model at this time.
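The third assumption was judged above by eye from the Residuals Versus the Order of the Data plot. One common numerical companion to that plot, which the original analysis did not use and which we add here only as an illustration, is the Durbin-Watson statistic: values near 2 suggest no first-order serial correlation, values well below 2 suggest that neighboring residuals move together, and values well above 2 suggest that they alternate. The residual series below are hypothetical, not taken from our regression.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that drift slowly, as time series residuals often do.
drifting = [-40, -30, -15, -5, 10, 25, 35, 30, 10, -20]

# Hypothetical residuals that flip sign every period.
alternating = [30, -30, 30, -30, 30, -30]

print(f"drifting residuals:    DW = {durbin_watson(drifting):.2f}")
print(f"alternating residuals: DW = {durbin_watson(alternating):.2f}")
```

Residuals like the drifting series, where each value sits close to the one before it, are the pattern described for this regression.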
Improving the Model

Several factors indicate that our initial model may not be the best model possible with our predictor variables. First, two variables, the unemployment rate and the log of prison population, have p-values above 0.05. Second, our model violates two of the four assumptions of linear regression. To improve the model, we ran a "best subsets" regression, the output of which follows.

Best Subsets Regression

Response is Violent Crime rate
A=Avg Annual Unemployment Rate  B=Poverty for Families  C=LogT Prison Population  D=LogT GDP

Vars  R-Sq  R-Sq(adj)  Mallows C-p       S  A  B  C  D
   1  43.1       41.2         24.2  82.686     X
   2  66.2       63.9          4.6  64.792     X     X
   3  68.4       65.2          4.5  63.672     X  X  X
   4  70.0       65.7          5.0  63.143  X  X  X  X

The best subsets analysis indicates that only two variables are necessary to reach an adjusted R-Sq of 63.9%, whereas our four-variable equation had an adjusted R-Sq of 65.7%, a very small difference. The two variables that add so little power to the model are the unemployment rate and the log of the prison population; these are the same two variables with high p-values in our initial regression. We believe that by eliminating these two variables, the model will strike the best balance between model power and complexity. Our optimal model then is as follows.

Regression Analysis

The regression equation is
Violent Crime rate = - 592 + 50.8 Poverty for Families + 176 LogT GDP

Predictor                Coef  SE Coef      T      P
Constant               -591.8    153.2  -3.86  0.001
Poverty for Families    50.81    10.94   4.64  0.000
LogT GDP               175.59    38.79   4.53  0.000

S = 64.7915   R-Sq = 66.2%   R-Sq(adj) = 63.9%

This new model explains 63.9% of the variance in the violent crime rate (as indicated by the adjusted R-Sq). The original noise in our target variable was a standard deviation of 107.9; our model reduces the noise in the target variable to 64.8 (the standard error of regression). Both predictor variables are significant to the model (as indicated by p-values less than 0.05).
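The trade-off between power and complexity that the best subsets output summarizes can be seen directly in the adjusted R-Sq formula, which penalizes R-Sq for each added predictor. The sketch below recomputes the adjusted values from the reported R-Sq figures, assuming n = 33 annual observations (1970 through 2002 inclusive). Differences in the last digit are expected because the reported R-Sq values are rounded.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-Sq: 1 - (1 - R-Sq) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Assumption: 33 annual observations, 1970 through 2002 inclusive.
N_OBS = 33

# (number of predictors, reported R-Sq) for the best model of each size.
BEST_SUBSETS = [(1, 0.431), (2, 0.662), (3, 0.684), (4, 0.700)]

for n_predictors, r_squared in BEST_SUBSETS:
    adjusted = adjusted_r_squared(r_squared, N_OBS, n_predictors)
    print(f"{n_predictors} predictors: R-Sq = {100 * r_squared:.1f}%, "
          f"adjusted = {100 * adjusted:.1f}%")
```

Going from two predictors to four raises R-Sq by almost four points but the adjusted value by less than two, which is the small gain referred to above.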
The equation tells us that, all else held constant, a one point increase in the poverty rate is associated with a 50.81 point increase in the violent crime rate. Similarly, a one point increase in the log of GDP is associated with a 175.59 point increase in the violent crime rate.

This new model conforms to the four assumptions of linear regression better than our initial model did. It does not violate the first assumption (expected value of error terms equal to zero), as seen in the plots below. This regression does violate the second assumption (homoscedasticity), since the variance of the residuals is higher for larger fitted values, but the variance is more constant than in our initial model. This regression also violates the third assumption (residuals tell us nothing about one another), since it is a time series. The fourth assumption (normality of residuals) is not violated by this regression equation. While not exactly normal, the residuals are approximately normal and certainly more normal than the residuals of our initial regression equation. In sum, our improved model still violates two of the four linear regression assumptions, the same two as our initial model, but it violates them less severely.
[Figure: Residual Plots for Violent Crime rate: Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data]

[Figures: Residuals Versus Poverty for Families; Residuals Versus LogT GDP (response is Violent Crime rate)]

Initial Conclusion and Original Expectations

First let us take a look at the nature of the relationship of the national violent crime rate with each of the predictor variables, based on the multiple regression model we ran. In half of the cases the direction of the relationship matched our expectations, and in the other half the relationship was the opposite of what we had expected. As stated earlier, we had assumed that an increase in GDP would be associated with a decrease in the crime rate; this does not seem to be the case, based on the positive coefficient for the logged GDP. There is actually a positive rather than negative relationship between the two: an increase in GDP is associated with an increase in the violent crime rate. Additionally, we had expected that an increase in the unemployment rate would be associated with an increase in the violent crime rate. However, based on the negative coefficient for unemployment, an increase in unemployment, in our model, is actually associated with a decrease in violent crime.

The other two variables do in fact have the relationships we assumed they would have. An increase in the poverty rate correlates with an increase in the violent crime rate, as indicated by the positive coefficient for the poverty rate.
In addition, as we had assumed, an increase in the prison population is associated with a decrease in the crime rate. These associations, of course, assume all other variables are held constant.

More importantly perhaps, we chose these four variables under the assumption, prior to statistically analyzing the data, that all four variables together would serve as a fairly good predictor of the national violent crime rate. The multiple regression model does not fully support our original expectations. To begin with, in order to strengthen our analysis we had to remove two of the four variables completely: the unemployment rate and the prison population. We now believe that the national rate of violent crime for the period 1970-2002 is best explained by the poverty rate and the level of GDP. That said, violent crime is quite difficult to predict using the data we have analyzed thus far. Therefore, we decided to try one last thing in our effort to predict the national violent crime rate.

Incorporating a Lagged Variable

We considered the possibility that the best predictor of the violent crime rate may be the violent crime rate of the prior year. To examine this we first ran a correlation between the violent crime rate and the lag (by one period) of the violent crime rate.

Correlations: Violent Crime rate, Lag of Violent Crime Rate

Pearson correlation of Violent Crime rate and Lag of Violent Crime Rate = 0.957

This very high correlation of 0.957 tells us that the violent crime rate in one period is likely to have predictive power for the violent crime rate of the next period. We next constructed a second best subsets regression, this time including the lag variable.
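The lag variable and its correlation with the original series can be built as sketched below. Shifting a series by one period leaves the first observation without a prior-year value, so that case is dropped from any calculation that uses the lag. The crime-rate values in the example are hypothetical placeholders, not our data, and the helper names `lag` and `pearson` are ours.

```python
import math


def lag(series, periods=1):
    """Shift a series forward so each entry holds the value from `periods` earlier."""
    return [None] * periods + list(series[:-periods])


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical annual rates, for illustration only.
rates = [400.0, 420.0, 450.0, 470.0, 500.0, 480.0]
lagged = lag(rates)

usable = sum(1 for value in lagged if value is not None)
print(f"{usable} cases used, {len(rates) - usable} case contains a missing value")
print(f"correlation with one-period lag: {pearson(rates, lagged):.3f}")
```

With 33 annual observations, the same construction leaves 32 usable cases.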
Best Subsets Regression

Response is Violent Crime rate
32 cases used, 1 cases contain missing values
Candidate predictors: A=Avg Annual Unemployment Rate  B=Poverty for Families  C=LogT Prison Population  D=LogT GDP  E=Lag of Violent Crime Rate

Vars  R-Sq  R-Sq(adj)  Mallows C-p       S
   1  91.5       91.2         12.4  30.531
   2  93.3       92.9          5.8  27.557
   3  94.1       93.5          3.9  26.284
   4  94.4       93.6          4.5  26.105
   5  94.5       93.5          6.0  26.326

The result was surprising: a regression with only the lag variable had an adjusted R-Sq of 91.2%, far higher than the 63.9% adjusted R-Sq of our previous best subsets model. Once the lag variable was included, the other variables added little additional power. As a result, our new best model has only the lag of the violent crime rate as a predictor. The regression output for this model is below.

Regression Analysis: Violent Crime rate versus Lag of Violent Crime Rate

The regression equation is
Violent Crime rate = 56.8 + 0.907 Lag of Violent Crime Rate

32 cases used, 1 cases contain missing values

Predictor                      Coef  SE Coef      T      P
Constant                      56.83    29.13   1.95  0.061
Lag of Violent Crime Rate   0.90719  0.05039  18.00  0.000

S = 30.5314   R-Sq = 91.5%   R-Sq(adj) = 91.2%

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       1  302126  302126  324.11  0.000
Residual Error  30   27965     932
Total           31  330091

[Figure: Residual Plots for Violent Crime rate: Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data]

The coefficient tells us that each one point increase in the violent crime rate is associated with a 0.907 point increase in the violent crime rate for the following year. This regression reduces the noise of the response variable to a standard error of 30.5 from an original standard deviation of 107.9.
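The lag regression can be used as a one-step forecast, and its R-Sq and F statistic can be recovered from the analysis of variance table. The sketch below does both using the reported coefficients and sums of squares. The input rate of 500.0 is a hypothetical value chosen for illustration, and `forecast_next_year` is our own helper name.

```python
# Coefficients copied from the lag regression output.
INTERCEPT = 56.83
SLOPE = 0.90719

# Sums of squares and residual degrees of freedom from the ANOVA table.
SS_REGRESSION = 302126
SS_RESIDUAL = 27965
SS_TOTAL = 330091
DF_RESIDUAL = 30


def forecast_next_year(current_rate):
    """One-step-ahead prediction of the violent crime rate from this year's rate."""
    return INTERCEPT + SLOPE * current_rate


r_squared = SS_REGRESSION / SS_TOTAL
# One predictor, so the regression mean square equals the regression sum of squares.
f_statistic = SS_REGRESSION / (SS_RESIDUAL / DF_RESIDUAL)

# Hypothetical current-year rate, for illustration only.
example_rate = 500.0
print(f"forecast for next year: {forecast_next_year(example_rate):.1f}")
print(f"R-Sq = {100 * r_squared:.1f}%")
print(f"F    = {f_statistic:.2f}")
```

Because the slope is below 1 and the intercept is positive, the forecast always pulls the next year's rate part of the way toward a middle level rather than repeating the current value exactly.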
The adjusted R-Sq tells us that the regression explains 91.2% of the variance in the violent crime rate. The p-value for the predictor variable tells us that, if the true coefficient were zero, the probability of observing a coefficient this far from zero would be less than 0.0005. Since this is now a one-variable regression, the F statistic and its associated p-value tell us the same information as the p-value of the coefficient.

Our new regression violates two of the four assumptions of linear regression. It does not violate the first assumption, since the expected value of the residuals appears close to zero. The second assumption is violated since the residuals exhibit non-constant variance; the variance increases for larger fitted values. The regression violates the third assumption since each residual value is related to the residual of the prior year. Our regression does not violate the fourth assumption since the residuals are approximately normally distributed.

Implications

Our analysis has taught us three lessons. First, we learned that the poverty rate and the size of the economy are each more highly correlated with the violent crime rate than the unemployment rate and the federal prison population are. Second, we confirmed that a higher poverty rate is associated with a higher rate of violent crime and learned that a larger economy is associated with a higher rate of violent crime. Third, we learned that the most effective data for predicting the violent crime rate is in fact the rate itself, from the previous year. In conclusion, it seems reasonable to suppose that since the violent crime rate has fallen every year for the past ten years, it may do so next year as well. As for predicting the national violent crime rate based on our original four variables, we found that it is quite difficult to do.