ANALYSIS OF POLICE DISCRIMINATION IN THE UNITED STATES

Miguel Bautista
Daniel Petrov
University at Buffalo
Department of Computer Science and Engineering
April 2021

ABSTRACT

Data on fatal police shootings since 2015 and on US income distribution were collected, parsed, and cleaned in order to analyze potential implicit or explicit discriminatory practices by police officers against certain races. The data were collected by the Washington Post and the Bureau of Economic Analysis, respectively, and analyzed using IDA, EDA, and modeling techniques such as K-NN, Ridge Regression, and K-Means. Initial analysis of the data suggested that no discrimination occurs. However, the classification models are able to classify fatalities by race with reasonable accuracy, which, together with an IDA process that revealed skew in both gender and race, indicates that discrimination may be occurring. Additional data about police interactions that did not end in fatal shootings would be needed to conclude whether there is systemic discrimination. Nevertheless, the available data show a clear disparity in the treatment of minority groups, even when comparing only the racial distribution in the dataset with the racial distribution of the United States population.

Keywords: police, shooting, ml, regression, IDA, EDA, racism

INTRODUCTION

In recent times, news outlets have highlighted an increasing number of stories about police brutality against black Americans. In response to these events, including the death of George Floyd, many protests against police brutality have begun, most notably those of the Black Lives Matter movement.
The purpose of this project is to analyze data on fatal police shootings and to determine whether there is a correlation between race and the chance of being fatally shot by a police officer. In other words, is there a disparity in how each race or demographic is treated by police officers? One of the datasets is from the Washington Post and has about 6000 rows of police shootings, with features such as race, mental illness, whether the individual was armed, and whether the individual was fleeing. The other dataset is from the Bureau of Economic Analysis and contains information on the economic status of counties and states. The analysis of these datasets will yield statistics on the likelihood of a fatal police encounter for black Americans compared with other ethnic groups. More specifically, this report will shed light on issues such as potential racial profiling and over-policing of minority groups in the United States.

DATA SOURCES

1. Data: fatal-police-shootings.csv
   Source: Washington Post
   Time Period: 2015 - 2021
   Scope: United States

2. Data: CAINC1_{state}_1969_2018.csv
   Source: U.S. Bureau of Economic Analysis
   Time Period: 2018
   Scope: United States

The first data source is a compilation of all fatal police shootings that occurred in the United States from 2015 to 2021. It contains features such as age, race, gender, and mental illness. The second data source is a collection of CSV files containing state, county, and country-level economic data, which is used to look for a correlation between the geographic density of fatal police shootings and the net income of an area. The cleaned versions of these datasets are available under the “Cleaned Data” section of the created web application and machine learning model API.

METHODS

All analyses and implementations were done in Python using Pandas, NumPy, and additional libraries such as scikit-learn.
Initial Data Analysis (IDA) was the first step performed and includes data cleaning and data screening. IDA is essential because it focuses on checking assumptions and performing basic hypothesis testing. Data cleaning is necessary to prepare the data for analysis in the modeling phase of the experimentation. Some examples of the cleaning methods used are:

a. Outlier detection
b. Replacing text data with numerical data
c. Removing unnecessary columns
d. Checking for duplicate records

Next, data screening was performed. This is the beginning of statistical analysis and helped address the initial research questions. Examples of the data screening techniques performed are:

a. Analyzing fatalities by gender
b. Adjusting the scaling of the data
c. Normalizing the data
d. Checking skewness
e. Analyzing fatalities by race and age and comparing those to the population distribution

After IDA was completed, Phase 2 was Exploratory Data Analysis (EDA). EDA is an approach to analyzing a dataset and summarizing its main characteristics, primarily through data visualization. Examples of the EDA performed are:

a. Box plot of the association between race and age
b. Map of police shootings by state
c. Bar chart of threat level by race
d. Pie chart of fatalities by race and age
e. Bar chart comparing each race's percentage of the total population with its percentage of shootings
f. Line graph of the rate of shootings over time

Modeling of the data was performed next and is described in the following section.

EXPERIMENTS AND DISCUSSION/ANALYSIS

Phase 3 of the project is the analytics phase, in which models and algorithms were applied to the cleaned data in order to analyze and attempt to answer the hypotheses. A total of five modeling algorithms were applied to the data.

1. Linear Regression was used to determine whether there is a relationship between fatalities and total area income.
The typical news narrative is that lower-income areas and highly populated cities suffer from over-policing and, consequently, higher rates of fatalities by police. We want to see whether there is a correlation between the income of an area and the number of fatalities in that area. After observing the data in a scatter plot, the relationship appeared linear. Furthermore, we would like to model the predicted number of fatalities based on income, which could later be used to create a model related to population.

The calculated R² value (0.84584) indicates a strong positive correlation: areas with a higher total income have more fatalities. However, because the relationship is linear, this suggests that big cities do not have disproportionately more police deaths. If the rate of deaths were higher in populated areas, the relationship would appear exponential and this model would not be viable. Judging by the scatter plot and the regression line, this model provides a decent approximation of total deaths by income.

Figure 1.0: Line and scatter plot showing linear model against dataset points.

2. Ridge Regression was used to find whether the rate of fatal shootings increased as the years went on. 2020 and 2021 have put a spotlight on police brutality, but we aim to verify whether this is something new or whether it has been happening in the past. Is there media bias surrounding police brutality that makes it appear that there have been more deaths by police officers than ever before? After observing the data in a scatter plot, the relationship appeared linear. Because the data showed high multicollinearity, a ridge regression seemed better suited for the task than a plain linear regression.
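For a single predictor, ridge regression differs from ordinary least squares only by a penalty term added to the denominator of the slope. The sketch below illustrates this on cumulative monthly counts with an 80/20 train/test split. It is a minimal illustration using only the Python standard library, not the analysis code used in this project: the monthly counts are synthetic, and the names `monthly_counts`, `cumulative_points`, `fit_ridge_1d`, and the penalty `alpha=1.0` are assumptions made for the example.

```python
def monthly_counts(year_index):
    """Synthetic monthly fatality counts (NOT the Washington Post data)."""
    return [80 + ((7 * month + 3 * year_index) % 11) - 5 for month in range(1, 13)]


def cumulative_points(n_years=5):
    """Return (month, cumulative fatalities so far in that year) pairs."""
    points = []
    for year_index in range(n_years):
        running = 0
        for month, count in enumerate(monthly_counts(year_index), start=1):
            running += count
            points.append((month, running))
    return points


def fit_ridge_1d(xs, ys, alpha=1.0):
    """Closed-form ridge fit for one predictor with an unpenalized intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)  # alpha = 0 reduces to ordinary least squares
    intercept = y_mean - slope * x_mean
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on the given points."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


points = cumulative_points()
# Deterministic 80/20 split: every fifth point is held out for testing.
test = points[::5]
train = [p for i, p in enumerate(points) if i % 5 != 0]

slope, intercept = fit_ridge_1d([x for x, _ in train], [y for _, y in train])
r2 = r_squared([x for x, _ in test], [y for _, y in test], slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} test R^2={r2:.4f}")
```

In this toy setup the slope approximates the average number of fatalities per month. A cumulative count that grows at a steady pace within each year will give a high R² regardless of how the yearly totals compare, so the score mainly reflects how regular the monthly pace is.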
After separating the data frame into the two columns needed, the regression was fit on the cumulative sum of the number of fatalities per month (grouped by year) and the month of that cumulative sum. Next, the data was divided into 80% for training and the remaining 20% for testing. The model can estimate, with high accuracy, the number of fatalities at any given point of the year.

The calculated R² value (0.99808) indicates a strong positive correlation. This suggests that, despite media claims of a dramatic increase in police brutality and fatalities, the number has not changed much over the years. In fact, with such a high correlation, it appears that there has been almost no change, as each point in the scatter plot lies extremely close to the regression line. Judging by the scatter plot and the regression line, this model provides a very close approximation of total deaths at any point in the year.

Figure 1.1: Line and scatter plot showing linear model against dataset points.

3. K-Nearest Neighbors was used to check whether there are any discernible indicators that identify the race of an individual who is killed by a police officer. If there are, this would suggest some inherent or trained bias among police officers that disproportionately affects certain races. The race column is removed from the normalized data, which is then partitioned into 80% training data and 20% test data used to check the accuracy of the algorithm. The race column is saved in order to create a confusion matrix of the K-NN predictions. It is also transformed with the Label Encoder so that the labels are discrete classes rather than continuous values. The accuracy of the model is 71.97%, which is reasonable for a K-NN algorithm, especially with a low volume of data.
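The K-NN procedure described above (normalize, separate the race label, encode it, split 80/20, predict, and tabulate a confusion matrix) can be sketched as follows. This is an illustrative sketch written with the Python standard library only, not the scikit-learn code used in this project. The records are synthetic and well separated by construction, and the feature names (age, area income, armed), the group labels "A", "B", "C", and the choice of k = 5 are all assumptions made for the example.

```python
import math
from collections import Counter


def make_records():
    """Synthetic records: (age, area_income, armed, race). Not real data."""
    centers = {"A": (25, 30), "B": (40, 50), "C": (55, 70)}
    records = []
    for i in range(30):
        for race, (age, income) in centers.items():
            records.append((age + i % 7 - 3, income + i % 5 - 2, i % 2, race))
    return records


def min_max_normalize(rows):
    """Scale every feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def label_encode(labels):
    """Map each class name to an integer code, in sorted order."""
    classes = sorted(set(labels))
    index = {name: code for code, name in enumerate(classes)}
    return [index[name] for name in labels], classes


def knn_predict(train_x, train_y, query, k=5):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


records = make_records()
features = min_max_normalize([r[:3] for r in records])   # race column removed
labels, classes = label_encode([r[3] for r in records])  # race kept as label

# Deterministic 80/20 split: every fifth record is held out for testing.
test_idx = [i for i in range(len(records)) if i % 5 == 0]
train_idx = [i for i in range(len(records)) if i % 5 != 0]
train_x = [features[i] for i in train_idx]
train_y = [labels[i] for i in train_idx]

# Rows are true classes, columns are predicted classes.
confusion = [[0] * len(classes) for _ in classes]
for i in test_idx:
    predicted = knn_predict(train_x, train_y, features[i])
    confusion[labels[i]][predicted] += 1

correct = sum(confusion[c][c] for c in range(len(classes)))
accuracy = correct / len(test_idx)
print(classes, confusion, accuracy)
```

The diagonal of `confusion` counts the correctly labeled test records, so accuracy is the diagonal sum divided by the number of test records. Because the synthetic groups barely overlap, the accuracy here is far higher than would be expected on real data.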
Unfortunately, this result indicates that police interactions with each race have discernible characteristics. This can be seen in the diagonal of the confusion matrix, which shows the correctly labeled data. This means that the determination of threat level, being armed, the income of the area, etc., all play a part in this distinction. Tangentially, this may point to a wider systemic problem, such as over-policing and a lack of training for police officers to keep up with bigger populations and higher crime rates.

Figure 1.2: Confusion matrix showing ML model predictions against the correct label

4. Multinomial Naive Bayes Classification was used to determine whether the people who are targeted or fatally shot by police officers have similar characteristics. If they do, this would suggest some underlying bias among police officers against certain races. The algorithm is trained on the data we provide and classifies an individual's race based on various individual features. If the accuracy of the model is high, it shows that certain races have distinguishing characteristics that allow them to be classified. The model produced an accuracy of 86%. This high accuracy suggests a correlation between race and certain characteristics (having a mental illness, being armed, having a high or low threat level, etc.). This means that the determination of threat level, being armed, the income of the area, etc., all play a part in this distinction. This may indicate an underlying bias or a lack of training among police officers.

5. K-Means Algorithm was used to cluster the data and group it based on a certain attribute. The ideal number of clusters was determined by the elbow method and was indicated to be 6. The number of races in the data is also 6.
This suggests that the algorithm grouped the data by race, which in turn suggests that race plays a role in fatal shootings and that discrimination could be occurring. The total variance of the algorithm was relatively small, at about 14.

Figure 1.3: Elbow method visualization for k selection

Figure 1.4: K-Means cluster visualization

CONCLUSION/RECOMMENDATIONS

Based on the exploratory data analysis and the modeling performed in Phase 3, discrimination by police officers appears to be occurring; however, to distinguish causation from correlation, more comprehensive datasets would be needed. For example, in the EDA we found that the population percentage of each race does not align with the percentage of fatal shootings for each race. African Americans make up about 12% of the population yet were involved in 27% of fatal shootings. This dramatic difference is unsettling because it is an indicator that discrimination may be occurring. However, the bar graph of the association between threat level and race shows African Americans with the highest threat levels of all races, which may explain the high percentage of fatal shootings.

Furthermore, all of the modeling techniques point toward a form of police discrimination. For example, the K-NN and K-Means algorithms showed that police interactions with each race have discernible characteristics. This can be seen in the diagonal of the confusion matrix, which shows the correctly labeled data. This means that the determination of threat level, being armed, the income of the area, etc., all play a part in this distinction. Tangentially, this may point to a wider systemic problem, such as over-policing and a lack of training for police officers to keep up with bigger populations and higher crime rates.
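The population comparison cited in this section (about 12% of the population versus 27% of fatal shootings) can be condensed into a single figure. The short calculation below uses only those two percentages. The variable names are illustrative, and the second ratio assumes that the remaining 88% of the population accounts for the remaining 73% of fatal shootings, which holds only if race is recorded for every fatality.

```python
population_share = 0.12  # share of US population, as cited in the text
shooting_share = 0.27    # share of fatal shootings, as cited in the text

# How over-represented the group is relative to its population share.
representation_ratio = shooting_share / population_share

# Per-capita rate relative to everyone else (assumes complete race data).
rest_ratio = (1 - shooting_share) / (1 - population_share)
relative_rate = representation_ratio / rest_ratio

print(f"representation ratio: {representation_ratio:.2f}")
print(f"rate relative to the rest of the population: {relative_rate:.2f}")
```

A representation ratio of 1 would mean the group's share of fatal shootings matches its share of the population. This calculation does not account for threat level or any other feature in the dataset.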
The modeling, although helpful in providing initial indications, does not allow us to fully conclude whether there is an underlying bias among police officers against certain races. Further analysis needs to be performed using additional data on individuals who have not been fatally shot by police officers but have had some sort of interaction with them. This would allow for a more complete dataset. The classification and modeling algorithms would also be more accurate, as they would be better able to determine whether discrimination based on race is occurring.

REFERENCES

Shane, J. M., Lawton, B., & Swenson, Z. (2017). The prevalence of fatal police shootings by US police, 2015–2016: Patterns and answers from a new data set. Journal of Criminal Justice, 52, 101-111.

Scikit-learn. (n.d.). Retrieved March 29, 2021, from https://scikit-learn.org/stable/

Exploratory data analysis. (n.d.). Retrieved March 29, 2021, from https://www.itl.nist.gov/div898/handbook/eda/eda.htm