Uploaded by Miguel Bautista

Analysis of Police Discrimination in the United States

Miguel Bautista
Daniel Petrov
University at Buffalo
Department of Computer Science and Engineering
April 2021
ABSTRACT
Data on fatal police shootings since 2015 and on US income distribution was collected, parsed,
and cleaned in order to analyze potential implicit or explicit discriminatory practices by police
officers against certain races. The data was collected by the Washington Post and the Bureau of
Economic Analysis, respectively, and analyzed using a number of IDA, EDA, and modeling
techniques such as Linear Regression, Ridge Regression, K-NN, Multinomial Naive Bayes, and
K-Means. Initial analysis of the data suggested that no discrimination occurs. However, the
classification models are able to classify fatalities by race with decent accuracy, which, coupled
with a thorough IDA process that revealed skewness in both gender and race, indicates that
discrimination may be occurring. Additional data about police interactions that did not end in
fatal shootings would be needed in order to fully analyze and conclude whether there is systemic
discrimination. Nevertheless, the information present shows a clear disparity in the treatment
of minority groups, even based solely on the distribution of race in the dataset versus the
population demographics of the United States.
Keywords: police, shooting, ml, regression, IDA, EDA, racism
INTRODUCTION
In recent times, news outlets have been highlighting an increasing number of stories surrounding
police brutality against black Americans. In response to these events, including the death of
George Floyd, many protests against police brutality have begun, most notably the Black Lives
Matter movement. The purpose of this project is to analyze data on fatal police shootings and to
determine whether there is a correlation between race and the chances of being fatally shot by a
police officer. In other words, is there a disparity in how each race or demographic is
treated by police officers?
One of the datasets being used is from the Washington Post and has about 6000 rows of police
shootings containing features such as race, mental illnesses, whether or not they were armed, or
were fleeing. The other dataset used is from the Bureau of Economic Analysis and contains
information regarding the economic status of counties and states. The analysis of these datasets
will yield statistics on the likelihood of a fatal police encounter for black Americans compared
with other ethnic groups. More specifically, this report will shed light on issues such as potential racial
profiling and over policing of minority groups in the United States.
DATA SOURCES
1. Data: fatal-police-shootings.csv
   Source: Washington Post
   Time Period: 2015 - 2021
   Scope: United States
2. Data: CAINC1_{state}_1969_2018.csv
   Source: U.S. Bureau of Economic Analysis
   Time Period: 2018
   Scope: United States
The first data source is a compilation of all fatal police shootings that occurred in the
United States from 2015 to 2021. It contains features such as age, race, gender, mental illness, etc.
The second data source is a collection of CSV files containing state, county, and country economic
data, which will be used in an effort to find a correlation between the location density of fatal
police shootings and the net income of an area.
The cleaned versions of these datasets are available under the “Cleaned Data” section of the
created web application and machine learning model API.
METHODS
All analyses and implementations were done in Python using Pandas, NumPy, and additional
libraries such as scikit-learn. Initial Data Analysis (IDA) was the first step performed and includes
data cleaning and data screening. IDA is essential because it focuses on checking assumptions
and performing basic hypothesis testing. Data cleaning is necessary in order to prepare the data
for analysis in the modeling phase of the experimentation. Some examples of cleaning methods
used are:
a. Outlier Detection
b. Replacing text data with numerical data
c. Removing unnecessary columns
d. Checking for duplicate records
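The four cleaning steps listed above can be sketched with Pandas as follows. This is a minimal illustration and not the project's actual code: the column names, the sample rows, and the choice of the 1.5 * IQR rule for outlier detection are all assumptions made for the example.

```python
import pandas as pd

# Tiny made-up sample; the column names are assumptions, not the real schema.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 5],
    "name": ["A", "B", "B", "C", "D", "E"],
    "age": [25, 31, 31, 40, 29, 120],
    "gender": ["M", "F", "F", "M", "M", "M"],
    "armed": ["gun", "knife", "knife", "unarmed", "gun", "gun"],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # d. Drop exact duplicate records.
    out = df.drop_duplicates().copy()
    # c. Remove columns that are not needed for modeling.
    out = out.drop(columns=["id", "name"])
    # a. Flag outliers in age with the 1.5 * IQR rule (an assumed choice) and drop them.
    q1, q3 = out["age"].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = out["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    out = out[in_range].copy()
    # b. Replace text categories with integer codes.
    for col in ["gender", "armed"]:
        out[col] = out[col].astype("category").cat.codes
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

In this sample the duplicated record and the implausible age of 120 are dropped, leaving four rows with purely numerical columns.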
Next, data screening was performed, which marks the beginning of statistical analysis and
helped address the initial research questions. Examples of data screening techniques that were
performed are:
a. Analyzing fatalities by gender
b. Adjusting scaling of data
c. Normalizing data
d. Checking skewness
e. Analyzing fatalities by race and age and comparing those to the population distribution
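As a rough sketch of the screening steps, the snippet below normalizes a numeric column, measures its skewness, and computes the share of records per race. The values and column names are invented for illustration, and min-max scaling is an assumed choice of normalization.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "age": [18, 22, 25, 27, 30, 34, 41, 77],
    "race": ["W", "B", "W", "H", "B", "W", "W", "B"],
})

# c. Min-max normalization rescales age to the [0, 1] interval.
age = df["age"]
df["age_norm"] = (age - age.min()) / (age.max() - age.min())

# d. Sample skewness; a positive value means a longer right tail.
age_skew = df["age"].skew()

# e. Share of records per race, to be set beside population shares.
race_share = df["race"].value_counts(normalize=True)

print(age_skew)
print(race_share)
```

The per-race shares produced this way are what would be compared against census population percentages in step e.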
After IDA was completed, Phase 2 was Exploratory Data Analysis (EDA). EDA is used as an
approach to analyzing the data set and summarizing the main characteristics by primarily using
data visualization. Examples of EDA performed are:
a. Box Plot - Association between race and age
b. Map of Police Shootings by State
c. Bar Chart of Threat Level by Race
d. Pie Chart of Fatalities by race and age
e. Bar Chart comparison of total population percentage of race vs percentage of shootings
f. Line Graph of rate of shootings over time
Modeling of the data was performed next, which is described in the following section.
EXPERIMENTS AND DISCUSSION/ANALYSIS
Phase 3 of the project is the analytics phase, where models and algorithms were applied to the
cleaned data in order to analyze and attempt to answer the hypotheses. A total of five
modeling algorithms were applied to the data.
1. Linear Regression was used to determine whether there is a relationship between fatalities
and total area income. The typical news narrative is that lower income areas and highly
populated cities suffer from over-policing and, consequently, higher rates of fatalities by
police. We want to see if there is some correlation between the income of an area and the
number of fatalities in that area.
After observing the data in a scatter plot, the relationship appeared linear. Furthermore, we
would like to model the predicted number of fatalities based on income, which could be used
later to create a model related to population. The calculated R² value (0.84584) indicates a
strong linear fit, and the positive correlation means that areas with higher total income have
more fatalities. However, because the relationship is linear, this suggests that big cities do not
have a disproportionately higher rate of police deaths. If the rate of deaths were higher in
populated areas, the graph would show an exponential relationship and this model would not
be viable. Judging by the scatter plot and the regression line, this model provides a decent
approximation of total deaths by income.
Figure 1.0: Line and scatter plot showing linear model against dataset points.
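A fit of this kind and its R² value can be reproduced in outline with scikit-learn. The sketch below uses synthetic income and fatality numbers (not the project data), so the printed R² will not match the 0.84584 reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data: area income (arbitrary units) vs fatality count.
rng = np.random.default_rng(0)
income = rng.uniform(1.0, 100.0, size=50)
fatalities = 0.5 * income + rng.normal(0.0, 5.0, size=50)

# scikit-learn expects a 2-D feature array, hence the reshape.
X = income.reshape(-1, 1)
model = LinearRegression().fit(X, fatalities)

predicted = model.predict(X)
r2 = r2_score(fatalities, predicted)   # share of variance explained by the line
slope = model.coef_[0]                 # sign gives the direction of the relationship

print(f"slope={slope:.3f}, R^2={r2:.3f}")
```

Note that R² only describes how well a straight line fits the points; the direction of the relationship comes from the sign of the slope.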
2. Ridge Regression was used to find whether the rate of fatal shootings increased as the years
went on. 2020 and 2021 have put a spotlight on police brutality, but we aim to verify whether
the recent spotlight reflects something new or something that has been happening in the past.
Is there media bias surrounding police brutality that makes it appear that there have been
more deaths by police officers than ever before?
After observing the data in a scatter plot, the relationship appeared linear. Because the data
showed high multicollinearity, a ridge regression seemed better suited to the task than an
ordinary linear regression. The data frame was reduced to the two columns needed: the
cumulative sum of the number of fatalities per month (grouped by year) and the month of
that cumulative sum. Next, the data was divided into 80% for training and the remaining 20%
for testing. The fitted model can estimate, with high accuracy, the number of fatalities at
any given point of the year.
According to the calculated R² value (0.99808), there is a strong positive correlation. This
suggests that, despite media claims of a dramatic increase in police brutality and fatalities,
this number has not changed much throughout the years. In fact, with such a high correlation,
it appears that there has been almost no change, as each point in the scatter plot lies extremely
close to the regression line. Judging by the scatter plot and the regression line, this model
provides a very close approximation of total deaths at any point in the year.
Figure 1.1: Line and scatter plot showing linear model against dataset points.
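The procedure described for this model (cumulative monthly totals grouped by year, an 80/20 split, then a ridge fit) might look roughly like the following. The monthly counts are randomly generated, and the column names and the regularization strength `alpha=1.0` are assumptions rather than the settings used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic monthly counts for three years; not the real figures.
rng = np.random.default_rng(1)
records = pd.DataFrame({
    "year": np.repeat([2015, 2016, 2017], 12),
    "month": np.tile(np.arange(1, 13), 3),
    "fatalities": rng.integers(70, 100, size=36),
})

# Running total within each year, restarting every January.
records["cumulative"] = records.groupby("year")["fatalities"].cumsum()

X = records[["month"]]
y = records["cumulative"]

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
test_r2 = ridge.score(X_test, y_test)   # R^2 on the held-out rows

print(f"test R^2 = {test_r2:.4f}")
```

Because a running total within a year can only grow from month to month, a very high R² is to be expected in this setup whenever the monthly counts are roughly steady.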
3. K-Nearest Neighbors was used in an attempt to verify whether there are any discernible
indicators that identify the race of an individual who is killed by a police officer. If there are,
this would suggest some inherent or trained bias among police officers that unfortunately
affects certain races disproportionately. The race column is removed from the normalized data,
which is then partitioned into 80% for the training data and 20% for the test data used to check
the accuracy of the algorithm. The race column is saved in order to create a confusion matrix
for examining the K-NN predictions. Furthermore, it is transformed using the Label Encoder
so that the labels are encoded as discrete integer classes.
The accuracy of the model is 71.97%, which is a reasonable accuracy for a K-NN algorithm,
especially with low volumes of data. Unfortunately, this result suggests that there are
discernible differences between the police interactions of each race. This can be seen when
looking at the confusion matrix diagonal, which shows properly labeled data. This means that
the determination of threat level, being armed, the income of the area, etc., all play a part in
this distinction. Tangentially, this may point to a wider systemic problem, such as
over-policing and a lack of training for police officers to keep up with bigger populations
and higher crime rates.
Figure 1.2: Confusion matrix showing ML model predictions against the correct label
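The K-NN workflow described above (encode the race labels, split 80/20, fit, then inspect the confusion matrix) can be sketched as follows. The features and labels are random placeholders and `n_neighbors=5` is an assumed setting, so the resulting accuracy says nothing about the real data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in: 3 features already scaled to [0, 1] and 3 race labels.
rng = np.random.default_rng(2)
features = rng.random((300, 3))
race = rng.choice(["A", "B", "W"], size=300)

# Map string labels to integer classes 0..n_classes-1.
encoder = LabelEncoder()
labels = encoder.fit_transform(race)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
predicted = knn.predict(X_test)

accuracy = accuracy_score(y_test, predicted)
# Rows are true classes, columns are predicted classes;
# the diagonal counts correct predictions.
matrix = confusion_matrix(y_test, predicted, labels=[0, 1, 2])

print(accuracy)
print(matrix)
```

Since the labels here are random, the accuracy should sit near chance. On real data, comparing the accuracy against the share of the most common class is one way to put a figure such as 71.97% in context.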
4. Multinomial Naive Bayes Classification was used to determine whether the people who are
targeted or fatally shot by police officers have similar characteristics. If they do, this would
suggest that there is some sort of underlying bias from police officers against certain races. The
algorithm is trained on the data we provide it and classifies an individual's race based on
various individual features. If the accuracy of the model is high, it will show that certain
races have distinguishing characteristics that allow them to be classified. The model
produced an accuracy of 86%. The high model accuracy shows that there seems to be a
correlation between race and certain characteristics (having a mental illness, being armed,
having a high/low threat level, etc.). This means that the determination of threat level, being
armed, the income of the area, etc., all play a part in this distinction. This may show that
there is an underlying bias or lack of training among police officers.
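A minimal sketch of such a classifier with scikit-learn's MultinomialNB is given below. The binary indicator features and the labels are randomly generated placeholders (the feature meanings in the comment are assumptions), so the accuracy here will not resemble the 86% reported above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative indicator features (for example armed, fleeing,
# mental illness, high threat level) and three placeholder race classes.
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(400, 4))
y = rng.integers(0, 3, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# MultinomialNB requires non-negative feature values such as counts or 0/1 flags.
nb = MultinomialNB().fit(X_train, y_train)

accuracy = nb.score(X_test, y_test)
# Per-class probabilities for the first five held-out records.
probabilities = nb.predict_proba(X_test[:5])

print(accuracy)
print(probabilities)
```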
5. K-Means Algorithm was used as a method to cluster the data and group it based on a certain
attribute. The ideal number of clusters was determined by the elbow method, which
indicated 6. The number of races in the data is also 6. This suggests that the algorithm
may have grouped the data based on the attribute of race, which in turn suggests that race
plays a role in fatal shootings and that discrimination could be occurring. The total variance
of the algorithm was relatively small, at around 14.
Figure 1.3: Elbow method visualization for k selection
Figure 1.4: K-Means cluster visualization
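The elbow method mentioned above can be sketched by fitting K-Means for a range of k and recording the within-cluster sum of squares (exposed by scikit-learn as `inertia_`). The points below are synthetic, drawn around three made-up centers, so the elbow appears at k = 3 here rather than at 6.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points around three assumed centers.
rng = np.random.default_rng(4)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(0.0, 0.5, size=(50, 2)) for c in centers])

# Fit K-Means for k = 1..7 and record the within-cluster sum of squares.
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(km.inertia_)

# The "elbow" is the k after which inertia stops dropping sharply.
for k, value in enumerate(inertias, start=1):
    print(k, round(value, 1))
```

The elbow only reflects the geometry of the features. One way to check whether the resulting clusters actually line up with race would be to cross-tabulate the cluster labels against the race column, for instance with `pandas.crosstab`.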
CONCLUSION/RECOMMENDATIONS
Based on the exploratory data analysis and the modeling performed in Phase 3, discrimination by
police officers appears to be occurring; however, to establish causation rather than correlation,
more comprehensive datasets would be needed. For example, in the EDA we found that the population
percentage of each race does not align with the percentage of fatal shootings for each race.
African Americans make up about 12% of the population yet were involved in 27% of fatal
shootings. The dramatic difference is unsettling because it is an indicator that there may be
discrimination occurring. However, the bar graph that shows Association Between Threat Level
and Race shows African Americans with the highest threat levels of all races, which may explain
the high percentage of fatal shootings.
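The size of that disparity can be made concrete with a simple ratio of the two percentages quoted above. The function name below is hypothetical and the calculation is only a back-of-the-envelope illustration.

```python
def representation_ratio(shooting_share: float, population_share: float) -> float:
    """Ratio of a group's share of fatal shootings to its share of the population."""
    return shooting_share / population_share

# Figures quoted in the text: about 12% of the population, 27% of fatal shootings.
ratio = representation_ratio(0.27, 0.12)
print(f"{ratio:.2f}")  # 2.25
```

A ratio of 1 would mean a group appears in the data in proportion to its population; here the share of fatal shootings is a little over twice the population share.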
Furthermore, all of the modeling techniques point toward a form of police discrimination. For
example, the K-NN and K-Means algorithms showed that there are discernible differences
between the police interactions of each race. This can be seen when looking at the confusion matrix
diagonal which shows properly labeled data. This means that the determination of threat level,
being armed, the income of the area, etc., all play a part in this distinction. Tangentially, this may
show some wider form of a systemic problem such as over-policing and lack of training of those
police officers to keep up with bigger populations and higher crime rates. The modeling,
although helpful in providing initial indications of the data, cannot allow us to fully conclude if
there is an underlying bias by police officers against certain races.
Further analysis needs to be performed using additional data that includes information on
individuals who have not been fatally shot by police officers but have had some sort of
interaction with them. This would allow for a more complete data set. The classification
and modeling algorithms would then also be better able to determine whether
discrimination based on race is occurring.
REFERENCES
Shane, J. M., Lawton, B., & Swenson, Z. (2017). The prevalence of fatal police shootings by US
police, 2015–2016: Patterns and answers from a new data set. Journal of Criminal Justice, 52,
101-111.
scikit-learn. (n.d.). Retrieved March 29, 2021, from https://scikit-learn.org/stable/
Exploratory data analysis. (n.d.). Retrieved March 29, 2021, from
https://www.itl.nist.gov/div898/handbook/eda/eda.htm