RNR 416F-516F – Lab 8: Logistic Regression
During lab six, you discovered that there was some clustering of the payday loan centers,
although the reason for the clustering was less clear. You are wondering if it is evidence
of targeting particular communities. To test this, you decide to create a regression model
that you hope will answer the question. One drawback to this week’s tasks: your
assistant, Channing, is unable to help you. After that little accident last week, she is
taking time off. Because of her laziness, you will be forced to do all the work yourself.
Your resentment grows by the minute.
Commands
In order to complete your tasks, you will need to use a variety of tools. The following is
a short list of tools you might find helpful for your work today. Remember to try to work
through the help files before you ask for assistance.
Tools                   Module
Copy                    Toolbox
Create Random Points    Toolbox
Append                  Toolbox
Identity                Toolbox
Join Field              Toolbox
Processing Steps
This lab will be handled in five sections: 1) data preparation; 2) logistic regression in
SPSS; 3) making the regression model; 4) model assessment; and 5) data analysis/story telling.
Data Preparation
In this section, you will examine the geodatabase and get it ready for the work you will
be doing.
Your goal is to have a point feature class that has the study and the control groups
identified by ones and zeros, and contains all the demographic data from the tracts feature
class. There are many ways to get to this point. You can find your own way, or you can
follow the instructions below to do it as I did it. One thing you should note is that you
will all get different results in this lab because you are all creating your own, unique
random sample.
1. Add a field called pdl to the PLC feature class. You will use this field to
differentiate between the study and the control groups.
2. Do the records in the PLC feature class represent the study or the control group?
Based on your answer to this question, calculate the appropriate value into this
field. If you are unsure, ask your neighbor. That way you can be unsure together.
3. Create a feature class of 1000 random points. Call the output feature class
ran_sam. Make sure you set the constraining extent to tracts. This is a point
where students often make a mistake, so before you continue make sure your new
feature class actually has 1000 points.
4. Add a field called pdl to the ran_sam feature class. You will use this field to
differentiate between the study and the control groups.
5. Do the records in the ran_sam feature class represent the study or the control
group? Based on your answer to this question, calculate the appropriate value
into this field. If you are unsure, ask your neighbor. That way you can be wrong
together.
6. For the regression, you will need to have the study and control groups in the same
feature class. Copy PLC to a new feature class. Call this feature class
regression_points.
7. Append ran_sam to regression_points. Make sure to set the schema type to no
test. Check the table for regression_points. You should have 1071 records –
1000 zeros and 71 ones in the pdl field.
8. Although you were consumed with jealousy that Channing was able to lie around
watching TV last night (Game of Thrones, probably), you spent the evening getting
the tracts ready to create a regression model. You used census attributes to create
variables that match the Payday Loan Industry’s published strategy for locating
PLC stores. Check the Tracts feature class to make sure that it contains the
correct fields:
• Percent household income between $25K and $50K
• Percent households with children
• Percent female head of household
• Percent high school diploma only
• Percent age less than 45
• Percent renter occupied housing
You should also find empty fields ready to receive values weighted by regression
coefficients, as well as a field called model, and one called prob_mod.
9. Remembering how much work it was to collect these data and create the
appropriate fields, you decide to make a copy of this feature class – just in case
something bad happens to it. Call it safety_tracts.
10. Finally, use an overlay operation to attach data from tracts to regression_points.
The output should be called regression_data. (If you would rather script this
preparation, a minimal sketch follows this list.)
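If you prefer to script the data preparation rather than run each tool by hand, here is a
minimal arcpy sketch of steps 1–7, 9, and 10 (step 8 is a manual check of the tracts fields).
The geodatabase path is hypothetical, and the feature class and field names are the ones
used above; treat this as a sketch of one possible workflow, not the required one.

    # Sketch of the data-preparation steps using arcpy.
    import arcpy

    arcpy.env.workspace = r"C:\lab_8\lab_8.gdb"   # hypothetical path to your GDB

    # Steps 1-2: the PLC stores are the study group, so pdl = 1.
    arcpy.management.AddField("PLC", "pdl", "SHORT")
    arcpy.management.CalculateField("PLC", "pdl", "1")

    # Step 3: 1000 random points, with the constraining extent set to tracts.
    arcpy.management.CreateRandomPoints(arcpy.env.workspace, "ran_sam",
                                        constraining_extent="tracts",
                                        number_of_points_or_field=1000)

    # Steps 4-5: the random points are the control group, so pdl = 0.
    arcpy.management.AddField("ran_sam", "pdl", "SHORT")
    arcpy.management.CalculateField("ran_sam", "pdl", "0")

    # Steps 6-7: combine the two groups in regression_points.
    arcpy.management.Copy("PLC", "regression_points")
    arcpy.management.Append("ran_sam", "regression_points", schema_type="NO_TEST")

    # Step 9: safety copy of the prepared tracts.
    arcpy.management.Copy("tracts", "safety_tracts")

    # Step 10: overlay to attach tract attributes to the combined points.
    arcpy.analysis.Identity("regression_points", "tracts", "regression_data")

After running it, check that ran_sam really has 1000 points and that regression_data has
1071 records, just as described above.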
Logistic Regression in SPSS
Now that your data is in good shape, you are ready to perform a logistic regression. You
will use SPSS to do the regression.
1. In ArcMap, export your table, regression_data, as a dBASE file. Call it
regression_data.dbf.
2. Open SPSS
3. In SPSS open regression_data.dbf.
4. To begin the regression, follow the menu to Analyze > Regression > Binary
Logistic.
5. This should start a dialog where you will enter dependent and independent
variables. Your dependent variable will be the field with the ones and zeros in it.
The independent variables will be the six demographic data fields identified
above.
6. Examine the results of the regression, particularly the R² value (SPSS reports Cox
& Snell and Nagelkerke pseudo-R² for binary logistic regression) and the
significance of the coefficients. What do these results suggest about the
relationship between payday loan centers and the socioeconomic variables used
in the regression?
7. If you like, you should also feel free to experiment a bit. You might want to try
some of the stepwise options to see if you can improve your model.
8. Once you are done playing, export the results of the regression to a Word
document named regression_results.doc. Check to make sure your results were
exported and are available to you, then close SPSS. (If you want to sanity-check
the coefficients outside SPSS, see the optional sketch after this list.)
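SPSS is the required tool here, but if you would like to double-check the coefficients it
reports, the following optional Python sketch fits the same model with statsmodels. It
assumes you have also exported the regression_data table to a CSV file (the file name
regression_data.csv is hypothetical) and that the truncated dBASE field names shown in
my table below match yours.

    # Optional cross-check of the binary logistic regression outside SPSS.
    import pandas as pd
    import statsmodels.api as sm

    data = pd.read_csv("regression_data.csv")      # hypothetical CSV export of the table

    predictors = ["perc_hh_in", "perc_hh_w_", "perc_fem_h",
                  "perc_hs_di", "perc_age_L", "perc_rente"]

    # Drop records with null demographic values before fitting.
    data = data.dropna(subset=predictors + ["pdl"])

    X = sm.add_constant(data[predictors])          # adds the intercept term
    y = data["pdl"]                                # 1 = PLC store, 0 = random point

    result = sm.Logit(y, X).fit()
    print(result.summary())                        # compare with the SPSS B column

The coefficients should match the SPSS B column closely; small differences reflect the
two programs’ fitting tolerances.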
Making the Regression Model
Based on the regression results, you are now ready to create the regression model in
ArcGIS. Your goal is to apply the model created by your regression to the tracts feature
class. This means a feature class with both unweighted and weighted variables, an
attribute containing the predicted values of the model, and an attribute where those values
have been scaled between zero and one. Feel free to arrive at this goal on your own
terms, or follow the steps below and arrive at it on my terms.
One thing that I want you to be sure to do – we are testing the public strategy of the
payday loan industry, so even if you were able to improve on the model using stepwise
regressions, please make the model using all six of their variables.
1. Create weighted variables in the six weight fields by multiplying the unweighted
fields by their corresponding regression coefficients. These coefficients are found
in the results of the SPSS logistic regression. Mine looks like the following, but
yours will look different because you used your own unique random sample
points. In SPSS, the coefficients are called beta coefficients, and listed in the
column headed B. Don’t forget to look in Excel for the complete coefficient
value, not the truncated value found in the SPSS table.
Variables in the Equation

                    B      S.E.     Wald    df   Sig.   Exp(B)
Step 1a
  perc_hh_in     .051     .021    6.008     1   .014    1.052
  perc_hh_w_    -.064     .016   15.154     1   .000     .938
  perc_fem_h     .136     .026   28.031     1   .000    1.146
  perc_hs_di    -.583     .221    6.936     1   .008     .558
  perc_age_L     .009     .019     .231     1   .631    1.009
  perc_rente     .036     .008   18.915     1   .000    1.037
  Constant     -5.187     .735   49.845     1   .000     .006
2. Use a calculator to correct the constant. Because the study group and the control
group are different sizes, the constant must be corrected. The following equation
will correct the constant:

α′ = α + ln(n₂ / n₁)

where α is the Y-intercept in the regression, α′ is the corrected intercept, ln is the
natural logarithm, n₁ is the number of cases in the smaller sample (the study group
= 71), and n₂ is the number of cases in the larger sample (the control group =
1000). (Warren 1990)
3. To make the model, the weighted variables and the corrected constant are
summed. The equation to create the model would be as follows:
corrected constant + weighted variable 1 + weighted variable 2 + weighted
variable 3 + weighted variable 4 + weighted variable 5 + weighted variable 6.
Use the field calculator to perform this equation and place the results in the field
called model.
4. The last field you need to fill in is the prob_mod field. The values in a probability
model need to be scaled between 0 and 1. You will apply a logistic
transformation to the values in the model field by using the following equation in
the Field Calculator:
• 1 / (1 + exp(-model)) (Kvamme 1988)
• Check to make sure the transformation was successful by examining the min
and max values in the prob_mod field – they should be very close to 0 and 1.
(A scripted version of steps 2–4 is sketched after this list.)
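The following is a scripted sketch of steps 2–4, in case you prefer to fill the fields with
arcpy rather than the Field Calculator. The geodatabase path is hypothetical, the
coefficients are the ones from my table above (substitute your own), and for brevity the
sketch writes the model and prob_mod fields directly rather than filling in the individual
weight fields first.

    # Sketch of steps 2-4: corrected constant, model, and logistic transformation.
    import math
    import arcpy

    tracts = r"C:\lab_8\lab_8.gdb\tracts"          # hypothetical path

    # Step 2: correct the constant for unequal group sizes (Warren 1990).
    n1, n2 = 71.0, 1000.0                          # study group, control group
    constant = -5.187                              # the B value for Constant in SPSS
    corrected = constant + math.log(n2 / n1)       # alpha' = alpha + ln(n2/n1)

    # Coefficients (B column) from my regression; yours will differ.
    coeffs = {"perc_hh_in": 0.051, "perc_hh_w_": -0.064, "perc_fem_h": 0.136,
              "perc_hs_di": -0.583, "perc_age_L": 0.009, "perc_rente": 0.036}

    fields = list(coeffs) + ["model", "prob_mod"]
    with arcpy.da.UpdateCursor(tracts, fields) as cursor:
        for row in cursor:
            if any(row[i] is None for i in range(len(coeffs))):
                continue                           # skip tracts with null demographics
            # Step 3: sum of the corrected constant and the weighted variables.
            model_val = corrected + sum(row[i] * coeffs[f]
                                        for i, f in enumerate(coeffs))
            # Step 4: logistic transformation (Kvamme 1988).
            row[-2] = model_val
            row[-1] = 1.0 / (1.0 + math.exp(-model_val))
            cursor.updateRow(row)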
Model Assessment
1. In order to test the efficiency of your model, you need to determine what
percentage of the project area falls into the most likely designation, and what
percentage of the PLC stores is contained in the most likely area.
2. First create a binary model based on values in the prob_mod field. Create a field
called bin_mod. Then select all the records where values in the prob_mod field are
greater than or equal to 0.5. Calculate bin_mod = 1 for these records. All other
values in bin_mod should be calculated to 0.
3. Next determine the percent of the area predicted by the model, that is, all the
polygons with a bin_mod value of 1.
• In the table, summarize on bin_mod, and sum the shape_area field.
• Add this table to the TOC in your map document; open it and look at the
fields. Using the values in the table, calculate the percent of the total area
that the model is telling you “is a good place to locate PLC stores.”
Note: Remember that because some of the tracts are out of the model because of
null values, you cannot use the shape area in clippit to calculate percent area.
4. Next determine the percent of PLC stores
that are found in the polygons predicted to contain PLC stores. To do this, use
Identity to attach values from Tracts to the features in PLC. Call the output
PLC_bin. In this new feature class, use the values in the bin_mod field to
calculate a percentage of stores contained in polygons predicted to contain PLC
stores.
5. Finally, to determine the efficiency of your model, plug the appropriate
percentages into this equation (a scripted version of this assessment is sketched
after this list):

efficiency = 1 − (percentage of total area within most likely category /
percentage of total sites within most likely category)   (Kvamme 1988)
Include this equation and its result in your PowerPoint document (below).
Based on this equation, do you think there is a strong, moderate, or weak
relationship between these stores and the variables identified by the PDL
industry?
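If you want to script the assessment arithmetic, here is a minimal sketch of steps 3–5. It
assumes bin_mod has already been calculated in the tracts feature class, that PLC_bin was
created with Identity as described in step 4, and that the geodatabase path is hypothetical.

    # Sketch of the model-assessment arithmetic (percent area, percent stores, efficiency).
    import arcpy

    tracts = r"C:\lab_8\lab_8.gdb\tracts"          # hypothetical paths
    plc_bin = r"C:\lab_8\lab_8.gdb\PLC_bin"

    # Percent of total tract area in the most likely (bin_mod = 1) category.
    total_area = likely_area = 0.0
    with arcpy.da.SearchCursor(tracts, ["bin_mod", "shape_area"]) as cursor:
        for bin_mod, area in cursor:
            if bin_mod is None:
                continue                           # tracts dropped because of null values
            total_area += area
            if bin_mod == 1:
                likely_area += area
    pct_area = 100.0 * likely_area / total_area

    # Percent of PLC stores inside polygons predicted to contain stores.
    total_stores = likely_stores = 0
    with arcpy.da.SearchCursor(plc_bin, ["bin_mod"]) as cursor:
        for (bin_mod,) in cursor:
            total_stores += 1
            if bin_mod == 1:
                likely_stores += 1
    pct_stores = 100.0 * likely_stores / total_stores

    # Kvamme (1988) efficiency (gain) statistic.
    efficiency = 1.0 - (pct_area / pct_stores)
    print(pct_area, pct_stores, efficiency)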
Channing Alert!!
Just before you started creating a presentation to explain the results of your analysis, you
heard your email alert and found this message from Channing, your assistant.
Dear Boss,
I have been lying here for a week now. It is
incredibly boring, and I am having a very
difficult time with the whole process. Eating
through a straw, not being able to move, and I
have had a tormenting itch on my nose for the
last five days and no way to scratch it.
Thankfully, there is a computer monitor on the
ceiling, and somebody to help me use it (hi, my
name is doug, and I’m channing’s assistant. I’m
typing this email as she dictates it). I’ve been
reading PDFs on the monitor about the Payday Loan
Industry and ran across one that suggests that
their stores are located by targeting non-white
residents and high-density population. I thought
you might want to run a regression to test this
so I’ve created a table that will allow you to
calculate non-white populations in Tucson. I’ve
attached the file to this email. Let me know how
it goes.
Right now Doug, my assistant, is telling me that
the hand-truck is on the way so he can take me
for a walk. When we get back I’ll check my email
to find out how things went.
Good luck,
Your assistant, Channing
You don’t know if this is good or bad, but you do know that you will be spending more
time doing regressions. One good thing: since you have done this before, it should go
much more quickly this time.
1. Find Channing’s table and open it in ArcGIS.
2. Add a new double precision field called perc_non_white. Using the variables in
the table, calculate percent non-white for Tucson tracts (a scripted sketch of these
field calculations follows this list).
3. Use the Join Field tool to add perc_non_white to the tracts feature class.
4. In the tracts feature class, add a double precision field called pop_density. Use
the field calculator to calculate population per square kilometer.
5. Add additional double precision fields called wght_perc_nw, wght_pop_dens,
new_model, new_prob_mod; and a short integer field called new_bin_mod.
6. Use Identity to attach tract variables to the PLC feature class. Call the results
regression_data_2 and export the table so you can use it in SPSS.
7. Perform a logistic regression using just perc_non_white and pop_density. Be sure
to export the results of the regression.
8. Note the R² value, and create a predictive model using the coefficients from the
regression.
9. Using the new_prob_mod field, determine the ones and zeros for new_bin_mod.
10. Calculate percent area and percent of stores that fall in the model’s most likely
designation.
11. Calculate model efficiency.
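The field calculations in steps 2–5 can also be scripted; here is a sketch. The name of
Channing’s table and its population fields are not given above, so channing_table,
non_white_pop, and total_pop below are placeholders (total_pop stands for whatever
total-population field your tracts carry), the join key GEOID is an assumption, and
Shape_Area is assumed to be in square meters.

    # Sketch of the new fields for the second regression.
    import arcpy

    arcpy.env.workspace = r"C:\lab_8\lab_8.gdb"    # hypothetical path
    channing_table = "channing_table"              # placeholder name for her table

    # Step 2: percent non-white (field names are placeholders).
    arcpy.management.AddField(channing_table, "perc_non_white", "DOUBLE")
    arcpy.management.CalculateField(channing_table, "perc_non_white",
                                    "100.0 * !non_white_pop! / !total_pop!",
                                    "PYTHON_9.3")  # use "PYTHON3" in ArcGIS Pro

    # Step 3: join the new percentage to the tracts feature class.
    arcpy.management.JoinField("tracts", "GEOID", channing_table, "GEOID",
                               ["perc_non_white"])

    # Step 4: population per square kilometer (Shape_Area assumed in m2).
    arcpy.management.AddField("tracts", "pop_density", "DOUBLE")
    arcpy.management.CalculateField("tracts", "pop_density",
                                    "!total_pop! / (!Shape_Area! / 1000000.0)",
                                    "PYTHON_9.3")

    # Step 5: the remaining fields for the new model.
    for name in ["wght_perc_nw", "wght_pop_dens", "new_model", "new_prob_mod"]:
        arcpy.management.AddField("tracts", name, "DOUBLE")
    arcpy.management.AddField("tracts", "new_bin_mod", "SHORT")

    # Step 6: attach tract variables to the PLC points for SPSS.
    arcpy.analysis.Identity("PLC", "tracts", "regression_data_2")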
Data Analysis/Story Telling
1. What do you think this regression means in terms of our question, “Are PLC
stores targeting particular communities?”
2. Make a PowerPoint that tells your story of PLC stores in Tucson and the payday
loan industry.
Deliverables
Just after you finished your presentation, you were cc’d on an email that Channing’s
assistant, Doug, sent to Channing. In it he apologized for not being able to help her use
the ceiling monitor to read today. He is not feeling well. Doug sent this video by way of
explanation.
https://www.youtube.com/watch?v=X8Rh3gQEtLE
Based on this video, you are struck with how perfect they are for each other, and feel
certain that a wedding is in their future.
This lab produces several deliverables:
• A GDB called lab_8.gdb – containing one feature dataset with all the feature
classes you used and created in this lab.
• Lab_8.mxd that includes the feature classes and tables necessary to answer the
question.
• A PowerPoint presentation that tells your story about how the Payday Loan
Industry does or does not target specific communities.
References Cited
Kvamme, Kenneth L.
1988 Development and Testing of Quantitative Models. In Quantifying the Present and
Predicting the Past: Theory, Method, and Application of Archaeological
Predictive Modeling. W.J. Judge and L. Sebastian, eds. Pp. 325-428. Washington,
D. C.: U.S. Government Printing Office.
Warren, Robert E.
1990 Predictive Modeling of Archaeological Site Location: A Case Study in the
Midwest. In Interpreting Space: GIS and Archaeology. K.M.S. Allen, S.W.
Green, and E.B.W. Zubrow, eds. Pp. 201-215. London: Taylor and Francis.
Gary L. Christopherson – Revised 10/28/2014