Uploaded by Chris Cummings

(Graphs)Lab 2 Relationships between two variables

Lab 2 Relationships Between Two Variables
Instructions: Read through and answer or implement the instructions given below. You will
submit your answers in a lab report through Canvas. For your report, please answer the
questions in narrative form where possible and using screenshots where needed. For instance,
any graph needs a screenshot. When in doubt, give a screenshot. The lab report is best
submitted in Word® or .pdf® format in Canvas (e.g. Google Docs and Apple Numbers are not
permitted).
Goals: This lab focuses on how to summarize relationships between two variables. As with
most statistical analyses, one should begin by identifying the types of variables involved.
Recall the two major types of variables: Categorical (C) and Quantitative (Q). When looking at
relationships between two variables we have the following types of relationships each with their
own statistical summaries:
1. C-C: two-way tables and grouped bar graphs.
2. C-Q: side-by-side box plots and either of the following numerical summaries: mean and
standard deviation, or median and IQR for each of the categorical groups (these
numerical summaries will not be covered here).
3. Q-Q: scatterplots, correlation coefficient r (here we will only compute Pearson’s
correlation), and the least squares regression line.
The instructions below will teach you how to implement each of these and then give you practice
applying them to a novel dataset.
About the Data: The data used throughout the lab are from the U.S. Department of Agriculture
on access to food, or food deserts, in Arizona. The complete data can be found here:
https://www.ers.usda.gov/data-products/food-access-research-atlas/
Below are some of the variables we will be considering:
CensusTract = Census Tract ID
State = State
County = County
Urban = 1 for Urban, 0 for other
POP2010 = Population in the tract
HUNVFlag = Flag for tract where at least 100 households do not have a vehicle and are
beyond 1/2 mile from a supermarket
PovertyRate = Poverty rate
LA1and10 = Flag for low access tract at 1 mile for urban areas or 10 miles for rural areas
MedianFamilyIncome = Median Family Income
lahunvhalfshare = Share of tract housing units that are without vehicle and beyond 1/2
mile from supermarket
TractLOWI = Total count of low-income population in tract
TractKids = Total count of children age 0-17 in tract
laseniorshalfshare = Share of tract population that are seniors beyond 1/2 mile from
supermarket
LILATracts_1And10 = Flag for food desert when considering low accessibility at 1 and
10 miles
Guided Instructions
1. Loading data
We begin this lab by first loading our data into Excel® for analysis.
Start Excel® and open “FoodAccess2015.xlsx” as you did in Lab 1.
2. Summarizing C-C relationships with two-way tables and barplots
Suppose that we wish to determine whether urban or other (non-urban) communities have better
access to food. For this we will consider the two variables Urban and HUNVFlag.
This will utilize the table command much like we did in the previous lab. Prior to doing
this, it is important to identify the independent variable, and the dependent variable. In
our case it would make sense that whether a region is Urban or not will influence
vehicular access to food, HUNVFlag. So Urban will be our independent variable and
HUNVFlag the dependent. Choose a cell to the right of the data and put in labels for
the HUNVFlag categories in horizontal cells and for the Urban categories in vertical cells as follows:
If you want the box around the table as shown (this is not needed), then look near the font
choices on the Home tab for the Borders button:
To count how often an Other tract is Accessible, we use the following command:
=COUNTIFS(D:D,R3,F:F,S2)
In this command, the D:D specifies which column to look at first and the R3 tells us to
only count the Other tracts. The F:F specifies the HUNVFlag column and S2 is for
Accessible. Adjust the R3 and the S2 as needed for your table.
Pro Tip: If the counts are not what you expected, double check the spelling of all of the
category names (Other, Accessible, etc.). If they are not exactly as they are in the dataset
itself, they will not be counted. The easiest way is to copy and paste each one.
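If you would like to cross-check a count outside Excel, the logic of the COUNTIFS command can be sketched in a few lines of Python. This is only an illustration, not part of the lab: the rows below are invented, the function name countifs is our own, and the label "Not Accessible" is an assumed name for the second HUNVFlag category.

```python
# Hypothetical miniature of the data: (Urban, HUNVFlag) label pairs.
# These rows are invented for illustration; they are not USDA values.
rows = [
    ("Urban", "Accessible"),
    ("Urban", "Accessible"),
    ("Urban", "Not Accessible"),
    ("Other", "Accessible"),
    ("Other", "Not Accessible"),
    ("Other", "Not Accessible"),
]


def countifs(rows, urban_label, flag_label):
    """Count rows where both labels match, like =COUNTIFS(D:D,R3,F:F,S2)."""
    return sum(
        1
        for urban, flag in rows
        if urban == urban_label and flag == flag_label
    )


# Number of Other tracts that are Accessible in the invented rows.
print(countifs(rows, "Other", "Accessible"))

# A misspelled category name matches nothing, so the count is 0.
print(countifs(rows, "Other", "Accesible"))
```

As in the spreadsheet, a label that does not match the data exactly simply produces a count of zero rather than an error.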
To complete the table, repeat this process for each of the other cells in your table. Note
that there are ways to drag the formulas, but for simplicity we are not going to give those
instructions here. Here are the formulas for the screenshot above:
Next, we are going to create a new table with the percents instead of the counts. To do
this we first need the total for each row. In the row of Other counts, click the cell
immediately to the right of your table. A single click of the summation (AutoSum) button
followed by the Enter key computes the total for that row, as shown in the screenshot
above. Repeat this for the remaining row.
Choose a cell a row or two below your table of counts and put in labels for the HUNVFlag
categories in horizontal cells and for the Urban categories in vertical cells as you did for the counts above. The table
should have the same headings as the one above, but we will replace the counts with their
corresponding percents.
In the upper-left data cell of this new table, type an “=” and then click on the upper-left
count in the table of counts, type “/”, and then click the total for that row.
Repeat this process to get the following table of percents:
Here are the formulas for the percents table in the above screenshot:
Highlight the table of percents and then insert a Clustered Column graph.
Make the chart easy to read by adding a chart title and a vertical axis label. You can also
change the colors of the bars to suit your taste.
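The percents table divides each count by its row total. A minimal Python sketch of that calculation follows, assuming made-up counts and the assumed category label "Not Accessible"; the function name row_percents is hypothetical and the numbers are not from the dataset.

```python
# Invented two-way table of counts: rows are Urban categories,
# columns are HUNVFlag categories. Not real USDA counts.
counts = {
    "Other": {"Accessible": 30, "Not Accessible": 10},
    "Urban": {"Accessible": 48, "Not Accessible": 12},
}


def row_percents(counts):
    """Divide each count by the total of its own row."""
    percents = {}
    for row_label, row in counts.items():
        total = sum(row.values())  # same role as the AutoSum cell
        percents[row_label] = {col: n / total for col, n in row.items()}
    return percents


percents = row_percents(counts)
for row_label, row in percents.items():
    print(row_label, row)
```

Because each row is divided by its own total, the values in every row add up to 1, which makes the two settings comparable even when they contain different numbers of tracts.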
3. Summarizing C-Q relationships with side-by-side boxplots
We have already constructed boxplots in the previous lab (see Boxplots.xlsx). Now we
need to modify the instructions to indicate we want to break the data into groups based on
the categorical variable. So suppose that we are interested in whether seniors have better
access to food in urban settings than in other settings. We can consider the variables
Urban and laseniorshalfshare, which is the fraction of the population that are seniors
more than ½ mile from a supermarket.
To do this we are going to put information in two new columns in Excel. If you did not
leave columns O and P blank, then insert two new columns immediately to the right of
column N. Put the labels Senior-Other and Senior-Urban in the first row of columns O
and P.
The command to transfer the fraction of seniors only for Other tracts is
=IF(D2="Other",M2,"")
Type it in cell O2. The D2 is the cell in this tract’s row that gives the setting and the M2
is the fraction of seniors without easy access to a supermarket.
While you could type a formula in every single cell, it would take you quite a while.
There is a faster alternative. If you will put your mouse directly over the lower right
corner of the O2 cell (the cursor switches to a thin plus), you can then double click it and
fill that formula down to the end of the data. You can also simply click, hold, and pull it
down to the end of the list. Repeat this in column P for the Urban tracts, replacing
"Other" with "Urban" in the formula.
Make a chart as shown and then copy and paste it into the Double Boxplot tab.
You should obtain the boxplot below (with title and axis labels).
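The IF formulas above simply split one quantitative column into two groups based on the categorical column. A short Python sketch of the same idea is given here, using invented (setting, share) pairs rather than real tract values; split_by_setting is a hypothetical helper name.

```python
from statistics import median

# Invented (setting, laseniorshalfshare) pairs; not real tract values.
tracts = [
    ("Other", 0.10),
    ("Urban", 0.02),
    ("Other", 0.30),
    ("Urban", 0.06),
    ("Urban", 0.04),
    ("Other", 0.20),
]


def split_by_setting(tracts, label):
    """Keep only the shares whose setting equals label,
    like filling a column with =IF(D2=label, M2, "")."""
    return [share for setting, share in tracts if setting == label]


senior_other = split_by_setting(tracts, "Other")
senior_urban = split_by_setting(tracts, "Urban")

# The median is the center line drawn in each box of the boxplot.
print(median(senior_other), median(senior_urban))
```

Each list plays the role of one of the new columns (Senior-Other and Senior-Urban), and each would become one box in the side-by-side boxplot.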
4. Summarizing Q-Q relationships (scatterplots, r and regression lines)
Suppose we wanted a scatterplot of the number of children in a tract and the median family
income in that tract. Click on the column letter above the TractKids data as your x
variable. Hold the control key down and then click on the column letter above the
MedianFamilyIncome data as your y variable.
Insert a scatterplot of these two columns:
If you click on the graph, then the tool bar adds two tabs. Click the Design tab and then
Add Chart Element to select additional items. You can also double click on the Title at
the top of the graph and fix it too.
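Find the correlation between the number of children and the median family income by
using the CORREL command on those two columns, for example:
=CORREL(G:G,I:I)
Adjust the column letters so that they point to the TractKids and MedianFamilyIncome
columns in your worksheet. This is Pearson’s correlation coefficient r for these two
columns. Is this r strong enough that we would think it a good idea to draw a regression line?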
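For reference, the standard formula for Pearson’s r can be written out directly. The Python sketch below is a hand-rolled illustration of that formula, not a description of how Excel computes CORREL internally; the kids and income values are invented and pearson_r is a hypothetical function name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation: the sum of products of deviations divided by
    the square root of the product of the sums of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented values: income tends to fall as the number of kids rises.
kids = [100, 200, 300, 400]
income = [70000, 65000, 64000, 58000]

r = pearson_r(kids, income)
print(round(r, 3))
```

A value of r near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.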
Finally, we will construct the regression line for this data and add it to the plot (whether
or not this would have been a good idea). Right click any single data point on the graph
and you will be given the option to add a Trendline:
Be sure to select “Display Equation” on the Format Trendline menu that pops up.
If you right click the new trendline, you can change the color to make it stand out. If
you click and hold the equation, you can move it to somewhere on the graph that can
be read.
Note that the second value in the equation (62490) represents the y-intercept, labeled a
in the regression formula, and the other numerical value (-1.687), beside the x, is the slope
of the regression line, labeled b in the regression equation. In this case the output tells us
that the equation of our regression line is
MedianFamilyIncome = 62490.436 - 1.687 * Kids
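To see where a and b come from and how the line is used, here is a small Python sketch. It assumes the usual least squares formulas (slope = sum of products of deviations over the sum of squared x deviations, intercept = mean of y minus slope times mean of x); the function names are hypothetical, and the default a and b in predict_income are the values quoted above.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # y-intercept
    return a, b


def predict_income(kids, a=62490.436, b=-1.687):
    """Apply the regression line from the lab: income = a + b * kids."""
    return a + b * kids


# Tiny invented data set that lies exactly on the line y = 1 + 2x.
a, b = least_squares([0, 1, 2], [1, 3, 5])
print(a, b)

# Predicted median family income for a tract with 1000 children.
print(round(predict_income(1000), 3))
```

The negative slope means that each additional child in a tract is associated with a predicted median family income that is about 1.687 lower.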
Application Questions
Using the food desert data, answer each of the following questions
using the appropriate graphical AND numerical summary (no numerical summary required for
C-Q relationships). Submit your answers to each question including the summary or summaries,
the commands you used and 1-2 sentences interpreting the results.
A1. What is the relationship between PovertyRate and TractKids?
A2. What is the relationship between food deserts (LILATracts_1And10) and PovertyRate?
A3. What is the relationship between vehicle access (HUNVFlag) and the location of food
deserts (LILATracts_1And10)?
A4. What is the relationship between whether a tract is classified as a food desert
(LILATracts_1And10) and the number of kids (TractKids)?
Collaboration Questions
A5. As part of your course work you should have begun working on the Relationships Between Two
Variables Handout looking at the behavior of correlation coefficients and regression lines as the
points have various attributes. If you have not yet done so, finish the Handout. Get in touch
with two other people in the course, list their names here, and come up with a description of (1)
the effects outliers have on the correlation coefficient and (2) the reasons a correlation coefficient
may be close to 0.