Lab 2 Relationships Between Two Variables

Instructions: Read through and answer or implement the instructions given below. You will submit your answers in a lab report through Canvas. For your report, please answer the questions in narrative form where possible, using screenshots where needed. For instance, any graph needs a screenshot. When in doubt, give a screenshot. The lab report is best submitted in Word® or .pdf format in Canvas (e.g. Google Docs and Apple Numbers are not permitted).

Goals: This lab focuses on how to summarize relationships between two variables. As with most statistical analyses, one should begin by identifying the types of variables involved. Recall the two major types of variables: Categorical (C) and Quantitative (Q). When looking at relationships between two variables, we have the following types of relationships, each with its own statistical summaries:

1. C-C: two-way tables and grouped bar graphs.
2. C-Q: side-by-side box plots and either of the following numerical summaries: mean and standard deviation, or median and IQR for each of the categorical groups (these numerical summaries will not be covered here).
3. Q-Q: scatterplots, correlation coefficient r (here we will only compute Pearson’s correlation), and the least squares regression line.

The instructions below will teach you how to implement each of these and then give you practice applying them to a novel dataset.

About the Data: The data used throughout the lab is from the U.S. Department of Agriculture on access to food, or food deserts, in Arizona.
The complete data can be found here: https://www.ers.usda.gov/data-products/food-access-research-atlas/

Below are some of the variables we will be considering:

CensusTract = Census Tract ID
State = State
County = County
Urban = 1 for Urban, 0 for other
POP2010 = Population in the tract
HUNVFlag = Flag for tract where >= 100 of households do not have a vehicle, and beyond 1/2 mile from supermarket
PovertyRate = Poverty rate
LA1and10 = Flag for low access tract at 1 mile for urban areas or 10 miles for rural areas
MedianFamilyIncome = Median Family Income
lahunvhalfshare = Share of tract housing units that are without vehicle and beyond 1/2 mile from supermarket
TractLOWI = Total count of low-income population in tract
TractKids = Total count of children age 0-17 in tract
laseniorshalfshare = Share of tract population that are seniors beyond 1/2 mile from supermarket
LILATracts_1And10 = Flag for food desert when considering low accessibility at 1 and 10 miles

Guided Instructions

1. Loading data

We begin this lab by first loading our data into Excel® for analysis. Start Excel® and open “FoodAccess2015.xlsx” as you did in Lab 1.

2. Summarizing C-C relationships with two-way tables and barplots

Suppose that we wish to determine whether urban or other (non-urban) communities have better access to food. For this we will consider the two variables Urban and HUNVFlag. This will utilize the table command much like we did in the previous lab. Prior to doing this, it is important to identify the independent variable and the dependent variable. In our case it would make sense that whether a region is Urban or not will influence vehicular access to food, HUNVFlag. So Urban will be our independent variable and HUNVFlag the dependent.
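A two-way table for Urban and HUNVFlag simply counts how many tracts fall in each combination of categories. As a cross-check outside Excel, the counting can be sketched in Python. This is a minimal sketch, not part of the required workflow: the six rows are made up for illustration (they are not taken from FoodAccess2015.xlsx), and "Not Accessible" is an assumed label for the second HUNVFlag category.

```python
from collections import Counter

# Made-up (Urban, HUNVFlag) pairs for illustration only; not real tract data.
# "Not Accessible" is a hypothetical label for the second HUNVFlag category.
rows = [
    ("Urban", "Accessible"),
    ("Urban", "Not Accessible"),
    ("Urban", "Accessible"),
    ("Other", "Not Accessible"),
    ("Other", "Accessible"),
    ("Other", "Not Accessible"),
]


def two_way_counts(pairs):
    """Count how many rows fall in each (Urban, HUNVFlag) cell."""
    return Counter(pairs)


counts = two_way_counts(rows)

# Print one line per cell, with the independent variable (Urban) as the row.
for urban in ("Other", "Urban"):
    for flag in ("Accessible", "Not Accessible"):
        print(urban, flag, counts[(urban, flag)])
```

Each printed count corresponds to one cell of the two-way table, with the independent variable (Urban) defining the rows.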
Choose a cell to the right of the data and put in labels for the HUNVFlag categories in horizontal cells and for the Urban categories in vertical cells as follows:

If you want the box around the table as shown (this is not needed), then look near the font choices on the Home tab for a box:

To count how often an Other tract is Accessible, we use the following command:

=COUNTIFS(D:D,R3,F:F,S2)

In this command, D:D specifies which column to look at first (Urban) and R3 tells us to only count the Other tracts. F:F specifies the HUNVFlag column and S2 is for Accessible. Adjust the R3 and the S2 as needed for your table.

Pro Tip: If the counts are not what you expected, double check the spelling of all of the category names (Other, Accessible, etc.). If they are not exactly as they are in the dataset itself, they will not be counted. The easiest way is to copy and paste each one.

To complete the table, repeat this process for each of the other cells in your table. Note that there are ways to drag the formulas, but for simplicity we are not going to give those instructions here. Here are the formulas for the screenshot above:

Next, we are going to create a new table with the percents instead of the counts. In the row of Other counts, click the cell immediately to the right of your table. A single click of the summation button followed by the Enter key computes the total for that row, as shown in the screenshot above. Choose a cell a row or two below your table of counts and put in the same labels as you did for the counts above (HUNVFlag categories in horizontal cells, Urban categories in vertical cells). The table should have the same headings as the one above, but we will replace the counts with their corresponding percents. In the upper left corner of this new table, type an “=” and then click on the upper left corner of the table of counts, press the “/” key, and then click the sum for that row.
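The division in the last step turns each count into a share of its row total. The arithmetic can be sketched in Python as follows. This is only an illustration under stated assumptions: the counts are made up (not from the dataset), "Not Accessible" is a hypothetical category label, the function name row_percents is hypothetical, and the result is multiplied by 100 here, whereas in Excel the same effect comes from formatting the cell as a percent.

```python
def row_percents(table):
    """Divide each count by its row total so each row sums to 100 percent."""
    result = {}
    for row_label, cells in table.items():
        total = sum(cells.values())  # row total, like the summation button
        result[row_label] = {col: 100 * n / total for col, n in cells.items()}
    return result


# Made-up counts for illustration only; not real tract data.
counts = {
    "Other": {"Accessible": 30, "Not Accessible": 10},
    "Urban": {"Accessible": 45, "Not Accessible": 5},
}

percents = row_percents(counts)
for row_label, cells in percents.items():
    print(row_label, cells)
```

Dividing by the row total (rather than the column or grand total) matches the choice of Urban as the independent variable: each row then shows how access is distributed within one type of tract.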
Repeat this division for each remaining cell to get the following table of percents:

Here are the formulas for the percents table in the above screenshot:

Highlight the table of percents and then insert a Clustered Column graph. Make the chart easy to read by adding a chart title and a vertical axis label. You can also change the colors of the bars to suit your taste.

3. Summarizing C-Q relationships with side-by-side boxplots

We have already constructed boxplots in the previous lab (see Boxplots.xlsx); now we need to modify the instructions to break the data into groups based on the categorical variable. Suppose that we are interested in whether seniors have better access to food in urban settings than in other settings. We can consider the variables Urban and laseniorshalfshare, which is the share of the tract population that are seniors beyond 1/2 mile from a supermarket.

To do this we are going to put information in two new columns in Excel. If you did not leave columns O and P blank, then insert two new columns immediately to the right of column N. Put the labels Senior-Other and Senior-Urban in the first row of columns O and P. The command to transfer the share of seniors for only the Other tracts is

=IF(D2="Other",M2,"")

D2 is the cell in this tract’s row that gives the setting and M2 is the share of seniors without easy access to a supermarket. While you could type a formula in every single cell, it would take you quite a while. There is a faster alternative. If you put your mouse directly over the lower right corner of the O2 cell (the cursor switches to a thin plus), you can then double click it to fill that formula down to the end of the data. You can also simply click, hold, and pull it down to the end of the list. Repeat this for the Urban tracts in column P, replacing "Other" with "Urban" in the formula.

Make a chart as shown and then copy and paste it into the Double Boxplot tab. You should obtain the boxplot below (with title and axis labels).

4. Summarizing Q-Q relationships (scatterplots, r and regression lines)

Suppose we wanted a scatterplot of the number of children in a tract and the median family income in that tract. Click on the column letter above the TractKids data as your x variable (column L if your sheet follows the variable order listed above). Hold the Control key down and then click on the column letter above the MedianFamilyIncome data as your y variable (column I in that same order). Insert a scatterplot of these two columns:

If you click on the graph, then the tool bar adds two tabs. Click the Design tab and then Add Chart Element to select additional items. You can also double click on the Title at the top of the graph and fix it too.

Find the correlation between the number of children and the median family income by using the command

=CORREL(L:L,I:I)

Here L and I are the columns holding TractKids and MedianFamilyIncome when the sheet follows the variable order listed above; adjust the letters if your columns differ. This is Pearson’s correlation coefficient r for these two columns. Is this r strong enough that we would think it a good idea to draw a regression line?

Finally, we will construct the regression line for this data and add it to the plot (whether or not this would have been a good idea). Right click any single data point on the graph and you will be given the option to add a Trendline:

Be sure to select “Display Equation” on the Format Trendline menu that pops up. If you right click the new trendline, you can change the color to make it stand out. If you click and hold the equation, you can move it to somewhere on the graph where it can be read. Note that the second value in the equation (62490) represents the y-intercept, labeled a in the regression formula, and the other numerical value (-1.687), beside the x, is the slope of the regression line, labeled b in the regression equation. In this case the output tells us that the equation of our regression line is

MedianFamilyIncome = 62490.436 - 1.687 * TractKids

Application Questions

Using the food desert data, answer each of the following questions using the appropriate graphical AND numerical summary (no numerical summary required for C-Q relationships).
Submit your answers to each question including the summary or summaries, the commands you used, and 1-2 sentences interpreting the results.

A1. What is the relationship between PovertyRate and TractKids?
A2. What is the relationship between food deserts (LILATracts_1And10) and PovertyRate?
A3. What is the relationship between vehicle access (HUNVFlag) and the location of food deserts (LILATracts_1And10)?
A4. What is the relationship between whether a tract is classified as a food desert (LILATracts_1And10) and the number of kids (TractKids)?

Collaboration Questions

A5. As part of your course work you should have begun working on the Relationships Between Two Variables Handout, which looks at the behavior of correlation coefficients and regression lines as the points take on various attributes. If you have not yet done so, finish the Handout. Get in touch with two other people in the course, list their names here, and come up with a description of (1) the effects outliers have on the correlation coefficient and (2) the reasons a correlation coefficient may be close to 0.
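Note on checking Q-Q results outside Excel: Pearson’s r and the least squares line y = a + b*x from part 4 of the Guided Instructions can be computed directly from their definitions. The Python sketch below is a minimal illustration under stated assumptions: the five data points are made up (they are not tract data), and the function names pearson_r and least_squares_line are hypothetical helpers, not part of Excel or of the lab files.

```python
from math import sqrt


def _sums(xs, ys):
    """Return the sums of squared and cross deviations plus both means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy, mean_x, mean_y


def pearson_r(xs, ys):
    """Pearson's correlation coefficient r for paired data."""
    sxy, sxx, syy, _, _ = _sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    sxy, sxx, _, mean_x, mean_y = _sums(xs, ys)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # y-intercept
    return a, b


# Made-up paired data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 7, 3, 2]

r = pearson_r(xs, ys)
a, b = least_squares_line(xs, ys)
print("r =", round(r, 4))
print("a =", round(a, 4), "b =", round(b, 4))
```

The intercept a and slope b here play the same roles as the two numbers in the trendline equation that Excel displays.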