Worksheet 4: Histograms, Box Plot, and Scatter Plot

Learning Objectives:
open and close a project
load a shape file
select functions from the menu or toolbar
create a histogram for a variable
change the number of categories depicted in the histogram
create a regional histogram
create a box plot for a variable

Activities:
This exercise introduces GeoDa, a tool for exploratory data analysis. The open source version of this tool is available from VLABS, under \\UP.IST.LOCAL\VA\data\GeoDa\OpenGeoDa.exe

Step 1. Getting to know menus and toolbars.
Start OpenGeoDa from: \\UP.IST.LOCAL\VA\data\GeoDa\OpenGeoDa.exe
Go to the File menu and choose “Open Shape File.” Navigate to the directory \\UP.IST.LOCAL\VA\data\GeoDa\Data\ and open the St. Louis homicide sample data set for 78 counties surrounding the St. Louis metropolitan area (stl_hom.shp). Explore the menus and toolbars for a moment.

Step 2. Creating quantile maps and viewing tables.
Close all windows of OpenGeoDa, and then open the shape file SIDs2.shp. The SIDS data set contains variables for the count of SIDS deaths for 100 North Carolina counties in two time periods, here labeled SID74 and SID79. In addition, there are the counts of births in each county (BIR74, BIR79) and a subset of these, the counts of non-white births (NWBIR74, NWBIR79).
Construct two quantile maps to compare the spatial distribution of non-white births and SIDS deaths in 1974 (NWBIR74 and SID74). Click on the base map to make it active (in GeoDa, the last clicked window is active). In the Map menu, select Quantile. A dialog will appear, allowing the selection of the variable to be mapped. In the Variables Settings dialog, select NWBIR74 and keep the number of classes at the default of 4.
[Question] What does the number on the right of the legend (in parentheses) mean? Why are they all 25?
Use the cursor to drag a box to select a few counties. Then view the selected counties in the attribute table by choosing Table > Move Selected to Top from the menu.
Try different selections from either the table view or the map view. Observe how selections in the two views are linked. Next, apply a range selection (from the Table menu), so that BIR74 is in the range <0, 500>. Show the resulting table and map.

Step 3. Histogram
With the map view open, invoke the histogram with Explore > Histogram from the menu. In the variable settings dialog, choose “NWBIR74”. The result is a histogram with the values of the variable classified into 7 categories. You may change the number of categories by going to the “Options” menu and choosing “Intervals”. Show the histogram with the number of intervals set to 9.

Step 4. Box plot
Clear all windows and start a new project using the stl_hom.shp homicide sample data set. Invoke the box plot by selecting Explore > Box Plot from the menu, or by clicking on the Box Plot toolbar icon. Next, choose the variable HR8893 (homicide rate over the period 1988–93) in the dialog. Click on OK to create the box plot.
Specific observations in the box plot can be selected in the usual fashion, by clicking on them or by click-dragging a selection rectangle. The selection is immediately reflected in all other open windows through the linking mechanism. With the table and base map open for the St. Louis data, select the outlier observations in the box plot by dragging a selection rectangle around them. Show where the outliers are on the map view.

Step 5. Scatter plot
Invoke the scatter plot functionality from the menu, with Explore > Scatter Plot. In the dialog, select HR7984 (the county homicide rate in the period 1979–84) in the left column as the y variable and RDAC80 (a resource deprivation index constructed from census variables) in the right column as the x variable. Click on OK to bring up the basic scatter plot.
The scatter plot in GeoDa has two useful options. They are invoked from the Options menu or by right-clicking in the graph.
With the scatter plot open, bring up the Options menu and choose Scatter plot > Standardized data. This converts the scatter plot to a correlation plot, in which the regression slope equals the correlation between the two variables (as opposed to a bivariate regression slope in the default case). The variables on both axes are rescaled to standard deviational units, so any observation more than 2 units from the origin on either axis can be informally designated as an outlier.
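
The short Python sketch below illustrates why the slope in the standardized plot equals the correlation. It is not GeoDa code, and the numbers are made-up illustrative values rather than the HR7984 and RDAC80 columns. The helper names (standardize, ols_slope, pearson_r) are hypothetical, and the population standard deviation is an assumption made here; GeoDa may rescale slightly differently.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def ols_slope(x, y):
    """Slope of the least-squares regression line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Made-up values for illustration only (not the St. Louis data).
x = [-1.2, -0.8, -0.5, -0.2, 0.0, 0.3, 0.4, 0.9, 1.5, 5.0]
y = [2.0, 2.5, 3.5, 3.0, 4.0, 4.5, 5.0, 6.5, 6.0, 15.0]

zx, zy = standardize(x), standardize(y)

raw_slope = ols_slope(x, y)    # depends on the units of x and y
std_slope = ols_slope(zx, zy)  # slope after standardizing both axes
r = pearson_r(x, y)

# Informal outlier rule: more than 2 standard deviational units on the x axis.
flagged = [i for i, z in enumerate(zx) if abs(z) > 2]

print("raw slope:", raw_slope)
print("standardized slope:", std_slope)
print("correlation:", r)
print("observations beyond 2 units on x:", flagged)
```

The raw slope changes with the measurement units of the two variables, while the standardized slope matches the correlation coefficient. The same 2-unit rule can be applied to the y axis by checking zy.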