Introduction to Geographic Information Systems
Fall 2013 (INF 385T-28620)
Dr. David Arctur, Research Fellow, Adjunct Faculty
University of Texas at Austin

Lecture 8, October 17, 2013
Geocoding

Outline
  Geocoding overview
  Polygon geocoding
  Linear (street) geocoding
  Problems and solutions
  Geocoding layer sources
  Geocoding in ArcGIS

INF385T(28620) – Fall 2013 – Lecture 8

Overview
  Process of creating geometric representations for locations (such as points) from descriptions of locations (such as street addresses)
  Uses a computer program called a geocoding engine that employs code tables and rules to standardize address components

Examples
  City’s economic development department
    Maps technology businesses by street address to determine technology-rich areas in a city
  Hospital
    Maps patients to determine where to open a satellite clinic
  Emergency dispatch
    Maps callers’ addresses to determine who should respond to an emergency
  Retail store chain
    Maps store and customer locations, and compares to mapped competitor locations
  Others?
Tabular data
  Text file or database
  Street addresses
  ZIP Codes

Geocoding reference layers
  Street centerlines
  ZIP Code polygons

POLYGON GEOCODING

ZIP Code geocoding
  Method to map data whose geocode is for a polygon
  Assign each record to its polygon
  Count the records for each polygon
  Join the table to the corresponding polygon layer
  Symbolize using a choropleth map or graduated point symbols

ZIP Code geocoding
  Points created at ZIP Code centroids
  Points (attendees) spatially joined to ZIP Code polygons
  Choropleth map created

LINEAR (STREET) GEOCODING

Linear geocoding (streets)
  TIGER (Census Bureau) street maps
  Four street address numbers, low to high for each side of a street segment
  Figure: Oak Street segment with address range 100-198 on one side and 101-199 on the other

Address components
  Example address: 125 Oak St E, Apt. 2, Pittsburgh, PA 15213

  Component            Value in example
  Number               125
  Street name          Oak
  Street type          St
  Direction, suffix    E (as in 125 Oak St E)
  Direction, prefix    E (as in 125 E Oak St)
  Unit number          Apt. 2
  Zone, city           Pittsburgh
  Zone, ZIP Code       15213

  Items for single-number street address:

  Address         Unit     City         ZIP Code
  125 Oak St E    Apt. 2   Pittsburgh   15213

Street intersections
  Put intersections in address field
    Forbes AV & Craig ST
    Grant ST & 5th AV E
    North Star RD & Duncan AV
  Do not include street numbers
    Incorrect: 3999 Forbes Ave & 100 Craig ST
  Connectors
    Any unusual character (e.g., &, @, |)
    Just be consistent

Geocoding flowchart
  Figure: flowchart with the steps Input address, Parse address, Generate Soundex key, Find candidates (range and Soundex key), Matches? (no: output "No match"), Score matches, Best match >= 90? (yes: output address)

Geocoding steps
  Original address: 125 East Oak Street 15213
  Address parsed: |125|East|Oak|Street|15213
  Abbreviations standardized: |125|E|Oak|St|15213
  Elements assigned to match keys: [HN]:125 [SN]:Oak [ST]:St [SD]:E [ZP]:15213
  Index values calculated: [HN]:125 [SN]:Oak (Soundex #) [ST]:St [SD]:E [ZP]:15213 (Index #)

Soundex index
  Matches names based on how they sound (if indices match)
  Translates names to a 4-character index of 1 letter and 3 digits
  First character of name remains unchanged
  Adjacent letters in the name which have the same Soundex key are assigned a single digit
  If the end of the name is reached before filling 3 digits, use zeros to complete the code

  Key          Letters
  1            b f p v
  2            c g j k q s x z
  3            d t
  4            l
  5            m n
  6            r
  disregard    a e h i o u y w

  Oake = O-200, Oak = O-200
  Smith = S-530, Smythe = S-530
  Paine = P-500, Payne = P-500
  Callahan = C-450, Calahan = C-450
  Beadles = B-342, Beattles = B-342
  Schultz = S-243, Shults = S-432

  http://www.sconsig.com/sastips/soundex-01.htm
  http://www.archives.gov/research/census/soundex.html

Scoring candidates
  Use a rule base to score source and reference matches
  Start with score of 100
  Subtract points for each mismatch
  Examples from rule base
    Soundex indices match but street names do not (-2)
    Street type missing in source (-1)
    Street types do not match (-2)

Candidate streets
  Candidates identified: 125 East Oak Street 15213

  From   To    Street   Type   Side   Parity   Direction   Street_
  2      98    Oak      St     R      E        W           4344
  1      99    Oak      St     L      O        W           4345
  100    198   Oak      St     R      E        E           4346
  101    199   Oak      St     L      O        E           4357

  Candidates scored and filtered:

  From   To    Street   Type   Side   Parity   Direction   Street_
  100    198   Oak      St     R      E        E           4346
  101    199   Oak      St     L      O        E           4357

Address matched as point
  Best candidate matched

  From   To    Street   Type   Side   Parity   Direction   Street_
  101    199   Oak      St     L      O        E           4357

  Figure: map of Oak St and Pine Ave showing segment ranges 2-98, 1-99, 100-198, and 101-199, with the point for 125 placed on the 101-199 segment

PROBLEMS AND SOLUTIONS

Possible problems
  Variations in street names
    Fifth Avenue, Fifth Ave., 5th AV
    Saw Mill Run Blvd, Route 51
  Data entry errors
    Fidth Avenue
    Sawmill Run
  Place names
    White House, Heinz Field, Empire State Building
  Intersections
    Fifth Avenue and Craig Street
  Zones
    100 Main ST 15101, 100 Main ST 16202
  P.O. boxes
    P.O. Box 125
  Missing street data

Solutions
  Clean data before geocoding
  Purchase or build high-quality maps (field verification)
  Use postal address standards
  Assign house numbers in rural areas
  Use alias tables

  Alias                    Address
  White House              1600 Pennsylvania Avenue
  Heinz Field              100 Art Rooney Avenue
  Empire State Building    350 5th Ave

Alias table

  Alias                         Address
  CMU                           5000 Forbes Av
  Carnegie Mellon               5000 Forbes Av
  Carnegie Mellon U             5000 Forbes Av
  Carnegie Mellon Univ          5000 Forbes Av
  Carnegie Mellon University    5000 Forbes Av
  Etc.
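
Two of the ideas above, alias substitution and the Soundex index, can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not how any particular geocoding engine works: the function names resolve_alias and simple_soundex are hypothetical, the alias entries are copied from the alias tables above, and the Soundex routine follows only the rules listed on the Soundex slide. The fuller rules described at the archives.gov link also give special treatment to the first letter and to h and w, which this sketch does not attempt.

```python
import re

# Alias table copied from the slides: place name -> street address.
ALIASES = {
    "white house": "1600 Pennsylvania Avenue",
    "heinz field": "100 Art Rooney Avenue",
    "empire state building": "350 5th Ave",
    "cmu": "5000 Forbes Av",
    "carnegie mellon": "5000 Forbes Av",
    "carnegie mellon university": "5000 Forbes Av",
}

# Soundex key table from the slides; letters not listed are disregarded.
SOUNDEX_KEYS = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def resolve_alias(text):
    """Return the aliased street address, or the input unchanged (hypothetical helper)."""
    key = " ".join(text.lower().split())  # normalize case and whitespace
    return ALIASES.get(key, text)


def simple_soundex(name):
    """Simplified Soundex that applies only the rules listed on the slide."""
    letters = re.sub(r"[^a-z]", "", name.lower())
    if not letters:
        return ""
    digits = []
    previous = None  # key of the immediately preceding letter, if it had one
    for ch in letters[1:]:  # the first character is kept as a letter
        key = SOUNDEX_KEYS.get(ch)  # None for a, e, h, i, o, u, y, w
        if key is not None and key != previous:
            digits.append(key)  # adjacent letters with the same key share one digit
        previous = key
    code = "".join(digits)[:3].ljust(3, "0")  # truncate or pad with zeros to 3 digits
    return f"{letters[0].upper()}-{code}"


print(resolve_alias("Heinz Field"))
print(simple_soundex("Smith"), simple_soundex("Smythe"))
```

Under these simplified rules the sketch reproduces the slide's examples, including the pair Schultz (S-243) and Shults (S-432), which sound alike but receive different indices.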
GEOCODING LAYER SOURCES

US Census TIGER files
  Digitized from 1:100,000 scale maps
  Pros:
    Free and easy to download
    Uniform across jurisdictional lines (nationally)
    Street address formatting works well with standard GIS geocoding capacities
  Cons:
    Incomplete data
    Placement of address point is approximate

TIGER line attribute table
  Census street centerlines extracted from lines that make up census boundaries
  tl_2009_04013_edges.shp
  "FEATCAT" = 'S'

MAF/TIGER
  Master Address File / Topologically Integrated Geographic Encoding and Referencing
  MAF is a complete inventory of housing units and businesses in the United States and its territories
  TIGER is a collection of lines as we know it
  MAF produces mail-out census forms and ACS random samples
  MAF/TIGER produces maps for on-the-ground census takers
  MAF is confidential
  TIGER 2009 and newer have much improved positional accuracy

US Census ZIP Codes
  ZIP Code Tabulation Areas (ZCTAs)
  Approximations for census purposes
  Do not reflect actual ZIP Code areas and are not kept up to date

Local jurisdictions: parcel address points
  Pros:
    Accurate placement of residential location (parcel positional data is often very good; e.g., +/- 5 meters or less)
  Cons:
    May need to contact individuals within agencies to get most up-to-date data
    May not be available, or may cost a substantial amount of money
    Data ends at jurisdictional boundaries
    Data files tend to be very large

Local jurisdictions: street centerlines
  Pros:
    Potential to be more up to date (often yearly updates, sometimes quarterly)
    Often accuracy adequate to meet city infrastructure needs (typically +/- 10 meters or less)
  Cons:
    May need to contact individuals within agencies to get most up-to-date data
    Data ends at jurisdictional boundaries

Private vendors
  StreetMap USA
    National dataset (US and Canada)
    Address locators prebuilt, can geocode across the United States
  GDT Dynamap/2000 US street data
    Small fee for individual ZIP Code layers
    Map layers are the highest quality street map layers in terms of appearance, completeness, and accuracy
    More than one million changes every quarter
    Maps include more than 14 million US street segments and include postal boundaries, landmarks, water features, and other features

Online geocoding
  ArcGIS.com, Google, GeoCommons, Maptive, etc.
  Pros:
    Fast and easy to access
    Free or inexpensive
  Cons:
    Loss of privacy/confidentiality
    Accuracy
    Usability in desktop GIS

GEOCODING IN ARCGIS

Create address locator
  ArcCatalog

Choose address locator style
  Skeleton of the address locator
  Based on data tables and reference layer

Address locator styles
  Style: US Address—Dual Ranges
    Reference dataset geometry: Lines
    Reference dataset representation: Address range for both sides of street segment
    Address search parameters: All address elements in a single field
    Examples: 320 Madison St.; N2W1700 County Rd.; 105-30 Union St.
    Applications: Finding a house on a specific side of the street
  Style: US Address—Single House
    Reference dataset geometry: Points or polygons
    Reference dataset representation: Each feature represents an address
    Address search parameters: All address elements in a single field
    Examples: 71 Cherry Ln.; W1700 Rock Rd.; 38-76 Carson Rd.
    Applications: Finding parcels, buildings, or address points
  Note: there are other styles…

Other styles… (build custom locators)
  Queens, NY
  Salt Lake City, UT
  Regions of Illinois & Wisconsin
  Germany
  … and many others!
Choose reference layer
  Streets, ZIP Codes

ArcGIS locator parameters

Geocode in ArcMap
  Add tabular data and streets layer
  Add address locator
  Geocode addresses
  View geocoding results
  Interactively rematch addresses

Address rematching
  Investigate unmatched addresses
  Generally requires expertise and knowledge of local streets
  Compare a street name in the attributes of the streets table and the address table

Prepare log file
  Log file includes reasons why addresses did not get geocoded
  Useful for future work on cleaning addresses or repairing street maps

  Incorrect address      Possible reason/solution
  490 Penn Avenue        Missing ZIP Code
  111 Hawksworth         Spelled incorrectly
  900 Smallman Street    TIGER street missing
  900 Lib Ave            Spelled incorrectly

Summary
  Geocoding overview
  Polygon geocoding
  Linear (street) geocoding
  Problems and solutions
  Geocoding layer sources
  Geocoding in ArcGIS

Next week: Tutorial chapter 9, and discussion of term projects – see iSchool syllabus links:
http://courses.ischool.utexas.edu/Arctur_David/2013/fall/385T/schedule.php