Title: Challenges in Accessing and Importing US Government Datasets into Relational Databases
Author: Andrew Ferlitsch
Date: June 17, 2013

Abstract: This paper discusses the challenges and solutions for accessing and importing datasets compiled and provided by US Government agencies. Historically, US government agencies have been mandated to make data collected and compiled with US tax dollars accessible to the public. In 2009, the Obama administration launched the Open Government Initiative (OGI) to provide a centralized repository for “public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government” (data.gov). What we've found is that the US government has no common convention for defining public access to the data it collects. Access methods, formats, layouts, and character sets are just some of the areas where the methodology differs from one department or agency to another, and across organizations within them. These differences continue to create a high-cost barrier for those compiling and aggregating high value datasets from multiple sources.

Topic Points:
o Access Interfaces: Interactive/Form Driven, Direct (HTTP/FTP) download
o Formats: Flat files/CSV, XML/RDF, Document/Excel/PDF
o Layout: Identifiers, Delineation, Delimiters, Units
o Character Encoding: ASCII/Latin

About the Author: Andrew Ferlitsch has an MS in Computer Science from Oregon State University. He is currently a principal researcher at Sharp Labs of America, where he is an inventor on 109 issued US technology-related patents. Mr. Ferlitsch is a co-founder of and contributor to the open data projects OpenGeoCode.Org and Geoforms.Org.
Popular (US) Government Databases

US Census Bureau
o 2010, 2000 and 1990 Census Data - Downloadable Datasets - www.census.gov/geo/maps-data/data/gazetteer.html
o Annual Estimates (2011, 2012) - Downloadable Datasets - www.census.gov/popest/
o American FactFinder - Interactive Search - factfinder2.census.gov
o Economic Census - www.census.gov/history/www/programs/economic/economic_census.html

US Geological Survey (USGS)
o Geographic Names Information System [Domestic] (GNIS) - geonames.usgs.gov/domestic
o National Elevation Dataset (NED) - ned.gov
o Earthquakes - earthquake.usgs.gov/earthquakes/map

National Geospatial-Intelligence Agency (NGA)
o GEOnet Names Server [Foreign] (GNS) - earth-info.nga.mil/gns/html
o Country Codes [Sunsetted] (FIPS) - earth-info.nga.mil/gns/html/gazetteers2.html

US Department of Education
o Education Statistics - nces.ed.gov

Bureau of Labor Statistics (BLS) - www.bls.gov/
o Consumer Price Index (CPI) - www.bls.gov/data
o Employment/Unemployment

United States Department of Agriculture (USDA) - www.nass.usda.gov
o Census of Agriculture 2012, 2007, 2002 - www.agcensus.usda.gov/index.php
o Historical Data (to 1840) - www.agcensus.usda.gov/Publications/2007/Online_Highlights/Fact_Sheets/Farm_Numbers/

US Energy Information Administration (EIA) - www.eia.gov
o Energy Production, Consumption, Price - www.data.gov/list/agency/5/6

National Weather Service (NWS) - weather.gov
o Weather - graphical.weather.gov/xm
o Climate Data - www.ncdc.noaa.gov/cdo-web/webservices/cdows_datasets
o Science on a Sphere (SOS)

Environmental Protection Agency - epa.gov
o Numerous Datasets - www.data.gov/list/agency/4/*

Bureau of Land Management (BLM) - www.blm.gov
o GIS Maps & Data

Access Interfaces

When accessing government datasets, one encounters a diverse set of interfaces.

Form/Interactive
Some datasets are only accessible through an interactive form.
Interactive forms generally occur on sites that do not give access to the entire dataset in a single query, but instead require you to request subsets of the dataset through a form. Form selections for narrowing a request typically include one or more of:
o Data Category
o Geographic (state, county, city, zip)
o Time Period
Obtaining the entire dataset this way is therefore tedious and labor-intensive. In the example below, for economic data from the US Economic Census, obtaining building permit data requires selecting a state, then a county and then a time period:
http://censtats.census.gov/cgi-bin/bldgprmt/bldgdisp.pl
The next example is from the NGA GEOnet Names Server (GNS) for foreign geographic name features, where you have to make a selection on a per-country basis:
http://earth-info.nga.mil/gns/html/namefiles.htm

Tabular
Some datasets are only available as data displayed in tabular format on the website. In this case, one must either copy/paste the data or copy it by hand. Below is an example of how the BLM presents summary data in tabular format:
http://www.blm.gov/wo/st/en/prog/blm_special_areas/NLCS/summary_tables.html

FTP download
Other government websites allow you to download the data using anonymous FTP. Below are FTP instructions for obtaining NCEP datasets from NOAA:
http://www.cpc.ncep.noaa.gov/products/wesley/ncep_data/help.html
On some government FTP sites, the FTP file directory is also made directly accessible as a file explorer page. In this case, the datasets can be downloaded directly (HTTP) without using an FTP client by clicking on the file link. Below is an example of obtaining datasets from the National Hydrography Dataset (NHD) from the USGS:
ftp://nhdftp.usgs.gov/DataSets/Staged/States/FileGDB/HighResolution/

HTTP download
One of the most common access methods for datasets is an HTTP download. Typically a page will be displayed with one or more selectable datasets and an accompanying link.
Clicking on the link performs an HTTP download of the dataset as a file. Below is an example of downloading the Public Use Micro Sample (PUMS) of the Survey of Business Owners (SBO) from the 2007 US Economic Survey:
http://www.census.gov/econ/sbo/pums.html

REST API
Some datasets are exposed through a REST API. Below is an example from the National Weather Service:
http://graphical.weather.gov/xml/rest.php

Open Data (.gov)
In recent years, local, state and federal government agencies have been standardizing the access interface to public datasets on the open data solution from Socrata, Inc. (socrata.com). The agencies that have adopted this interface use “data” as the subdomain, such as:
data.gov          centralized interface to federal government datasets
data.wa.gov       centralized interface to State of Washington government datasets
data.seattle.gov  centralized interface to City of Seattle government datasets
https://data.seattle.gov/

Formats

Flat Files/CSV
The most common format for downloadable government datasets is the flat file. Flat file formats can generally be defined as:
o one entry (record) per line
o an ordered list of delimited fields
In older formats (such as those still found in US Census Gazetteer datasets), fields are typically delimited by:
o fixed column length
o pipe (|) symbols
o tabs
These formats are giving way to CSV. CSV formats are identical to older flat file formats, but standardize on delimitation:
o Use commas (North America) or semi-colons (Europe) to delimit fields.
o Use double quotes around a field value if the value contains a delimiter.
In flat file formats, the first field is generally the primary key, which is generally followed by any secondary or foreign keys:
primary key, secondary key, data field 1, data field 2, …
Flat file formats continue to be popular because they:
o Require only minimal computer skills to parse.
o Are human readable.
o Can be imported into Excel spreadsheets (CSV files).
o Are exchangeable between common databases (e.g., MySQL, MSSQL, Postgres, etc.).
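The delimiter conventions above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the sample records, the table name place and its column names are hypothetical, and SQLite stands in for whichever relational database is actually targeted. Python's standard csv module handles both the older pipe-delimited layout and quoted CSV values.

```python
import csv
import io
import sqlite3

def read_delimited(text, delimiter=","):
    """Parse delimited flat-file text into a list of rows (lists of strings)."""
    # csv.reader honors double quotes around values that contain the delimiter.
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

# Hypothetical sample records: primary key, secondary key, data fields.
# The numeric values are made up for illustration only.
csv_text = '1001,WA,"Seattle, City of",100\n1002,WA,Tacoma,200\n'
pipe_text = "1001|WA|Seattle, City of|100\n1002|WA|Tacoma|200\n"

csv_rows = read_delimited(csv_text)
pipe_rows = read_delimited(pipe_text, delimiter="|")

# Load the parsed rows into an in-memory relational table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE place (id TEXT PRIMARY KEY, state TEXT, name TEXT, value TEXT)"
)
conn.executemany("INSERT INTO place VALUES (?, ?, ?, ?)", csv_rows)
stored = conn.execute("SELECT name FROM place ORDER BY id").fetchall()
print(stored)
```

Both inputs parse to the same rows; the quoted CSV value keeps its embedded comma, which is the case that naive splitting on the delimiter gets wrong.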
Open Data initiatives (e.g., data.gov) commonly identify older flat file formats and CSV formats as TXT and CSV respectively.

XML/RDF
XML is an alternative format to flat files for representing datasets. While XML requires more sophistication to parse, it has the advantage of being able to represent hierarchical relationships in data that would otherwise be difficult to express in a flat file. Below is an example XML representation of location information that is hierarchically related to a city, which in turn is hierarchically related to a state.

<state locale="Washington">
  <city locale="Seattle">
    <feature location="lat,lng"> … </feature>
  </city>
  <city locale="Tacoma">
  </city>
</state>

Below is an example of building permit data for the District of Columbia (from data.dc.gov):
http://data.dc.gov/NewCalendar.aspx?where=Citywide&area=&what=XML&date=Issueddate&from=5%2f14%2f2013+12%3a00%3a00+AM&to=6%2f13%2f2013+12%3a00%3a00+AM&dataset=DCRA_PERMIT&datasetid=5&whereInd=0&areaInd=0&whatInd=2&dateInd=0&whenInd=0&disposition=attachment&showextra=1

RDF/XML is a W3C standard for the self-describing representation of information (datasets) exchanged via the Internet. The format is XML, with the layout specified in a “subject-predicate-object” data model. Below is an example of an EPA dataset that has been converted to an RDF format by the Data-gov Wiki, a project by the Tetherless World Constellation at Rensselaer Polytechnic Institute:
http://data-gov.tw.rpi.edu/wiki

JSON
JSON (JavaScript Object Notation) is another alternative to flat files for representing datasets. This format has the same advantages in hierarchical representation as XML. The format is standardized and widely used for exchanging data across the web with browser client applications written in JavaScript.
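To load hierarchical data into a relational table, the hierarchy usually has to be flattened so that each row repeats its parent keys. The sketch below assumes a hypothetical JSON document that mirrors the state/city/feature hierarchy of the XML example above; the key names (state, cities, features, name, location) are illustrative and not taken from any real government dataset.

```python
import json

# Hypothetical JSON mirroring the state > city > feature hierarchy.
document = json.loads("""
{"state": "Washington",
 "cities": [
   {"city": "Seattle",
    "features": [{"name": "feature A", "location": "lat,lng"}]},
   {"city": "Tacoma", "features": []}
 ]}
""")

def flatten(state_doc):
    """Turn the nested hierarchy into flat (state, city, feature, location) rows."""
    rows = []
    for city in state_doc["cities"]:
        for feature in city["features"]:
            # Each row repeats its parent keys, as a relational table would.
            rows.append(
                (state_doc["state"], city["city"],
                 feature["name"], feature["location"])
            )
    return rows

rows = flatten(document)
print(rows)
```

A city with no features (Tacoma here) produces no rows; whether such parents should instead be kept in a separate table is a schema decision this sketch does not make.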
Below is the metadata description (in JSON) for the National Broadband Map:

{"status":"OK","responseTime":2,"message":[],"Results":[
  {"columnName":"geographyType","columnType":"Varchar(12)","columnDescription":"The geography type of the searched geography (County, Congressional District, etc.)"},
  {"columnName":"fips","columnType":"Varchar(15)","columnDescription":"The FIPS code of the geography. For identification purposes, each geography is assigned a FIPS code (Federal Information Processing Standard code) by the U.S. Census Bureau "},
  {"columnName":"name","columnType":"Varchar(100)","columnDescription":"The name of the geography"},
  {"columnName":"stateCode","columnType":"Varchar(2)","columnDescription":"The state code of the searched geography"},
  {"columnName":"envelope.minx","columnType":"Numeric(10,5)","columnDescription":"The longitude of the lower left corner of the minimum bounding rectangle of the geography"},
  {"columnName":"envelope.maxx","columnType":"Numeric(10,5)","columnDescription":"The longitude of the upper right corner of the minimum bounding rectangle of the geography"},
  {"columnName":"envelope.miny","columnType":"Numeric(10,5)","columnDescription":"The latitude of the lower left corner of the minimum bounding rectangle of the geography"},
  {"columnName":"envelope.maxy","columnType":"Numeric(10,5)","columnDescription":"The latitude of the upper right corner of the minimum bounding rectangle of the geography"}
]}

http://www.broadbandmap.gov/developer/api/census-api-by-coordinates

Excel
Some datasets are still provided only in a document format, which makes it difficult for end users to extract the information without manual copy/paste. Microsoft Excel is one document format commonly used for representing datasets in tabular form. An advantage of this format is that Excel has a built-in ‘Save As’ option that will convert and save (export) the data in CSV format.
Once the data has been exported to CSV, the dataset can be imported into a database by more conventional methods.
[Screenshot: the Save As dialog in Excel]

PDF
The Adobe Portable Document Format (PDF) is another commonly used document format for representing datasets, particularly when they come with accompanying textual information. There is no standard way of handling datasets in this form. Typically, an end user has to browse the document to locate the dataset, separate it from the accompanying text, and copy/paste it for extraction. Below is an example PDF publication from the US Census, the 2012 Statistical Abstract:
http://www.census.gov/compendia/statab/2012edition.html

Layouts

Keys/Identifiers
Government datasets use a variety of identifiers (primary keys) to distinguish one record from another. Many of these identifiers are specific to a single dataset, which makes it difficult to cross-correlate records between datasets from different agencies. A variety of approaches can be used to correlate records that relate to the same item or feature but do not share an identification system. Below are some examples I have found useful for this purpose:

Business Records:
o Matching by business name, DBAs and name on business license.
o Matching by phone numbers.
o Matching by reducing street address to standard abbreviated form.
  130 N.E. Fourth Avenue => 130 ne 4th ave
Public Buildings:
o Matching by reducing names to standard abbreviations.
  Mount Vernon Junior High School => Mt Vernon Jr High
Geographic Locations:
o Matching by short name, formal name, common nickname, and FIPS code.
  State of Vermont => Vermont
  Vermont, State of => Vermont
o Matching within a radius of a geographic coordinate.

Delineation
Datasets are usually delineated by some hierarchical grouping of the records. One of the most common delineations is by geographic region. For example, records may be delineated by state, then by county and then by city.
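The reductions listed under Keys/Identifiers (street addresses, building names, and state names in gazetteer order) can be sketched as simple normalization functions whose output serves as a matching key. This is a minimal sketch, not the author's actual implementation: the abbreviation tables are small hypothetical samples, and lowercasing the building-name key is an added assumption so that keys compare case-insensitively.

```python
import re

# Hypothetical, deliberately tiny abbreviation tables; a real table is larger.
ADDRESS_TABLE = {
    "northeast": "ne", "northwest": "nw", "southeast": "se", "southwest": "sw",
    "fourth": "4th", "avenue": "ave", "street": "st",
}
BUILDING_TABLE = {"mount": "mt", "junior": "jr", "senior": "sr"}
BUILDING_DROP = {"school"}

def reduce_tokens(text, table, drop=()):
    """Lowercase, strip periods and commas, then abbreviate word by word."""
    tokens = re.sub(r"[.,]", "", text.lower()).split()
    return " ".join(table.get(t, t) for t in tokens if t not in drop)

def normalize_address(address):
    return reduce_tokens(address, ADDRESS_TABLE)

def normalize_building(name):
    return reduce_tokens(name, BUILDING_TABLE, BUILDING_DROP)

def normalize_state(name):
    """Reduce 'State of X' and gazetteer-order 'X, State of' to 'X'."""
    match = re.match(
        r"^(?:State of\s+(.+)|(.+),\s*State of)$", name.strip(), re.IGNORECASE
    )
    if match:
        return (match.group(1) or match.group(2)).strip()
    return name.strip()

print(normalize_address("130 N.E. Fourth Avenue"))
print(normalize_building("Mount Vernon Junior High School"))
print(normalize_state("Vermont, State of"))
```

Two records whose normalized keys are equal become candidates for the same item; a key match is a heuristic and would normally be confirmed against a second attribute such as a phone number or coordinate.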
In the case of a flat file, the delineation typically follows immediately after the record's unique identifier, such as:
identifier, state, county, city, data, …
The state, county and city identifiers may appear as a short or formal name, in gazetteer or proper reading order, or as a formal abbreviation or FIPS code. The following all refer to the State of Vermont:
o Vermont
o State of Vermont
o Vermont, State of
o VT
o 50
In the case of an XML file, the delineation typically encapsulates all the records within the geographic boundary, such as:

<state name="Vermont">
  <county name="Franklin">
    <city name="Bakersfield">
      <record id="1" />
      <record id="2" />
      …
    </city>
    <city name="Berkshire"> … </city>
    …
  </county>
  <county name="Essex"> … </county>
  …
</state>

Delimiters
A variety of methods for delimiting fields are found within government datasets. For flat files, fields are generally delimited by one of:
o Fixed Column Width
o Tab
o Special Character: pipe symbol (|), comma (,) or semi-colon (;)
In the latter case, if a field value contains the delimiter, the value is generally enclosed in double quotes, or the delimiter is preceded by a backslash to turn off its meaning as a delimiter.
[Screenshot: the US Census 2000 Places gazetteer dataset]
In this example, the fields are delimited by a fixed column width. The first column (2 chars) is the state abbreviation, followed by the state FIPS code (2 chars), which is then followed by the place FIPS code (5 chars), which is used as the primary key. Following this is a combination of the place name and locality type. Note that the locality type is part of the place name and is inconsistently delimited. In this case, it is the last word in the field, separated by a space (e.g., city).

Units
Another issue is numerical units of measurement. Measurements are typically given in US Standard units, but a dataset may also provide them in metric units, particularly if the dataset is scientific in nature.
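The fixed-column layout described under Delimiters can be parsed by slicing each line at known character offsets. The sketch below covers only the leading fields of the Census 2000 Places gazetteer record; it assumes the name field occupies columns 10-73 as given in the file's published record layout, and the sample record (place FIPS 00000, "Exampleville city") is hypothetical.

```python
def parse_place(line):
    """Slice one fixed-width record into named fields (0-based slices)."""
    name_field = line[9:73].strip()          # columns 10-73: name + locality type
    # The locality type is taken to be the last space-separated word.
    name, _, locality = name_field.rpartition(" ")
    if not name:                             # single word: nothing to split off
        name, locality = name_field, ""
    return {
        "state": line[0:2],                  # columns 1-2: state abbreviation
        "state_fips": line[2:4],             # columns 3-4: state FIPS code
        "place_fips": line[4:9],             # columns 5-9: place FIPS code
        "name": name,
        "locality_type": locality,
    }

# Hypothetical record built to the documented widths (2 + 2 + 5 + 64 chars).
sample = "VT" + "50" + "00000" + "Exampleville city".ljust(64)
record = parse_place(sample)
print(record)
```

Because the locality type is inconsistently delimited, treating the last word as the type is only a first approximation; names whose type spans more than one word would need a lookup table of known locality types.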
The US Census 2000 Places gazetteer file is an example of a dataset that specifies measurements in more than one system, both US Standard and metric:

Columns 1-2      United States Postal Service State Abbreviation
Columns 3-4      State Federal Information Processing Standard (FIPS) code
Columns 5-9      Place FIPS Code
Columns 10-73    Name
Columns 74-82    Total Population (2000)
Columns 83-91    Total Housing Units (2000)
Columns 92-105   Land Area (square meters) - Created for statistical purposes only.
Columns 106-119  Water Area (square meters) - Created for statistical purposes only.
Columns 120-131  Land Area (square miles) - Created for statistical purposes only.
Columns 132-143  Water Area (square miles) - Created for statistical purposes only.
Columns 144-153  Latitude (decimal degrees). First character is blank or "-", denoting North or South latitude respectively.
Columns 154-164  Longitude (decimal degrees). First character is blank or "-", denoting East or West longitude respectively.

www.census.gov/geo/maps-data/data/gazetteer2000.html

Other Issues - Historical and Out-of-Date Records
Another issue to consider when using government datasets is that many of them are a continuum of data collected over a long period of time, and may contain records that are historical or otherwise out of date. For example, using the USGS GNIS search form to find records containing “San Francisco” in the feature class “Civil” produces the following results, which include historical records such as the “San Francisco” record in Los Angeles County, a reference to an old land grant (1842) whose formal name is Rancho San Francisco.
Feature Name                   ID       Class  County         State  Latitude  Longitude  Ele(ft)*  Map**                BGN Date  Entry Date
City of San Francisco          2411786  Civil  San Francisco  CA     374642N   1222633W   256       San Francisco North  -         11-MAR-2008
City of South San Francisco    2411942  Civil  San Mateo      CA     373910N   1222036W   0         Hunters Point        -         11-MAR-2008
San Francisco                  273473   Civil  Los Angeles    CA     342425N   1183643W   1440      Newhall              -         19-JAN-1981
San Francisco County           277302   Civil  San Francisco  CA     374642N   1222633W   256       San Francisco North  -         01-FEB-1993
San Francisco De Las Llagas    234626   Civil  Santa Clara    CA     370427N   1213722W   315       Gilroy               -         19-JAN-1981
San Francisco Division         1935284  Civil  San Francisco  CA     374642N   1222633W   256       San Francisco North  -         26-SEP-2001
San Francisco Mining District  850949   Civil  White Pine     NV     392430N   1145133W   7034      McGill               -         01-MAY-1990
San Francisco Subbarrio        2415992  Civil  Humacao        PR     180911N   0654921W   59        Humacao              -         21-MAR-2008
San Francisco Subbarrio        2415993  Civil  San Juan       PR     182757N   0660647W   36        San Juan             -         21-MAR-2008
South San Francisco Division   1935322  Civil  San Mateo      CA     373842N   1222443W   13        San Francisco South  -         26-SEP-2001
Township of San Francisco      665551   Civil  Carver         MN     444133N   0934235W   889       Jordan West          -         01-SEP-1995
Wake Island                    1802700  Civil  Wake Island    UM     191655N   1663901E   30        Unknown              -         06-OCT-1998

Characters
Another issue to consider is the handling of non-ASCII characters. Because English text can be written entirely in ASCII, many datasets are stored in ASCII format (7-bit characters). This can be problematic if the dataset contains place names with Latin diacritic characters, such as place names from the US territory of Puerto Rico or other names whose official spelling is not native to US English.
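One way to cope with mixed encodings on import is to decode each file defensively and to keep an ASCII-folded copy of each name for matching, while storing the official spelling unchanged. The sketch below is an illustration under stated assumptions: it supposes the source is either UTF-8 or Latin-1 (ISO 8859-1), which is a heuristic rather than a guarantee, and the function names are hypothetical.

```python
import unicodedata

def decode_bytes(raw):
    """Decode as UTF-8 when valid, otherwise fall back to Latin-1.

    Assumption: the source is one of these two encodings. Latin-1 accepts
    any byte sequence, so the fallback never fails but may be wrong for
    files in some other encoding.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

def ascii_fold(text):
    """Strip diacritics to build an ASCII-only key for matching."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Place names from Puerto Rico, encoded two different ways.
latin1_bytes = "Mayagüez".encode("latin-1")
utf8_bytes = "Añasco".encode("utf-8")

for raw in (latin1_bytes, utf8_bytes):
    name = decode_bytes(raw)
    print(name, "->", ascii_fold(name))
```

Storing both the original name and its folded key lets a query match an ASCII-only record from one agency against the correctly accented record from another.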