Paper: Using Government Data

Title: Challenges in Accessing and Importing US Government Datasets into Relational Databases
Author: Andrew Ferlitsch
Date: June 17, 2013
Abstract: This paper discusses the challenges and solutions for accessing and importing datasets compiled
and provided by US Government agencies. Historically, US government agencies have been mandated to
make data collected and compiled with US tax dollars accessible to the public. In 2009, the Obama
administration launched the Open Government Initiative (OGI) to provide a centralized repository for
“public access to high value, machine readable datasets generated by the Executive Branch of the Federal
Government.” (data.gov).
We have found that there is no common convention within the US government for providing public
access to collected data. The access methods, format, layout, and character sets are just some examples
where the methodology differs from one department/agency to another, and across organizations within them.
These differences continue to create a high-cost barrier for those compiling and aggregating high value
datasets from multiple sources.
Topic Points:
• Access Interfaces: Interactive/Form Driven, Direct (HTTP/FTP) download
• Formats: Flat files/CSV, XML/RDF, Document/Excel/PDF
• Layout: Identifiers, Delineation, Delimiters, Units
• Character Encoding: ASCII/Latin
About the Author: Andrew Ferlitsch has an MS in Computer Science from Oregon State University. He is
currently a principal researcher at Sharp Labs of America, where he is an inventor on 109 US issued
technology-related patents. Mr. Ferlitsch is a co-founder of and contributor to the open data projects
OpenGeoCode.Org and Geoforms.Org.
Popular (US) Government Databases
US Census Bureau
• 2010, 2000 and 1990 Census Data – Downloadable Datasets - www.census.gov/geo/maps-data/data/gazetteer.html
• Annual Estimates (2011, 2012) – Downloadable Datasets - www.census.gov/popest/
• American FactFinder – Interactive Search - factfinder2.census.gov
• Economic Census - www.census.gov/history/www/programs/economic/economic_census.html

US Geological Survey (USGS)
• Geographic Names Information System [Domestic] (GNIS) - geonames.usgs.gov/domestic
• National Elevation Dataset (NED) – ned.gov
• Earthquakes - earthquake.usgs.gov/earthquakes/map

National Geospatial-Intelligence Agency (NGA)
• GEOnet Names Server [Foreign] (GNS) - earth-info.nga.mil/gns/html
• Country Codes [Sunsetted] (FIPS) - earth-info.nga.mil/gns/html/gazetteers2.html

US Department of Education
• Education Statistics - nces.ed.gov

Bureau of Labor Statistics (BLS) - www.bls.gov/
• Consumer Price Index (CPI) – www.bls.gov/data
• Employment/Unemployment

United States Department of Agriculture (USDA) - www.nass.usda.gov
• Census of Agriculture 2012, 2007, 2002 - www.agcensus.usda.gov/index.php
• Historical Data (to 1840) - www.agcensus.usda.gov/Publications/2007/Online_Highlights/Fact_Sheets/Farm_Numbers/

US Energy Information Administration (EIA) - www.eia.gov
• Energy Production, Consumption, Price - www.data.gov/list/agency/5/6
National Weather Service (NWS) – weather.gov
• Weather - graphical.weather.gov/xm
• Climate Data - www.ncdc.noaa.gov/cdo-web/webservices/cdows_datasets
• Science on a Sphere (SOS) -

Environmental Protection Agency – epa.gov
• Numerous Datasets - www.data.gov/list/agency/4/*

Bureau of Land Management (BLM) – www.blm.gov
• GIS Maps & Data -
Access Interfaces
Government datasets are exposed through a diverse set of access interfaces.
Form/Interactive
Some datasets are only accessible through an interactive form. Interactive forms generally occur on sites
that do not give access to the entire dataset in a single query, but instead require you to request subsets
of the dataset through form interactions. The form selections for narrowing a request typically include
one or more of:
• Data Category
• Geographic (state, county, city, zip)
• Time Period
Obtaining the entire dataset is therefore tedious and labor intensive. In the example below, from the
US Economic Census, obtaining building permit data requires selecting a state, then a county, and then
a time period:
http://censtats.census.gov/cgi-bin/bldgprmt/bldgdisp.pl
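When a form accepts its selections as query parameters, the repetitive requests can sometimes be scripted. The following is a minimal sketch only: the base URL and the parameter names (state, county, period) are hypothetical placeholders, not the fields of the Census form above, which would have to be read from the form's HTML.

```python
from itertools import product
from urllib.parse import urlencode

# Placeholder endpoint and field names (assumptions); a real form defines its own.
BASE_URL = "http://example.invalid/form"


def build_requests(states, counties_by_state, periods):
    """Yield one query URL per (state, county, period) combination."""
    for state in states:
        counties = counties_by_state.get(state, [])
        for county, period in product(counties, periods):
            query = urlencode({"state": state, "county": county, "period": period})
            yield f"{BASE_URL}?{query}"


urls = list(build_requests(["VT"], {"VT": ["Franklin", "Essex"]}, ["2011", "2012"]))
```

Each generated URL would then be fetched and its result appended to a local copy of the dataset. The number of requests grows as the product of the selections, which is why this access method is costly.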
The next example is from the NGA geographic name server (GNS) for foreign geographic name features,
where you have to select on a per country basis.
http://earth-info.nga.mil/gns/html/namefiles.htm
Tabular
Some datasets are only available as data displayed in tabular format on the website. In this case, one must
either copy/paste the data or hand copy it.
Below is an example from the BLM of how it presents summary data in tabular format:
http://www.blm.gov/wo/st/en/prog/blm_special_areas/NLCS/summary_tables.html
FTP download
Other government websites allow you to download the data using anonymous FTP. Below are FTP
instructions for obtaining NCEP datasets from NOAA.
http://www.cpc.ncep.noaa.gov/products/wesley/ncep_data/help.html
On some government FTP sites, the FTP file directory is also made directly accessible as a file
explorer page. In this case, the datasets can be downloaded directly (HTTP), without using FTP, by clicking on
the file link. Below is an example of obtaining datasets from the National Hydrography Dataset (NHD) from
the USGS.
ftp://nhdftp.usgs.gov/DataSets/Staged/States/FileGDB/HighResolution/
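Anonymous FTP access can also be scripted. The sketch below uses Python's standard ftplib module; the helper names are my own, and whether a given server still accepts anonymous logins is an assumption that has to be checked per site.

```python
from ftplib import FTP
from urllib.parse import urlparse


def split_ftp_url(url):
    """Split an ftp:// URL into (host, directory path)."""
    parts = urlparse(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    return parts.hostname, parts.path or "/"


def list_directory(url, timeout=30):
    """Return the names found in the directory named by an ftp:// URL."""
    host, directory = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()          # no arguments means an anonymous login
        ftp.cwd(directory)
        return ftp.nlst()


def download(url, filename, dest_path, timeout=30):
    """Download one file from the directory named by an ftp:// URL."""
    host, directory = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp, open(dest_path, "wb") as out:
        ftp.login()
        ftp.cwd(directory)
        ftp.retrbinary(f"RETR {filename}", out.write)
```

Only split_ftp_url runs without a network connection; the other two functions contact the server when called.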
HTTP download
One of the most common access methods for datasets is an HTTP download. Typically a page will be
displayed with one or more selectable datasets and an accompanying link. Upon clicking on the link, an
HTTP download of the dataset (as a file) is performed. Below is an example of downloading the Public Use
Microdata Sample (PUMS) of the Survey of Business Owners (SBO) from the 2007 US Economic Census:
http://www.census.gov/econ/sbo/pums.html
REST API
http://graphical.weather.gov/xml/rest.php
Open Data (.gov)
In recent years, local, state and federal government agencies have been standardizing the access
interface to public datasets on the open data solution from Socrata, Inc (socrata.com). The agencies
that have adopted this interface use “data” as the subdomain, such as:
data.gov - centralized interface to federal government datasets
data.wa.gov - centralized interface to State of Washington government datasets
data.seattle.gov - centralized interface to City of Seattle government datasets
https://data.seattle.gov/
Formats
Flat Files/CSV
The most common format for downloadable government datasets is the flat file. Flat file formats
can be generally defined as:
• one entry (record) per line
• ordered list of delimited fields
In older formats (such as those still found in US Census Gazetteer datasets), fields are typically delimited by:
• Fixed column length
• Pipe (|) symbols
• Tabs
These formats are giving way to CSV formats. CSV formats are identical to older flat file
formats, but standardize on delimitation:
• Use commas (North America) or semi-colons (Europe) to delimit fields.
• Use double quotes around a field value if the value contains a delimiter.
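The two CSV conventions above can be sketched with Python's standard csv module. The sample records are illustrative values I made up, not rows from a government dataset.

```python
import csv
import io


def read_delimited(text, delimiter=","):
    """Parse delimited text into rows; double quotes protect embedded delimiters."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar='"')
    return list(reader)


# Illustrative records only.
north_american = 'id,name,state\n1,"Washington, District of Columbia",DC\n'
european = 'id;name;value\n1;"a;b";3,5\n'

rows_us = read_delimited(north_american)
rows_eu = read_delimited(european, delimiter=";")
```

In the semi-colon variant the comma is free to serve as a decimal separator, so the value 3,5 arrives as a single field and still needs locale-aware conversion before it is stored as a number.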
In flat file formats, the first field is generally the primary key, which is generally followed by any secondary
or foreign keys.
primary key, secondary key, data field 1, data field 2, …
Flat file formats continue to be popular because they:
• Require only minimal computer skills to parse.
• Are human readable.
• Can be imported into Excel spreadsheets (CSV files).
• Are exchangeable between common databases (e.g., MySQL, MSSQL, Postgres, etc).
Open Data initiatives (e.g., data.gov) commonly identify older flat file formats and CSV formats as TXT and
CSV respectively.
XML/RDF
XML is an alternative format to flat files for representing datasets. While XML requires more
sophistication in parsing, it has the advantage of being able to represent hierarchical relationships in
data that would otherwise be difficult to express in a flat file.
Below is an example XML representation of location information that is hierarchically related to a city,
which in turn is hierarchically related to a state.
<state locale="Washington">
  <city locale="Seattle">
    <feature location="lat,lng"> … </feature>
  </city>
  <city locale="Tacoma">
  </city>
</state>
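To load such a hierarchy into a relational table, each nested record is typically flattened into a row that carries its parent keys. A minimal sketch using Python's standard xml.etree.ElementTree follows; the element and attribute names are those of the example above, while the coordinate and feature text are illustrative placeholders.

```python
import xml.etree.ElementTree as ET

# Same shape as the example above; the coordinate and text are placeholders.
SAMPLE = """
<state locale="Washington">
  <city locale="Seattle">
    <feature location="47.61,-122.33">Example feature</feature>
  </city>
  <city locale="Tacoma">
  </city>
</state>
"""


def flatten(xml_text):
    """Return one (state, city, location, text) row per <feature> element."""
    rows = []
    state = ET.fromstring(xml_text.strip())
    for city in state.findall("city"):
        for feature in city.findall("feature"):
            rows.append((
                state.get("locale"),
                city.get("locale"),
                feature.get("location"),
                (feature.text or "").strip(),
            ))
    return rows


rows = flatten(SAMPLE)
```

A city with no features, such as Tacoma here, contributes no rows. Whether it should still appear in a separate city table is a schema decision.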
Below is an example of building permit data for the District of Columbia (from data.dc.gov):
http://data.dc.gov/NewCalendar.aspx?where=Citywide&area=&what=XML&date=Issueddate&from=5%2f14%2f2013+12%3a00%3a00+AM&to=6%2f13%2f2013+12%3a00%3a00+AM&dataset=DCRA_PERMIT&datasetid=5&whereInd=0&areaInd=0&whatInd=2&dateInd=0&whenInd=0&disposition=attachment&showextra=1
RDF/XML is a W3C standard for a self-describing representation of information (datasets) that is
exchanged via the Internet. The format is XML, with the layout specified in a “subject-predicate-object”
data model.
Below is an example of an EPA dataset that has been converted to RDF format by the Data-gov Wiki, a project
of the Tetherless World Constellation at Rensselaer Polytechnic Institute.
http://data-gov.tw.rpi.edu/wiki
JSON
JSON is another alternative to flat files for representing datasets. This format (JavaScript Object Notation) has
the same advantages in hierarchical representation as XML. The format is standardized and widely used for
exchanging data across the web with browser client applications written in JavaScript.
Below is the metadata description (in JSON) for the National Broadband Map:
{"status":"OK","responseTime":2,"message":[],"Results":[{"columnName":"geographyType","columnType":
"Varchar(12)","columnDescription":"The geography type of the searched geography (County, Congressional
District, etc.)"},{"columnName":"fips","columnType":"Varchar(15)","columnDescription":"The FIPS code of
the geography. For identification purposes, each geography is assigned a FIPS code (Federal Information
Processing Standard code) by the U.S. Census Bureau
"},{"columnName":"name","columnType":"Varchar(100)","columnDescription":"The name of the
geography"},{"columnName":"stateCode","columnType":"Varchar(2)","columnDescription":"The state code
of the searched
geography"},{"columnName":"envelope.minx","columnType":"Numeric(10,5)","columnDescription":"The
longitude of the lower left corner of the minimum bounding rectangle of the
geography"},{"columnName":"envelope.maxx","columnType":"Numeric(10,5)","columnDescription":"The
longitude of the upper right corner of the minimum bounding rectangle of the
geography"},{"columnName":"envelope.miny","columnType":"Numeric(10,5)","columnDescription":"The
latitude of the lower left corner of the minimum bounding rectangle of the
geography"},{"columnName":"envelope.maxy","columnType":"Numeric(10,5)","columnDescription":"The
latitude of the upper right corner of the minimum bounding rectangle of the geography"}]}
http://www.broadbandmap.gov/developer/api/census-api-by-coordinates
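Metadata of this kind maps naturally onto a relational schema. The sketch below derives a CREATE TABLE statement from a trimmed excerpt of the metadata above (three of the eight columns, descriptions omitted) and checks it against an in-memory SQLite database. The table name geography and the function name are my own assumptions, and SQLite is used here only because it ships with Python.

```python
import json
import sqlite3

# Trimmed excerpt of the metadata shown above; descriptions omitted.
METADATA = """{"status": "OK", "Results": [
  {"columnName": "geographyType", "columnType": "Varchar(12)"},
  {"columnName": "fips", "columnType": "Varchar(15)"},
  {"columnName": "envelope.minx", "columnType": "Numeric(10,5)"}]}"""


def create_table_sql(table, metadata_json):
    """Build a CREATE TABLE statement from columnName/columnType metadata."""
    columns = json.loads(metadata_json)["Results"]
    defs = []
    for col in columns:
        # Quote identifiers so that names containing a dot remain valid.
        name = col["columnName"].replace('"', '""')
        defs.append(f'"{name}" {col["columnType"]}')
    return f'CREATE TABLE "{table}" ({", ".join(defs)})'


sql = create_table_sql("geography", METADATA)   # hypothetical table name
conn = sqlite3.connect(":memory:")
conn.execute(sql)
column_names = [row[1] for row in conn.execute('PRAGMA table_info("geography")')]
```

The column types are interpolated into the statement as-is, so this is only appropriate for metadata from a trusted source. Other databases may also need the type names translated.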
Excel
Some datasets are still provided only in a document format, which makes it difficult for end users to extract
the information without manual copy/paste.
Microsoft Excel is one document format that is commonly used for representing datasets in a tabular
format. An advantage of this document format is that Excel has a built-in ‘Save As’ file option that will
convert and save (export) the data into a CSV format. Once the data has been exported into a CSV format,
the dataset can then be imported into databases by more conventional methods. Below is a screenshot of
the Save As dialog in Excel:
PDF
The Adobe Portable Document Format (PDF) is another commonly used document format for representing
datasets, particularly if they come with accompanying textual information. There is no standard way of
handling datasets in this form. Typically, an end user has to browse the document to locate the dataset,
separate it from the accompanying text, and copy/paste it for extraction.
Below is an example PDF publication from the US Census from the 2012 Statistical Abstract:
http://www.census.gov/compendia/statab/2012edition.html
Layouts
Keys/Identifiers
Government datasets use a variety of identifiers (primary keys) to uniquely identify each record.
Many of these identifiers are specific to the dataset, which makes it difficult to cross-correlate records
between datasets from different agencies. A variety of approaches can be used to correlate records that relate
to the same item/feature but do not share the same identification system. Below are some examples I have
found useful for this purpose:
• Business Records:
  o Matching by business name, DBAs and name on business license.
  o Matching by phone numbers.
  o Matching by reducing street address to standard abbreviated form.
    130 N.E. Fourth Avenue => 130 ne 4th ave
• Public Buildings:
  o Matching by reducing names to standard abbreviations.
    Mount Vernon Junior High School => Mt Vernon Jr High
• Geographic Locations:
  o Matching by short name, formal name, common nickname, and FIPS codes.
    State of Vermont => Vermont
    Vermont, State of => Vermont
  o Matching within a radius of a geographic coordinate.
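Two of these approaches can be sketched briefly: reducing a street address to an abbreviated form, and testing whether two coordinates fall within a radius. The abbreviation table is a small illustrative subset of my own choosing, and the radius test uses the haversine great-circle formula with a mean Earth radius of 6371 km, which is an assumption rather than a method named in this paper.

```python
import math
import re

# Small illustrative abbreviation table; a production table would be far larger.
ABBREVIATIONS = {
    "northeast": "ne", "northwest": "nw", "southeast": "se", "southwest": "sw",
    "north": "n", "south": "s", "east": "e", "west": "w",
    "avenue": "ave", "street": "st", "boulevard": "blvd", "road": "rd",
    "first": "1st", "second": "2nd", "third": "3rd", "fourth": "4th", "fifth": "5th",
}


def normalize_address(address):
    """Lower-case, drop punctuation, and abbreviate known words."""
    text = address.lower().replace(".", "")      # "N.E." becomes "ne"
    text = re.sub(r"[^a-z0-9 ]", " ", text)      # other punctuation becomes a space
    return " ".join(ABBREVIATIONS.get(word, word) for word in text.split())


def within_radius(lat1, lon1, lat2, lon2, radius_km):
    """True if two points in decimal degrees lie within radius_km of each other."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a)) <= radius_km


key = normalize_address("130 N.E. Fourth Avenue")
```

Two records whose normalized keys agree, or whose coordinates fall within a small radius, are candidates for the same feature. A final confirmation step is usually still needed.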
Delineation
Datasets are usually delineated by some hierarchical grouping of the records. One of the most common
delineations is by geographic region. For example, records may be delineated by state, then county and
then by city. In the case of a flat file, the delineation typically immediately follows the record’s unique
identifier, such as:
identifier, state, county, city, data, …
The state, county and city identifiers may appear as a short or formal name, in gazetteer or proper reading
order, as a formal abbreviation, or as a FIPS code. The following all refer to the State of Vermont:
Vermont
State of Vermont
Vermont, State of
VT
50
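One way to handle these variants during import is to map each of them to a single canonical key. A minimal sketch follows, assuming a lookup table that holds only the Vermont entry from the example above; a full table would cover every state and territory.

```python
import re

# Single illustrative entry; a full table would list every state and territory.
STATES = {"VT": {"name": "Vermont", "fips": "50"}}


def build_index(states):
    """Map each known spelling (abbreviation, FIPS code, short name) to the abbreviation."""
    index = {}
    for abbr, info in states.items():
        index[abbr.lower()] = abbr
        index[info["fips"]] = abbr
        index[info["name"].lower()] = abbr
    return index


INDEX = build_index(STATES)


def canonical_state(value):
    """Return the postal abbreviation for a state identifier, or None if unknown."""
    text = value.strip().lower()
    if "," in text:
        # Gazetteer order: "vermont, state of" becomes "state of vermont".
        head, tail = (part.strip() for part in text.split(",", 1))
        text = f"{tail} {head}"
    text = re.sub(r"^state of\s+", "", text)
    return INDEX.get(text)


variants = ["Vermont", "State of Vermont", "Vermont, State of", "VT", "50"]
results = [canonical_state(v) for v in variants]
```

A bare number such as 50 is only safe to interpret as a FIPS code when the column is known to hold state identifiers.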
In the case of an XML file, the delineation typically encapsulates all the records within the geographic
boundary, such as:
<state name="Vermont">
  <county name="Franklin">
    <city name="Bakersfield">
      <record 1 />
      <record 2 />
      …
    </city>
    <city name="Berkshire">
      …
    </city>
    …
  </county>
  <county name="Essex">
    …
  </county>
  …
</state>
Delimiters
A variety of methods for delimiting fields are found within government datasets. For flat files, fields are
generally delimited by one of:
o Fixed Column Width
o Tab
o Special Character: pipe symbol (|), comma (,) or semi-colon (;)
In the latter case, if a field value contains the delimiter, the value is generally enclosed in
double quotes, or the delimiter is preceded by a backslash to turn off its meaning as a delimiter.
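The backslash convention can be handled with Python's standard csv module by disabling quote processing and naming the escape character. The record below is an illustrative value, not taken from a real dataset.

```python
import csv
import io


def read_escaped(text, delimiter="|"):
    """Parse delimited text in which a backslash escapes an embedded delimiter."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        quoting=csv.QUOTE_NONE,   # treat double quotes as ordinary characters
        escapechar="\\",
    )
    return list(reader)


# The file content for this record is: 1|Smith\|Jones Park|VT
sample = "1|Smith\\|Jones Park|VT\n"
rows = read_escaped(sample)
```

Tab-delimited files can be read the same way by passing delimiter="\t".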
Below is a screenshot of the US Census 2000 Places gazetteer dataset. In this example, the fields are
delimited by a fixed column width. The first column (2 chars) is the state abbreviation, followed by the state
FIPS code (2 chars), which is then followed by the place FIPS code (5 chars), which is used as the primary key.
Following this is a field that combines the place name and locality type. Note that the locality type is part of
the place name and is inconsistently delimited. In this case, it is the last word in the field, separated by
a space (e.g., city).
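A minimal sketch of separating the locality type from the place name follows, assuming the type is the last space-separated word and belongs to a known list. The list of types and the sample names are my own assumptions; the actual file would need to be inspected for the full set.

```python
# Assumed list of locality types; verify against the actual file.
KNOWN_TYPES = {"city", "town", "village", "borough", "CDP"}


def split_place_name(field):
    """Split 'Name type' into (name, type); type is None if not recognized."""
    text = field.strip()
    name, _, last_word = text.rpartition(" ")
    if name and last_word in KNOWN_TYPES:
        return name, last_word
    return text, None


parts = split_place_name("Montpelier city")
```

Checking the last word against a known list avoids truncating names that merely end in an ordinary word.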
Units
Another issue is numerical units of measurement. Measurements are typically given in US Standard
units, but some datasets also provide them in metric units, particularly if the dataset is
scientific in nature.
The US Census 2000 Places gazetteer file is an example of a dataset that reports the same measurements
in both US Standard and metric units.
Columns 1-2: United States Postal Service State Abbreviation
Columns 3-4: State Federal Information Processing Standard (FIPS) code
Columns 5-9: Place FIPS Code
Columns 10-73: Name
Columns 74-82: Total Population (2000)
Columns 83-91: Total Housing Units (2000)
Columns 92-105: Land Area (square meters) - Created for statistical purposes only.
Columns 106-119: Water Area (square meters) - Created for statistical purposes only.
Columns 120-131: Land Area (square miles) - Created for statistical purposes only.
Columns 132-143: Water Area (square miles) - Created for statistical purposes only.
Columns 144-153: Latitude (decimal degrees) First character is blank or "-" denoting North or South latitude respectively
Columns 154-164: Longitude (decimal degrees) First character is blank or "-" denoting East or West longitude respectively
www.census.gov/geo/maps-data/data/gazetteer2000.html
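A fixed-width record with this layout can be parsed by slicing each line at the documented column positions. The sketch below uses a synthetic record with made-up values (a fictitious state "XX" and place), built only to match the column widths. It also cross-checks the two land area fields using 1 square mile = 2,589,988.11 square meters.

```python
# (field name, first column, last column, converter); columns are 1-based, inclusive.
FIELDS = [
    ("state", 1, 2, str), ("state_fips", 3, 4, str), ("place_fips", 5, 9, str),
    ("name", 10, 73, str), ("population", 74, 82, int), ("housing_units", 83, 91, int),
    ("land_sq_m", 92, 105, int), ("water_sq_m", 106, 119, int),
    ("land_sq_mi", 120, 131, float), ("water_sq_mi", 132, 143, float),
    ("latitude", 144, 153, float), ("longitude", 154, 164, float),
]

SQ_M_PER_SQ_MI = 2589988.11


def parse_record(line):
    """Slice one fixed-width line into a dict of typed values."""
    record = {}
    for name, start, end, convert in FIELDS:
        record[name] = convert(line[start - 1:end].strip())
    return record


# Synthetic record with made-up values, padded to the documented widths.
sample = (
    "XX" + "99" + "99999" + "Example city".ljust(64)
    + "1000".rjust(9) + "400".rjust(9)
    + "2589988".rjust(14) + "0".rjust(14)
    + "1.000000".rjust(12) + "0.000000".rjust(12)
    + "44.000000".rjust(10) + "-72.000000".rjust(11)
)

record = parse_record(sample)
units_agree = abs(record["land_sq_m"] / SQ_M_PER_SQ_MI - record["land_sq_mi"]) < 0.01
```

Storing only one unit system in the database and deriving the other on demand avoids keeping two values that can drift apart.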
Other Issues – Historical and Out-of-Date Records
Another issue to consider when using government datasets is that many of them have been
collected continuously over a long period of time, and may contain records that are historical or
otherwise out of date.
For example, using the USGS GNIS search form for records containing “San Francisco” in the feature class
“Civil” will produce the following results, which include historical records, such as the San Francisco record
in Los Angeles County, which refers to an old land grant (1842) whose formal name is Rancho San Francisco.
Feature Name | ID | Class | County | State | Latitude | Longitude | Ele(ft)* | Map** | BGN Date | Entry Date
City of San Francisco | 2411786 | Civil | San Francisco | CA | 374642N | 1222633W | 256 | San Francisco North | - | 11-MAR-2008
City of South San Francisco | 2411942 | Civil | San Mateo | CA | 373910N | 1222036W | 0 | Hunters Point | - | 11-MAR-2008
San Francisco | 273473 | Civil | Los Angeles | CA | 342425N | 1183643W | 1440 | Newhall | - | 19-JAN-1981
San Francisco County | 277302 | Civil | San Francisco | CA | 374642N | 1222633W | 256 | San Francisco North | - | 01-FEB-1993
San Francisco De Las Llagas | 234626 | Civil | Santa Clara | CA | 370427N | 1213722W | 315 | Gilroy | - | 19-JAN-1981
San Francisco Division | 1935284 | Civil | San Francisco | CA | 374642N | 1222633W | 256 | San Francisco North | - | 26-SEP-2001
San Francisco Mining District | 850949 | Civil | White Pine | NV | 392430N | 1145133W | 7034 | McGill | - | 01-MAY-1990
San Francisco Subbarrio | 2415992 | Civil | Humacao | PR | 180911N | 0654921W | 59 | Humacao | - | 21-MAR-2008
San Francisco Subbarrio | 2415993 | Civil | San Juan | PR | 182757N | 0660647W | 36 | San Juan | - | 21-MAR-2008
South San Francisco Division | 1935322 | Civil | San Mateo | CA | 373842N | 1222443W | 13 | San Francisco South | - | 26-SEP-2001
Township of San Francisco | 665551 | Civil | Carver | MN | 444133N | 0934235W | 889 | Jordan West | - | 01-SEP-1995
Wake Island | 1802700 | Civil | Wake Island | UM | 191655N | 1663901E | 30 | Unknown | - | 06-OCT-1998
Characters
Another issue to consider is the handling of non-ASCII characters. Because English text can be written
entirely in ASCII, many datasets are stored in ASCII format (7-bit chars). This can be
problematic if the dataset contains place names with Latin diacritic characters, such as place names from
the US territory of Puerto Rico, or other names whose official spelling is not native to US English.
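When the encoding of a file is not documented, a common fallback is to try UTF-8 first and then Latin-1. Records can then be matched across datasets on a key with the diacritics removed, while the official spelling is kept for storage. The sketch below illustrates both steps; the function names are my own, and the fallback order is an assumption that should be checked against each dataset's documentation.

```python
import unicodedata


def decode_bytes(raw):
    """Decode as UTF-8 if valid, otherwise fall back to Latin-1 (which never fails)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def ascii_fold(text):
    """Return an ASCII-only matching key with diacritics removed."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


name = decode_bytes(b"Mayag\xfcez")   # Latin-1 bytes for a Puerto Rico place name
match_key = ascii_fold(name)
```

The folded key is lossy, so it is suited to matching records, not to display. The database column itself should use a Unicode-capable encoding so the original spelling survives the import.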