Dexter - Missouri Census Data Center

Dexter
The Missouri Census Data Center’s Data Extraction Utility
John Blodgett: OSEDA, University of Missouri
Rev.14May2007, jgb
What Is Dexter?
A web utility for performing simple data
queries, or extracts.
An integral part of the MCDC’s Uexplore-based data exploration/access system.
Written in SAS© to access data stored in
SAS datasets but requires no knowledge
of nor access to SAS.
Who Uses Dexter?
Anyone interested in accessing the MCDC data
archive, especially anyone who wants to directly
access and manipulate the data.
Not (directly) intended for the very casual data
user. Has a small but non-trivial learning curve.
Understanding the mechanics of Dexter is easy
compared to understanding the data to be
extracted.
Dexter’s Role Within Uexplore
Dexter accepts parameters that identify a
database file/table from which data are to be
extracted.
Uexplore provides the navigation tools to help
locate and understand the content of datasets.
Uexplore hyperlinks actually invoke uex2dex,
the dexter preprocessor, which in turn invokes
dexter.
Uexplore Page With Hyperlinks
The URL Used to Invoke Dexter
On the previous screen the dataset name
(usccflows.sas7bdat) is a hyperlink. The URL
associated with it is:
http://mcdc2.missouri.edu/cgi-bin/broker?_PROGRAM=websas.uex2dex.sas&_SERVICE=appdev9&path=/pub/data/mig2000&dset=usccflows&view=0
It calls a program named uex2dex, written in SAS, and
passes parms to ID the data table to be queried.
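To see which parms the URL carries, its query string can be taken apart with Python's standard urllib.parse. This is only a sketch for reading the URL, not part of Dexter. The URL is the one shown above, joined onto one line, with the path segment written as cgi-bin (our assumption); the variable names are ours.

```python
from urllib.parse import urlsplit, parse_qs

# The uex2dex URL from the slide, joined onto one line.
# Writing the path segment as "cgi-bin" is an assumption.
url = ("http://mcdc2.missouri.edu/cgi-bin/broker"
       "?_PROGRAM=websas.uex2dex.sas&_SERVICE=appdev9"
       "&path=/pub/data/mig2000&dset=usccflows&view=0")

# parse_qs returns a list of values per parameter; keep the first of each.
query = urlsplit(url).query
params = {name: values[0] for name, values in parse_qs(query).items()}

print(params["_PROGRAM"])  # the SAS program being called
print(params["path"])      # directory (filetype) holding the data table
print(params["dset"])      # name of the data table to be queried
```

The path and dset values together identify the data table: the mig2000 directory and the usccflows dataset.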
Dexter and Census Data
Dexter doesn’t really know much about the
datasets from which it extracts data.
It is not American FactFinder. It is just a
generic extraction tool.
It uses only very basic metadata tools.
Other tools must be used to assist users in
navigating the database.
Dexter and the MCDC Data
Archive
Technically, there is nothing inherent in Dexter
that ties it to this archive.
In practice, however, the collection of public data
files that we call the “MCDC Data Archive” is
what Dexter was created for.
Very probably the only reason you’re
reading this is that you want to access
something in that archive.
How Do You Invoke Dexter?
Most people will start at the uexplore home
page http://mcdc.missouri.edu/applications/uexplore.shtml
You navigate the data collection by choosing
“filetype” directories and at some point (…yada
yada yada) you wind up selecting (clicking on) a
file that is a data table.
Clicking on the data table invokes the uex2dex
preprocessor. You fill out the form which uex2dex
generates and click on an “Extract Data” button to
actually invoke Dexter.
Accessing Uexplore
(Home Page)
From the MCDC home page (or any page with the navy
blue navigation bar) click on “MCDC Data Archive”.
Or enter the URL:
http://mcdc2.missouri.edu/applications/uexplore.shtml
Choose Major Category
(from the links in teal box)
Scroll Within the Filetype
Descriptions to Find the Type
(mig2000)
Click on the Filetype Name
(links to uexplore for that directory/filetype)
In this case we want to click on the mig2000
filetype. The text tells us what kind of data we
can expect to find in this directory.
Uexplore Page - mig2000 Filetype
This page is all about hyperlinks (all the blue text). Before
proceeding to the Dexter-invocation links we want to
back up and look at the data archive structure.
(Back to)
The Uexplore/Dexter Home Page
The Archive Directory
(on the Uexplore/Dexter home page)
The teal box contains links to 8 major data
categories (2000 Census thru Compendia)
The rest of the page consists mostly of
descriptions of and hyperlinks to the
archive’s data categories (which we refer
to as filetypes.)
Filetypes within the major categories are
sorted in descending order of what we
think will be their popularity.
Sf32000x is our most popular filetype.
What’s In the Archive?
Very important question. But not the focus
of this tutorial. Some day we’ll do a
separate, long tutorial just on that topic.
Not all filetypes are created equal. We
spend 90% of our resources on maybe
10% of our data directories.
Filetypes that are in bold are the MCDC
“house specialties”.
The Data Archive – General Info
We keep the data table files (the things Dexter
accesses) in the same directories along with other
related files (metadata, spreadsheets, csv files,
Readme.html files, etc.)
Each filetype directory has a special Tools subdirectory
where we keep program code and other tool modules
related to the data.
Subdirectories & Files starting with uppercase letters are
listed first and are usually worth looking at.
Dexter-accessible table files (“SAS datasets”) have
extensions of sas7bdat or sas7bvew.
Exercise
The Bureau of Economic Analysis
disseminates its REIS data with key
economic indicators for US geography
down to the county level.
Locate the filetype corresponding to this
data collection and navigate to the
directory page.
What’s the major category?
Uexplore Data Directory Page
What you see when you click on the beareis link on the Uexplore
home page. It displays a list of files within the directory. The
“File” column entries are hyperlinks. With a few exceptions the
files are displayed in alphabetical order.
Datasets.html is a special file providing enhanced navigation of
the data files in this dir. It displays just the data-table files, but
in a more logical order and with additional metadata.
Datasets.html page
Datasets.html Columns
The Name column is also a link to uex2dex /
dexter.
Label is a short description of the dataset.
#Rows (# of observations) and #Cols (# of
columns/variables) are taken from the datasets
metadata set, as are the Geographic Universe
and Units.
Link to Details is the most important column.
Universe and Units
The majority of datasets in the archive contain
summary data for geographic areas. For
example, a dataset in the popests directory
might contain the latest estimates for all counties
in the state of Missouri. The geographic
universe is Missouri, and the units are counties.
When we have many datasets in a directory it’s
usually because we have many different
combinations of universe and units.
Common Universes
Missouri (the state of) is by far the most common
universe for the MCDC archive.
United States is second – we have quite a
number of national datasets.
Illinois and Kansas are also very common since
we routinely download and convert census files
for these key neighbor states.
A common sort order for files on Datasets.html
pages is Missouri files first, then US, then IL/KS
and then other states.
Rows & Columns
The rows of the data tables are typically
geographic entities: a state, a county, a city, etc.
Most of the columns in the data tables are
summary stats for the entity: e.g. the 2000 pop
count, the latest estimated pop, the change and
percent change, etc.
Other columns (“variables”) are identifiers with
names such as sumlev, geocode and areaname.
Numeric vs. Character Variables
SAS© stores data as character strings or as
numerics.
We store all identifiers (geographic codes, etc)
as character strings even if they are made up of
numeric digits.
So the value of the state code for CT is “09”, not
9. The leading “0” matters.
Unfortunately, Excel ignores the distinction when
importing csv files.
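The leading-zero problem is easy to reproduce outside Excel. The sketch below uses Python's standard csv module on a made-up two-row extract (the column names and rows are hypothetical) to show that the code survives as long as it is kept as text, and how to restore it if it has already been turned into a number.

```python
import csv
import io

# Hypothetical two-row csv extract; real Dexter output has more columns.
csv_text = "state,areaname\n09,Connecticut\n29,Missouri\n"

# csv.DictReader leaves every field as a string, so "09" stays "09".
rows = list(csv.DictReader(io.StringIO(csv_text)))
codes_as_text = [row["state"] for row in rows]

# Converting to a number is what loses the leading zero.
codes_as_numbers = [int(code) for code in codes_as_text]

# If a code has already become a number, zfill restores the 2-char width.
restored = [str(number).zfill(2) for number in codes_as_numbers]

print(codes_as_text)     # ['09', '29']
print(codes_as_numbers)  # [9, 29]
print(restored)          # ['09', '29']
```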
Dataset Naming Conventions
All filetype names are 8 characters or less.
Dataset names were limited to 8 characters by the
software until recently.
The first characters of the dataset name often
correspond to the universe – e.g. “mo”, “il”, “us”.
The geo units are often part of the ds-name – e.g.
“motracts”, “uszips”.
For time series data the name usually ends with a
time indicator – e.g. “uscom03” contains data thru
2003.
Variable Naming Conventions
Not as rigorously applied as we might like, esp.
for older datasets (conventions used for 1980
datasets differ a little from 2K and 1990 sets, for
example)
Certain names appear on many datasets and
are consistent. These are mostly identifier
variables, the ones used in creating filters and
for merging data from different files.
Consistency With Census Bureau
Data Dictionary Names
The Bureau often distributes data dictionary files
with their data that include suggested names for
the fields.
Their name for the field (variable) containing the
name of the geographic area being summarized
is ANPSADPI. We decided to go with
AreaName instead.
But in most cases we try to use the same name
as in the data dictionary.
Common ID Variables
SumLev: Geographic summary level
codes as used in 2K census. (3-char)
State: 2-char state FIPS code.
County: 5-char county FIPS code, incl. the
state.
Geocode: A composite code to id a
geographic area. E.g. the value for a
census tract might be “29019-0010.00”.
AreaName: Name of the area.
Common ID Variables (cont)
Tract: census tract in tttt.ss format, always
7 characters with leading 0s and 00
suffixes. E.g. “0012.00” .
Esriid: Similar to geocode but intended to
use as a key for linking to shape files from
ESRI (the ArcInfo people). When
geocode=“29019-0010.00” the value of
esriid=“29019001000”.
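Judging from the example values above, a tract-level geocode looks like the county code and the tract code joined by a hyphen, and esriid looks like the geocode with the hyphen and decimal point removed. The sketch below encodes that reading. Both helper functions are hypothetical, and the rule is inferred from this one example; other geographic levels may be coded differently.

```python
def tract_geocode(county: str, tract: str) -> str:
    """Join a 5-char county code and a 7-char tract code into a geocode."""
    return f"{county}-{tract}"


def geocode_to_esriid(geocode: str) -> str:
    """Drop the hyphen and the decimal point from a tract-level geocode."""
    return geocode.replace("-", "").replace(".", "")


geocode = tract_geocode("29019", "0010.00")
print(geocode)                     # 29019-0010.00
print(geocode_to_esriid(geocode))  # 29019001000
```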
SAS Formats
Some variables have custom formats
associated with them, which cause them to
display a name instead of their actual value.
E.g. the variable County may have a value of
“29019” but displays as “Boone MO” using the
format. Most Dexter output has the formatted
values.
Click the “View qmeta Metadata report” option at
the end of Section II on the Dexter form to see
which variables have formats associated.
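A format behaves roughly like a lookup table from stored value to displayed label. The sketch below imitates that idea in Python with a dictionary. It is not how SAS implements formats; the display helper is hypothetical, and falling back to the raw value when no label exists is our assumption for the sketch.

```python
# Hypothetical stand-in for a format: stored value -> displayed label.
county_format = {"29019": "Boone MO"}


def display(value: str, fmt: dict) -> str:
    """Return the formatted label, or the raw value when no label exists."""
    return fmt.get(value, value)


print(display("29019", county_format))  # Boone MO
print(display("99999", county_format))  # 99999
```

The stored value is still “29019”, which is what a filter has to reference.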
More About the MCDC Data
Archive
http://mcdc2.missouri.edu/tutorials/mcdc_data_archive.ppt
Details Page
We get here by clicking on the Details link on the Datasets.html page.
Lots of content here, but it will vary.
The Key variables section is often extremely useful when doing filters.
Note the direct link to Dexter under Access the dataset near the bottom.
Increase Text Size to Read Fine Print
Exercise – Navigate to Dataset
Earlier we were looking at datasets in the 2000
Census category, filetype mig2000.
Go to the Uexplore home page and navigate to
this filetype.
Use the Datasets.html page to display the
datasets within the directory.
Find the row for the usccflows data table and
click on the Details link for this table.
From the Details page click on the keyvals link
for the variable State.
Key Variables Report: State
Tells you that
the variable
State has a
value of 01 (for
“Alabama”) in
22137 rows of
this dataset.
This can be
very helpful
when doing a
data filter in
Dexter.
Finally…
Time to See Dexter
Dexter Input Page (Top)
Sec. I. Output
Format(s): csv
file (into Excel)
most common.
Sec. II is where
the work is. Only
2 of 5 rows
shown here.
User fills out the
entire form before
using Extract Data
button to invoke
Dexter.
Dexter Section II
Filters
“A filter is a logical condition that references values of columns
within a row. For each row, the condition is evaluated and, if
true, the row is selected for output. (If not, the row is omitted,
or "filtered".) To keep all rows, just skip this section. The filter
being created here can consist of up to 5 logical segments,
each referencing a data set Variable, a relational Operator,
and a data Value (or values) -- constants that the user
must type in. The segments are evaluated as true or false.
Logical operators (which default to And and appear between
the segment specification rows) relate the segments when
more than one is specified, creating a compound logical
condition.”
If this explanation makes sense to you then you are going to
have an easy time with Dexter. If not, follow through the
examples and then try reading it again.
Example of Defining a Filter:
What We Want
Assuming we are running dexter to access the
mig2000.usccflows dataset we want to select
only those rows that:
– have Missouri as the anchor state, and
– have at least 100 gross flows.
We’ll just assume you’ve read the descriptions
and have some clue regarding what an anchor
state and a gross flow are. (People interested
in population migration would be likely to know
this.)
Select Variable for Filter
Click on the Variable/Column drop-down
menu in the 1st row and select State.
Select Comparison Operator
Select “Equal to” as the Operator from drop down
menu in the middle column.
Enter Value to Complete Row
Remember the Key Values report showing all the values for the variable
State? If you did not know the code for Missouri you could find it there.
What We Have So Far
We have created a logical condition that
can be evaluated for each row of the
dataset:
State = ’29’
According to the key values report for
State we know that this condition is true
for 38,316 rows in the dataset. The
filter we are building will select just
those 38,316 rows out of the 1.1+
million in the full dataset.
Adding a Second Condition
But we do not want all the cases pertaining to
Missouri as the anchor state. We only want
those where we have at least 100 gross flows
(whatever those are).
So we need to fill out a second row, adding
this condition. We select GrossMig as the
variable, Greater Than or Equal To as the
Operator and enter 100 in the Value field.
We leave the logical operator radio button set
to “And” to indicate that this is an additional
necessary condition.
The Completed Filter
You are now ready to scroll down to Section III.
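For readers who think in code, the two-row filter amounts to a test applied to each row. The sketch below expresses it in Python over a few made-up rows. Only the variable names State and GrossMig come from the example; the data values are hypothetical, and Dexter itself generates SAS, not Python.

```python
# Hypothetical rows; only the variable names come from the example.
rows = [
    {"State": "29", "GrossMig": 250},
    {"State": "29", "GrossMig": 40},
    {"State": "17", "GrossMig": 900},
]


def keep(row: dict) -> bool:
    """State = '29' And GrossMig >= 100: the two filter rows joined by And."""
    return row["State"] == "29" and row["GrossMig"] >= 100


# Each row is evaluated; rows where the condition is false are filtered out.
selected = [row for row in rows if keep(row)]
print(selected)  # [{'State': '29', 'GrossMig': 250}]
```

Note that State is compared with the character string "29", not the number 29, in line with the archive's convention for identifiers.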
Section III: Choose Variables
Conceptually simple section:
just select the variables you
want on your output from
scrollable (if needed) menu
lists.
Identifiers (character type
variables) are listed separately
from numerics. Important
MCDC Data Archive
convention.
Typing names instead of
selecting is possible but not
recommended.
Here we select all variables
except State.
Section IV: Title & Sort Order
Entirely optional section,
typically not used.
Title value is used as
report title if you asked
for one, which we did
not in this example.
Sort specs are handy.
Note use of minus sign
(hyphen) to indicate a
descending sort.
Another Extract Data
button to use to run
query.
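The minus-sign convention for descending sorts can be imitated in a few lines of Python. This is a sketch under assumptions: we treat the sort spec as variable names separated by spaces, and sort_rows and the sample rows are hypothetical. Only the meaning of the leading minus sign comes from the form.

```python
def sort_rows(rows, spec):
    """Sort rows by a space-separated spec; a leading '-' means descending."""
    result = list(rows)
    # Apply the keys from last to first. Python's sort is stable, so the
    # first key in the spec ends up as the primary sort key.
    for token in reversed(spec.split()):
        descending = token.startswith("-")
        name = token.lstrip("-")
        result.sort(key=lambda row: row[name], reverse=descending)
    return result


# Hypothetical rows to sort.
rows = [
    {"State": "29", "GrossMig": 100},
    {"State": "17", "GrossMig": 50},
    {"State": "29", "GrossMig": 300},
    {"State": "17", "GrossMig": 70},
]

for row in sort_rows(rows, "State -GrossMig"):
    print(row["State"], row["GrossMig"])
```

Here the rows come out grouped by State in ascending order, with the largest GrossMig first within each State.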
Dexter Output Page
The first output you
see is this results
“index” page.
Always a link to a
Summary Log page
Additional links
depend on output
formats requested.
Dexter Summary Log
This file always
generated. Important
for documenting the
query.
Indicates what file(s),
when run, as well as
any filter and the
variables kept.
Output directory
details can usually be
ignored.
Select Output File(s)
Click on Delimited File Link
What happens when
you click on this file
depends on how your
browser is configured.
The file referenced
has a .csv extension
which IE usually
associates with the
Excel plugin.
Clicking this link will
typically invoke Excel.
Viewing .csv Output in Excel
The csv file is
read into Excel.
Rows 1 & 2
have names &
labels.
Other rows
contain the
selected data.
Note sort order.
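If you read the csv file with a program instead of Excel, the two header rows need to be handled separately. The sketch below does this with Python's standard csv module on a made-up extract. The column names, labels and values are hypothetical; only the layout (names in row 1, labels in row 2, data after that) comes from the slide.

```python
import csv
import io

# Hypothetical extract: row 1 holds variable names, row 2 holds labels.
csv_text = (
    "county,GrossMig\n"
    "County code,Gross migration\n"
    "29019,250\n"
    "29189,400\n"
)

reader = csv.reader(io.StringIO(csv_text))
names = next(reader)   # row 1: variable names
labels = next(reader)  # row 2: variable labels

# Every remaining row is data; values are kept as text.
data = [dict(zip(names, record)) for record in reader]
label_for = dict(zip(names, labels))

print(label_for["GrossMig"])  # Gross migration
print(data[0])                # {'county': '29019', 'GrossMig': '250'}
```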
Some Key Points So Far
Navigation tools such as the uexplore home
page, the uexplore directory navigator and
Datasets.html reference pages are used to
make accessing data with Dexter easier.
You get to select rows (“filter”) and columns
as well as the format(s) of your extracted
data.
Filtering often requires knowledge of code
values. These can sometimes be accessed
from the Key Values reports on the Details
page referenced by a Datasets.html page.
The query generated is summarized on a
Summary.log page.
Pop Quiz
1. Can Dexter be used to access an xls file?
2. How are the files sorted on a directory page
displayed by uexplore?
3. What does the uex2dex interface app do?
4. What is the fastest way to tell how many rows
were selected by your query?
5. Which of the 5 sections of the Dexter query
form must be filled out to have a valid
request?
6. What’s a filetype? What does it mean when
one is displayed in bold on the Uexplore home
(Archive Directory) page?
Sample Query 2:
What We Want
We want data from the 2000 Census, Summary File 3
regarding poverty in Missouri – in cities and counties.
We want the number and the % of poor persons, as
well as the median household income.
We only want the data for cities of at least 5000
persons, but for all counties and for the state as a
whole.
We want output as an HTML file sorted by the type of
geography (state, county, city) and then by descending
poverty rate.
What You Need to Know
You need to know where these kinds of data
are stored. It is 2000 census data, but where
among all those different summary files?
Read the brief descriptions on the uexplore
home page. The sf32000 filetype looks good,
but it turns out that it is too big. The standard
extract version, sf32000x, has what we need.
An alternate way by which users may arrive
here is via links on the MCDC Demographic
Profile reports.
A Demographic Profile Report
A link at the bottom of this report page invokes Dexter with the appropriate
dataset selected. Follow the link (in title of this page) and try it.
(Back on uexplore home page)
Click on sf32000x to Start
Descriptions with links from the uexplore home
page.
The sf32000x Directory
(As seen by uexplore)
Subdirectories &
files with upcased
first letters are
shown first.
Index.html,
Readme.html and,
of course,
Datasets.html are
required reading
(browsing).
Files are in
alphabetical (not
logical) order.
(sf32000x) Readme.html
The Datasets.html Page
(for the sf32000x filetype)
Details Page -- sf32000x.moi
Lots of info here. Most important is perhaps the Key variables
link for variable SumLev (geographic summary level).
Key Variables Report for SumLev
(sf32000x.moi)
Filters Based on SumLev
Var      Operator   Value         Results
SumLev   Equals     040           State Level Summary (only 1 row selected)
SumLev   Equals     140           Census Tract Summaries – 1320 rows selected
SumLev   In List    040:050:160   1 State level, 115 County level & 972 Place level rows
Sample Query 2: What We Want
(Repeated in case you forgot)
We want data from the 2000 Census, Summary File
3 regarding poverty in Missouri cities and counties.
We want the number and the % of poor persons, as
well as the median household income.
We only want the data for cities of at least 5000
persons, but for all counties and for the state.
We want output as an HTML file sorted by the type
of geography (state, county, city) and then by
descending poverty rate.
A Complex Filter
The Filter Explained
There are 2 logical parts to the filter:
1. SumLev In ('040','050')
2. SumLev = '160' and TotPop >= 5000
The parentheses checkboxes are used
to group the 2nd & 3rd lines. The and
between lines 2 and 3 is executed before
the or between lines 1 and 2.
The Filter Explained, cont.
The SAS© code generated by these menu
choices:
where sumlev in ('040','050') or
(sumlev='160' and totpop >= 5000);
The “in” operator (called “In List” on Operator
pull-down menu) allows specifying that the value
of a variable should be one of a list of values.
Those values are entered separated by colons in
the Value column of the filter specs form.
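The same compound condition can be written out in Python to check how the grouping works. This is a sketch, not what Dexter runs: parse_in_list and keep are hypothetical helpers, and the sample rows are made up. The colon-separated value 040:050 is what would be typed in the Value column for the In List operator.

```python
def parse_in_list(value: str) -> list:
    """Split a colon-separated In List value such as '040:050'."""
    return value.split(":")


def keep(row: dict) -> bool:
    """sumlev in ('040','050') or (sumlev = '160' and totpop >= 5000)."""
    return row["sumlev"] in parse_in_list("040:050") or (
        row["sumlev"] == "160" and row["totpop"] >= 5000
    )


# Hypothetical rows: one state, one county, two places, one tract.
rows = [
    {"sumlev": "040", "totpop": 1000},
    {"sumlev": "050", "totpop": 300},
    {"sumlev": "160", "totpop": 7500},
    {"sumlev": "160", "totpop": 1200},
    {"sumlev": "140", "totpop": 6000},
]

selected = [row for row in rows if keep(row)]
print(len(selected))  # 3
```

The state and county rows are kept regardless of population, the place row is kept only when its population is at least 5000, and the tract row is dropped.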
Completing the Query: Parts 3 & 4
HTML Output
We see that
Pemiscot has
the highest
poverty rate of
any county.
How do we
know this?
Why don’t we
see any data
for cities?
Exercise
Access the same dataset as in the example:
sf32000x.moi
Select census tract summaries in Greene co…
… with a poverty rate of at least 10%.
Keep all identifiers necessary to identify the tract,
and all variables related to poverty.
Generate a csv file and load it into a spreadsheet
(probably Excel).
Exercise 2
Repeat the previous exercise except do it
for all counties (instead of census tracts) in
the states of Arkansas and Oklahoma.
Sort the results by descending poverty
rate and generate output in pdf format as
well as a csv file.
Hint: A good place to start is with the
Datasets.html page.
Begin Summary File
Processing Section
Advanced section that can be
skipped by many users. But note
that AFF can be used instead to
access most such data.
Accessing Summary (Tape) Files
The Census Bureau creates very large table-based
summary files for each census since 1970.
The MCDC has a good collection of such files
for ’80, a few for ’90 and many for 2k.
Filetype names begin “stf” or “sf” (the “t” was
dropped in 2000.)
E.g. stf803 for 1980 Summary Tape File 3,
sf12000 for 2000 Summary File 1.
Follow links off Census section of uexplore
home page.
Getting Started with S(T)Fs
If you are new to using Census data and/or summary
files we highly recommend that you use the American
FactFinder application to become familiar with these
files.
From the AFF page:
Under “Getting Detailed Data” follow the links to “About the Data”
and then to “Data Sets”
Experiment/practice locating and extracting tables for geographic
areas of interest.
Use the Census 2000 Summary File 3 (SF3) data set and specify
you want “Detailed Tables”.
Make use of the “by subject” & “by keyword” tabs to select tables.
Exercise – Use AFF to Access
2000 Summary File 3
With Census 2000-SF3 chosen, use the Select
Geography step to choose the state of Missouri
and Boone county.
Under Select Tables use “by subject” tab and
search for tables related to poverty.
Find a table that has data on # persons below
50% of poverty level.
Display the relevant tables for the 2 geographic
areas selected.
When To Use Uexplore/Dexter Instead
In most cases, for most users, AFF will be the
better, easier-to-use tool for accessing SF’s.
Uex/Dex is useful for users who know what they
are looking for and may want more control over
filtering or output format.
The geographic summary unit may not be
available under AFF (e.g. RPC’s in Mo.)
The SF may not be available under AFF (e.g.
1980 STF3).
Summary Files
Set of 4 SF’s for each decade.
Summary Files 1 & 2 based on short form, 3
& 4 based on long form.
Summary Files 1 and 3 most widely used,
especially 3.
Within numbered SF’s there are lettered
subfiles, e.g. Summary File 3B or Summary
File 1C. These are based on geographic
coverage. C files, for example, are national
files, while A files are for individual states.
MCDC SF Datasets
These are “fat” files with lots of variables.
Rows correspond to geographic entities.
Character-type variables ID the entity
being summarized, numeric variables are
primarily the tabulated summary items.
Metadata standards vary over time.
Data dictionaries stored in archive.
SF Tables and Variables
A table consists of multiple cells of data.
Each cell is named <T#>i<cell#>, where
– <T#> is the table name, usually a letter &
number.
– i is literally the letter i, standing for “item”.
– <cell#> is the sequential cell # within the table
For example in sf32000 table P5 has 7
cells. The variables are named p5i1,
p5i2,…p5i7.
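The naming rule is mechanical, so the variable names for a whole table can be generated rather than typed. The sketch below does this in Python; table_cells is a hypothetical helper, and lowercasing the table name simply follows the p5i1 style used in the example.

```python
def table_cells(table: str, n_cells: int) -> list:
    """Build the names <T#>i<cell#> for cells 1..n_cells of a table."""
    return [f"{table.lower()}i{cell}" for cell in range(1, n_cells + 1)]


print(table_cells("P5", 7))
# ['p5i1', 'p5i2', 'p5i3', 'p5i4', 'p5i5', 'p5i6', 'p5i7']
```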
Table Types
In 1980 there were just plain tables,
without special prefixes. We used “t” as
the prefix to name the table cells, e.g.
t12i1 was the name of the first cell in Table
12.
In 1990 there were P and H tables.
In 2000 there are P, H, PCT and HCT
tables. (See notes).
Required Reading: Tech Doc
Trying to access a Summary File without first
looking at the technical doc is like going on a
trip without a map. (Only works if you’ve
been there before.)
American FactFinder is the best place to go
to find out what tables have what data – if the
file you want is included in AFF.
A datadict file in the mcdc data archive or
even a paper copy are other options.
What Tables, What Geography
When accessing a Summary File
dataset you should know ahead of time
what tables you want. (AFF may help).
You need to know what geographic
entities are of interest. Many of the SF
datasets will have multiple geographic
levels (e.g. state, county, place) that you
need to specify.
A Summary Level Sequence Chart
can be very helpful.
Access Summary File 3, 2000 Census
Start at uexplore home page and click
on Census/2000.
Click on the sf32000 filetype link.
Check out the SumLevs.html page.
Check out the Readme.html page.
On the Readme page look at the
Uexplore Access link.
This is hardly typical, having this much
metadata & guidance. We wish it were.
Excerpt From uexplore Section of
Readme.html
Sf32000 Query Specs
We want to extract data on the number and
percentage of minority households at the
census tract level for St. Louis City and
County.
Ignore any tracts with fewer than 100 total
households.
Want data in an Excel spreadsheet.
Hard part is knowing what minority means.
Note: St. Louis City (29510) is also a county (equivalent).
Questions for the Query
What dataset? (We assume we know the
directory/filetype.)
What output format?
What geographic areas within the dataset
– how to create the filter.
What variables?
What post-processing in Excel will we
have to do?
The sf32000 Datasets.html page
Which dataset do we want?
We Want the moph Dataset
Because…
The universe is Missouri as needed.
It contains the P and H tables (not PCT or
HCT).
It has “All SF3A levels” of geography,
including census tract as required.
But now we need to see the details.
Note the size of the dataset – 1.3 Gigabytes!
The sf32000.moph Details Page
What We Learn from Details Page
From the Key variables reports for
SumLev and county we know we want
the 140 summary level for counties
29189 and 29510.
We get links to the data dictionary files
with variable names & labels.
We get a Usage Note explaining the
table-cell variable naming conventions.
A link to the Summary Level Sequence
chart.
Sample of a Summary Level
Sequence Chart (Partial)
Specify the Filter
First row selects census tract level summaries.
Second row selects the two counties of
interest.
Choose Columns/Tables
Selecting Tables
(instead of variables)
Only for a small number of special
filetypes. Mostly SF filetypes.
You choose table H10 and the program
translates this into selecting the columns
(variables) named h10i1, h10i2,…h10i17.
Note the scrollbar at right side of Tables
select list. You may have to scroll
horizontally to see this.
Feature was added late in 2004.
Waiting for Results
We get to see this for
about a whole minute.
It takes a while for
Dexter to slog thru all
that data. (A good
reason to avoid
sf32000 datasets
when sf32000x sets
will do.)
Wait for it to finish.
View Results: Summary Log
A brief summary of what
you asked for and what
you got.
286 rows (tracts) with 20
variables (columns).
Note the upcase
functions in the filter. All
character values entered
are upcased and
compared with upcased
database values. Of
course, when the
characters are all digits it
doesn’t matter.
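The effect of the upcased comparison can be imitated as follows. A sketch only: Dexter does this in SAS, matches is a hypothetical helper, and the sample values are made up.

```python
def matches(db_value: str, typed_value: str) -> bool:
    """Compare two character values after upcasing both."""
    return db_value.upper() == typed_value.upper()


print(matches("Boone MO", "boone mo"))  # True
print(matches("29189", "29189"))        # True: digits are unaffected
print(matches("29189", "29510"))        # False
```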
Ready to Access Real Output
Click on Delimited File to access the generated csv file.
The (temporary) URL for the csv file is (for this example):
http://mcdc2.missouri.edu/tmpscratch/11JUL05_00021.dexter/xtract.csv
This temporary directory and file live for 2 days. You can
copy and paste the URL into an e-mail note and send it to
a colleague or client. Makes it easy to share queries.
Specify Variables by Typing Names
Not generally recommended because it is
error-prone but useful for short lists.
Useful in cases like these where you have
to select an entire table but all you really
want are a few cells.
You have to type the ID variables as well
as the numerics. When dexter detects you
typed something it ignores any selections
from the select lists.
Entering Table Cell Variables
Nothing is selected from Tables list & would not matter if it were.
You can only do this if you understand the table-cell naming
conventions. Instead of saving all 17 data cells in table H10, the
program will now save only the 3 specified cells.
The selection of geocode on Identifiers list is irrelevant.
Typical Result of Clicking on
Delimited File
What Are “Minority” Households
A household is “minority” if the head of the
HH is in a minority category.
Minority for 2000 means you are either:
– Hispanic or Latino, or
– Not white (including multi-racial even if 1 of
those races is white).
So h10i1 - h10i3 is the formula to derive
minority households. We do not need
h10i10 to derive it.
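The post-processing for this query can be sketched in Python instead of Excel. The formula h10i1 - h10i3 is the one given above. Treating h10i1 as the total household count (used for the percentage and for the 100-household cutoff) is our assumption; check the data dictionary before relying on it. The tract labels and counts below are made up.

```python
# Hypothetical tract rows; only the variable names h10i1 and h10i3 are real.
rows = [
    {"tract": "A", "h10i1": 1200, "h10i3": 900},
    {"tract": "B", "h10i1": 80, "h10i3": 60},
    {"tract": "C", "h10i1": 500, "h10i3": 125},
]

results = []
for row in rows:
    total = row["h10i1"]  # assumed to be total households
    if total < 100:
        continue  # ignore tracts with fewer than 100 total households
    minority = row["h10i1"] - row["h10i3"]
    results.append({
        "tract": row["tract"],
        "minority": minority,
        "pct_minority": 100 * minority / total,
    })

for result in results:
    print(result["tract"], result["minority"], result["pct_minority"])
```

Tract B is dropped because it has only 80 households; the other two tracts come out at 25.0 and 75.0 percent minority households.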
End Summary File Access
Section
End of Show
See related tutorials at:
http://mcdc2.missouri.edu/tutorials/dexter2.ppt
http://mcdc2.missouri.edu/tutorials/mcdc_data_archive.ppt