Dexter
The Missouri Census Data Center's Data Extraction Utility
John Blodgett: OSEDA, University of Missouri
Rev. 14May2007, jgb

What Is Dexter?
A web utility for performing simple data queries, or extracts.
An integral part of the MCDC's Uexplore-based data exploration/access system.
Written in SAS® to access data stored in SAS datasets, but requires no knowledge of, nor access to, SAS.

Who Uses Dexter?
Anyone interested in accessing the MCDC data archive, especially anyone who wants to directly access and manipulate the data.
Not (directly) intended for the very casual data user. It has a small but non-trivial learning curve.
Understanding the mechanics of Dexter is easy compared to understanding the data to be extracted.

Dexter's Role Within Uexplore
Dexter accepts parameters that identify a database file/table from which data are to be extracted.
Uexplore provides the navigation tools to help locate and understand the content of datasets.
Uexplore hyperlinks actually invoke uex2dex, the Dexter preprocessor, which in turn invokes Dexter.

Uexplore Page With Hyperlinks

The URL Used to Invoke Dexter
On the previous screen the dataset name (usccflows.sas7bdat) is a hyperlink. The URL associated with it is:
http://mcdc2.missouri.edu/cgi-bin/broker?_PROGRAM=websas.uex2dex.sas&_SERVICE=appdev9&path=/pub/data/mig2000&dset=usccflows&view=0
It calls a program named uex2dex, written in SAS, and passes parameters (such as path and dset) that identify the data table to be queried.

Dexter and Census Data
Dexter doesn't really know much about the datasets from which it extracts data. It is not American FactFinder. It is just a generic extraction tool.
It uses only very basic metadata tools. Other tools must be used to assist users in navigating the database.

Dexter and the MCDC Data Archive
Technically, there is nothing inherent in Dexter that ties it to this archive. In practice, however, the collection of public data files that we call the "MCDC Data Archive" is what Dexter was created for. It is very probable that the only reason you're reading this is that you want to access something in that archive.

How Do You Invoke Dexter?
Most people will start at the uexplore home page:
http://mcdc.missouri.edu/applications/uexplore.shtml
You navigate the data collection by choosing "filetype" directories and at some point (…yada yada yada) you wind up selecting (clicking on) a file that is a data table.
Clicking on the data table invokes the uex2dex preprocessor. You fill out the form which uex2dex generates and click an "Extract Data" button to actually invoke Dexter.

Accessing Uexplore (Home Page)
From the MCDC home page (or any page with the navy blue navigation bar) click on "MCDC Data Archive". Or enter the URL:
http://mcdc2.missouri.edu/applications/uexplore.shtml

Choose Major Category (from the links in the teal box)

Scroll Within the Filetype Descriptions to Find the Type (mig2000)

Click on the Filetype Name (links to uexplore for that directory/filetype)
In this case we want to click on the mig2000 filetype. The text tells us what kind of data we can expect to find in this directory.

Uexplore Page - mig2000 Filetype
This page is all about hyperlinks (all the blue text). Before proceeding to the Dexter-invocation links we want to back up and look at the data archive structure.
(Back to) The Uexplore/Dexter Home Page

The Archive Directory (on the Uexplore/Dexter home page)
The teal box contains links to 8 major data categories (2000 Census through Compendia).
The rest of the page consists mostly of descriptions of, and hyperlinks to, the archive's data categories (which we refer to as filetypes).
Filetypes within the major categories are sorted in descending order of what we think will be their popularity. Sf32000x is our most popular filetype.

What's In the Archive?
A very important question, but not the focus of this tutorial. Some day we'll do a separate, long tutorial just on that topic.
Not all filetypes are created equal. We spend 90% of our resources on maybe 10% of our data directories. Filetypes shown in bold are the MCDC "house specialties".

The Data Archive – General Info
We keep the data table files (the things Dexter accesses) in the same directories along with other related files (metadata, spreadsheets, csv files, Readme.html files, etc.)
Each filetype directory has a special Tools subdirectory where we keep program code and other tool modules related to the data.
Subdirectories and files starting with uppercase letters are listed first and are usually worth looking at.
Dexter-accessible table files ("SAS datasets") have extensions of sas7bdat or sas7bvew.

Exercise
The Bureau of Economic Analysis disseminates its REIS data with key economic indicators for US geography down to the county level. Locate the filetype corresponding to this data collection and navigate to the directory page. What's the major category?

Uexplore Data Directory Page
This is what you see when you click on the beareis link on the Uexplore home page. It displays a list of files within the directory.
The "File" column entries are hyperlinks. With a few exceptions the files are displayed in alphabetical order.
Datasets.html is a special file providing enhanced navigation of the data files in this directory. It displays just the data-table files, but in a more logical order and with additional metadata.

Datasets.html page

Datasets.html Columns
The Name column is also a link to uex2dex / Dexter.
Label is a short description of the dataset.
#Rows (# of observations) and #Cols (# of columns/variables) are taken from the datasets metadata set, as are the Geographic Universe and Units.
Link to Details is the most important column.

Universe and Units
The majority of datasets in the archive contain summary data for geographic areas. For example, a dataset in the popests directory might contain the latest estimates for all counties in the state of Missouri. The geographic universe is Missouri, and the units are counties.
When we have many datasets in a directory it's usually because we have many different combinations of universe and units.

Common Universes
Missouri (the state of) is by far the most common universe for the MCDC archive.
United States is second – we have quite a number of national datasets.
Illinois and Kansas are also very common, since we routinely download and convert census files for these key neighbor states.
A common sort order for files on Datasets.html pages is Missouri files first, then US, then IL/KS, and then other states.

Rows & Columns
The rows of the data tables are typically geographic entities: a state, a county, a city, etc.
Most of the columns in the data tables are summary stats for the entity: e.g. the 2000 pop count, the latest estimated pop, the change and percent change, etc.
Other columns ("variables") are identifiers with names such as sumlev, geocode and areaname.
Numeric vs. Character Variables
SAS® stores data as character strings or as numerics. We store all identifiers (geographic codes, etc.) as character strings even if they are made up of numeric digits.
So the value of the state code for CT is "09", not 9. The leading "0" matters. Unfortunately, Excel ignores the distinction when importing csv files.

Dataset Naming Conventions
All filetype names are 8 characters or less. Dataset names were limited to 8 characters by the software until recently.
The first characters of the dataset name often correspond to the universe – e.g. "mo", "il", "us".
The geo units are often part of the dataset name – e.g. "motracts", "uszips".
For time series data the name usually ends with a time indicator – e.g. "uscom03" contains data through 2003.

Variable Naming Conventions
Not as rigorously applied as we might like, especially for older datasets (conventions used for the 1980 datasets differ a little from the 2000 and 1990 sets, for example).
Certain names appear on many datasets and are consistent. These are mostly identifier variables, the ones used in creating filters and for merging data from different files.

Consistency With Census Bureau Data Dictionary Names
The Bureau often distributes data dictionary files with their data that include suggested names for the fields. Their name for the field (variable) containing the name of the geographic area being summarized is ANPSADPI. We decided to go with AreaName instead. But in most cases we try to use the same name as in the data dictionary.

Common ID Variables
SumLev: Geographic summary level code as used in the 2000 census (3-char).
State: 2-char state FIPS code.
County: 5-char county FIPS code, including the state.
Geocode: A composite code to ID a geographic area. E.g. the value for a census tract might be "29019-0010.00".
AreaName: Name of the area.

Common ID Variables (cont.)
Tract: census tract in tttt.ss format, always 7 characters with leading 0s and 00 suffixes. E.g. "0012.00".
Esriid: Similar to geocode but intended for use as a key for linking to shape files from ESRI (the ArcInfo people). When geocode="29019-0010.00" the value of esriid is "29019001000". (A small SAS sketch at the end of this section illustrates the relationship.)

SAS Formats
Some variables have custom formats associated with them, which cause them to display a name instead of their actual value. E.g. the variable County may have a value of "29019" but displays as "Boone MO" using the format. Most Dexter output has the formatted values.
Click the "View qmeta Metadata report" option at the end of Section II on the Dexter form to see which variables have formats associated.

More About the MCDC Data Archive
http://mcdc2.missouri.edu/tutorials/mcdc_data_archive.ppt

Details Page
We get here by clicking on the Details link on a Datasets.html page. There is lots of content here, and it will vary by dataset.
The Key variables reports are often extremely useful when doing filters.
Note the direct link to Dexter under "Access the dataset" near the bottom. Increase the text size to read the fine print.

Exercise – Navigate to Dataset
Earlier we were looking at datasets in the 2000 Census category, filetype mig2000. Go to the Uexplore home page and navigate to this filetype.
Use the Datasets.html page to display the datasets within the directory. Find the row for the usccflows data table and click on the Details link for this table.
From the Details page click on the keyvals link for the variable State.

Key Variables Report: State
Tells you that the variable State has a value of 01 (for "Alabama") in 22137 rows of this dataset. This can be very helpful when doing a data filter in Dexter.
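To make the relationships among these common ID variables concrete, here is a small SAS sketch. It is illustrative only: the archive's datasets already carry all of these variables, and the values used are the Boone County tract example from the slides above.

    /* Illustrative only: how the common character ID variables relate.            */
    data _null_;
       length geocode $13 state $2 county $5 tract $7 esriid $11;
       geocode = "29019-0010.00";           /* composite geographic code           */
       state   = substr(geocode, 1, 2);     /* "29"    - state FIPS, as character  */
       county  = substr(geocode, 1, 5);     /* "29019" - state + county FIPS       */
       tract   = substr(geocode, 7);        /* "0010.00" - tttt.ss format          */
       esriid  = compress(geocode, "-.");   /* "29019001000" - geocode without     */
                                            /* the "-" and "."                     */
       put geocode= state= county= tract= esriid=;
    run;

Because the codes are stored as character strings, leading zeros (e.g. "09" for CT) survive intact in SAS; it is only on import into Excel that they can be lost.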
Finally… Time to See Dexter

Dexter Input Page (Top)
Sec. I, Output Format(s): a csv file (into Excel) is most common.
Sec. II is where the work is. Only 2 of its 5 rows are shown here.
The user fills out the entire form before using the Extract Data button to invoke Dexter.

Dexter Section II: Filters
"A filter is a logical condition that references values of columns within a row. For each row, the condition is evaluated and, if true, the row is selected for output. (If not, the row is omitted, or 'filtered'.) To keep all rows, just skip this section. The filter being created here can consist of up to 5 logical segments, each referencing a dataset Variable, a relational Operator, and a data Value (or values) -- constants that the user must type in. The segments are evaluated as true or false. Logical operators (which default to And and appear between the segment specification rows) relate the segments when more than one is specified, creating a compound logical condition."
If this explanation makes sense to you then you are going to have an easy time with Dexter. If not, follow through the examples and then try reading it again.

Example of Defining a Filter: What We Want
Assuming we are running Dexter to access the mig2000.usccflows dataset, we want to select only those rows that:
– have Missouri as the anchor state, and
– have at least 100 gross flows.
We'll just assume you've read the descriptions and have some clue regarding what an anchor state and a gross flow are. (People interested in population migration would be likely to know this.)

Select Variable for Filter
Click on the Variable/Column drop-down menu in the 1st row and select State.

Select Comparison Operator
Select "Equal to" as the Operator from the drop-down menu in the middle column.

Enter Value to Complete Row
Remember the Key Values report showing all the values for the variable State? If you did not know the code for Missouri you could find it there.

What We Have So Far
We have created a logical condition that can be evaluated for each row of the dataset: State = '29'.
According to the key values report for State, this condition is true for 38,316 rows in the dataset. The filter we are building will select just those 38,316 rows out of the 1.1+ million in the full dataset.

Adding a Second Condition
But we do not want all the cases pertaining to Missouri as the anchor state. We only want those where we have at least 100 gross flows (whatever those are). So we need to fill out a second row, adding this condition.
We select GrossMig as the variable, Greater Than or Equal To as the Operator, and enter 100 in the Value field. We leave the logical operator radio button set to "And" to indicate that this is an additional necessary condition.

The Completed Filter
You are now ready to scroll down to Section III. (A rough SAS sketch of the completed query appears at the end of this section.)

Section III: Choose Variables
A conceptually simple section: just select the variables you want on your output from the scrollable (if needed) menu lists.
Identifiers (character type variables) are listed separately from numerics. This is an important MCDC Data Archive convention.
Typing names instead of selecting is possible but not recommended.
Here we select all variables except State.

Section IV: Title & Sort Order
An entirely optional, typically unused section.
The Title value is used as the report title if you asked for one, which we did not in this example.
Sort specs are handy. Note the use of a minus sign (hyphen) to indicate a descending sort.
There is another Extract Data button to use to run the query.

Dexter Output Page
The first output you see is this results "index" page.
There is always a link to a Summary Log page. Additional links depend on the output formats requested.
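To recap, the query just built boils down to something like the following SAS. This is an illustrative sketch only: Dexter generates and runs the actual code on the server, the output name "xtract" is made up, and sorting by GrossMig is just one example of the "-varname" descending sort spec mentioned in Section IV.

    /* Illustrative sketch of the query built above (not Dexter's actual code).    */
    proc sort data=mig2000.usccflows out=xtract;
       where state = "29" and grossmig >= 100;   /* the Section II filter          */
       by descending grossmig;                   /* e.g. a "-GrossMig" sort spec   */
    run;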
Dexter Summary Log
This file is always generated and is important for documenting the query. It indicates what file(s) were accessed and when the query was run, as well as any filter and the variables kept.
The output directory details can usually be ignored.

Select Output File(s)

Click on Delimited File Link
What happens when you click on this file depends on how your browser is configured. The file referenced has a .csv extension, which IE usually associates with the Excel plugin. Clicking this link will typically invoke Excel.

Viewing .csv Output in Excel
The csv file is read into Excel. Rows 1 & 2 have names & labels. The other rows contain the selected data. Note the sort order.

Some Key Points So Far
Navigation tools such as the uexplore home page, the uexplore directory navigator and the Datasets.html reference pages are used to make accessing data with Dexter easier.
You get to select rows ("filter") and columns, as well as the format(s) of your extracted data.
Filtering often requires knowledge of code values. These can sometimes be accessed from the Key Values reports on the Details page referenced by a Datasets.html page.
The query generated is summarized on a Summary.log page.

Pop Quiz
1. Can Dexter be used to access an xls file?
2. How are the files sorted on a directory page displayed by uexplore?
3. What does the uex2dex interface app do?
4. What is the fastest way to tell how many rows were selected by your query?
5. Which of the 5 sections of the Dexter query form must be filled out to have a valid request?
6. What's a filetype? What does it mean when one is displayed in bold on the Uexplore home (Archive Directory) page?

Sample Query 2: What We Want
We want data from the 2000 Census, Summary File 3, regarding poverty in Missouri – in cities and counties.
We want the number and the % of poor persons, as well as the median household income.
We only want the data for cities of at least 5000 persons, but for all counties and for the state as a whole.
We want output as an HTML file sorted by the type of geography (state, county, city) and then by descending poverty rate.

What You Need to Know
You need to know where these kinds of data are stored. It is 2000 census data, but where among all those different summary files?
Read the brief descriptions on the uexplore home page. The sf32000 filetype looks good, but it turns out that it is too big. The standard extract version, sf32000x, has what we need.
An alternate way by which users may arrive here is via links on the MCDC Demographic Profile reports.

A Demographic Profile Report
A link at the bottom of this report page invokes Dexter with the appropriate dataset selected. Follow the link (in the title of this page) and try it.

(Back on the uexplore home page) Click on sf32000x to Start
Descriptions with links from the uexplore home page.

The sf32000x Directory (As seen by uexplore)
Subdirectories & files with upcased first letters are shown first. Index.html, Readme.html and, of course, Datasets.html are required reading (browsing).
Files are in alphabetical (not logical) order.

(sf32000x) Readme.html

The Datasets.html Page (for the sf32000x filetype)

Details Page -- sf32000x.moi
Lots of info here. Most important is perhaps the Key variables link for the variable SumLev (geographic summary level).

Key Variables Report for SumLev (sf32000x.moi)

Filters Based on SumLev
Var      Operator   Value         Results
SumLev   Equals     040           State level summary (only 1 row selected)
SumLev   Equals     140           Census tract summaries – 1,320 rows selected
SumLev   In List    040:050:160   1 state level, 115 county level & 972 place level rows
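The three example filters in the table above amount to simple SAS WHERE clauses. The sketch below is illustrative only: "picked" is a made-up output name, and you would use one WHERE statement at a time.

    /* Illustrative only: the "In List" filter from the table above, written as   */
    /* the kind of WHERE clause Dexter builds. The other two filters are shown    */
    /* as comments; use one at a time.                                            */
    data picked;
       set sf32000x.moi;
       where sumlev in ("040", "050", "160");  /* 040=state, 050=county, 160=place */
       /* where sumlev = "040";   state level summary only (1 row)                */
       /* where sumlev = "140";   census tract summaries (1,320 rows)             */
    run;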
Sample Query 2: What We Want (Repeated in case you forgot)
We want data from the 2000 Census, Summary File 3, regarding poverty in Missouri cities and counties.
We want the number and the % of poor persons, as well as the median household income.
We only want the data for cities of at least 5000 persons, but for all counties and for the state.
We want output as an HTML file sorted by the type of geography (state, county, city) and then by descending poverty rate.

A Complex Filter

The Filter Explained
There are 2 logical parts to the filter:
1. SumLev In ('040','050')
2. SumLev = '160' and TotPop >= 5000
The parentheses checkboxes are used to group the 2nd & 3rd lines. The "and" between lines 2 and 3 is evaluated before the "or" between lines 1 and 2.

The Filter Explained, cont.
The SAS® code generated by these menu choices:
where sumlev in ('040','050') or (sumlev='160' and totpop >=5000);
The "in" operator (called "In List" on the Operator pull-down menu) allows specifying that the value of a variable should be one of a list of values. Those values are entered separated by colons in the Value column of the filter specs form.

Completing the Query: Parts 3 & 4

HTML Output
We see that Pemiscot has the highest poverty rate of any county. How do we know this? Why don't we see any data for cities?

Exercise
Access the same dataset as in the example: sf32000x.moi.
Select census tract summaries in Greene county… with a poverty rate of at least 10%.
Keep all identifiers necessary to identify the tract, and all variables related to poverty.
Generate a csv file and load it into a spreadsheet (probably Excel).

Exercise 2
Repeat the previous exercise except do it for all counties (instead of census tracts) in the states of Arkansas and Oklahoma. Sort the results by descending poverty rate and generate output in pdf format as well as a csv file.
Hint: A good place to start is with the Datasets.html page.

Begin Summary File Processing Section
An advanced section that can be skipped by many users. But note that AFF can be used instead to access most such data.

Accessing Summary (Tape) Files
The Census Bureau creates very large table-based summary files, for each census since 1970.
The MCDC has a good collection of such files for '80, a few for '90 and many for 2000.
Filetype names begin with "stf" or "sf" (the "t" was dropped in 2000) – e.g. stf803 for 1980 Summary Tape File 3, sf12000 for 2000 Summary File 1.
Follow the links off the Census section of the uexplore home page.

Getting Started with S(T)Fs
If you are new to using Census data and/or summary files we highly recommend that you use the American FactFinder application to become familiar with these files.
From the AFF page: under "Getting Detailed Data" follow the links to "About the Data" and then to "Data Sets".
Experiment/practice locating and extracting tables for geographic areas of interest. Use the Census 2000 Summary File 3 (SF3) data set and specify you want "Detailed Tables". Make use of the "by subject" & "by keyword" tabs to select tables.

Exercise – Use AFF to Access 2000 Summary File 3
With Census 2000-SF3 chosen, use the Select Geography step to choose the state of Missouri and Boone county.
Under Select Tables use the "by subject" tab and search for tables related to poverty. Find a table that has data on # persons below 50% of poverty level.
Display the relevant tables for the 2 geographic areas selected.
When To Use Uexplore/Dexter Instead
In most cases, for most users, AFF will be the better, easier-to-use tool for accessing SFs.
Uex/Dex is useful for users who know what they are looking for and may want more control over filtering or output format.
The geographic summary unit may not be available under AFF (e.g. RPCs in Mo.).
The SF may not be available under AFF (e.g. 1980 STF3).

Summary Files
There is a set of 4 SFs for each decade. Summary Files 1 & 2 are based on the short form, 3 & 4 on the long form. Summary Files 1 and 3 are the most widely used, especially 3.
Within the numbered SFs there are lettered subfiles, e.g. Summary File 3B or Summary File 1C. These are based on geographic coverage. C files, for example, are national files, while A files are for individual states.

MCDC SF Datasets
These are "fat" files with lots of variables. Rows correspond to geographic entities.
Character-type variables ID the entity being summarized; numeric variables are primarily the tabulated summary items.
Metadata standards vary over time. Data dictionaries are stored in the archive.

SF Tables and Variables
A table consists of multiple cells of data. Each cell is named <T#>i<cell#>, where
– <T#> is the table name, usually a letter & number.
– i is literally the letter i, standing for "item".
– <cell#> is the sequential cell # within the table.
For example, in sf32000 table P5 has 7 cells. The variables are named p5i1, p5i2, … p5i7. (A small sketch of this convention appears at the end of this section.)

Table Types
In 1980 there were just plain tables, without special prefixes. We used "t" as the prefix to name the table cells, e.g. t12i1 was the name of the first cell in Table 12.
In 1990 there were P and H tables. In 2000 there are P, H, PCT and HCT tables. (See notes.)

Required Reading: Tech Doc
Trying to access a Summary File without first looking at the technical documentation is like going on a trip without a map. (It only works if you've been there before.)
American FactFinder is the best place to go to find out which tables have what data – if the file you want is included in AFF.
A datadict file in the MCDC data archive, or even a paper copy, are other options.

What Tables, What Geography
When accessing a Summary File dataset you should know ahead of time what tables you want. (AFF may help.)
You need to know what geographic entities are of interest. Many of the SF datasets have multiple geographic levels (e.g. state, county, place) that you need to specify. A Summary Level Sequence Chart can be very helpful.

Access Summary File 3, 2000 Census
Start at the uexplore home page and click on Census/2000. Click on the sf32000 filetype link.
Check out the SumLevs.html page. Check out the Readme.html page. On the Readme page look at the Uexplore Access link.
This is hardly typical, having this much metadata & guidance. We wish it were.

Excerpt From uexplore Section of Readme.html

Sf32000 Query Specs
We want to extract data on the number and percentage of minority households at the census tract level for St. Louis City and County. Ignore any tracts with fewer than 100 total households. We want the data in an Excel spreadsheet.
The hard part is knowing what "minority" means. Note: St. Louis City (29510) is also a county (equivalent).

Questions for the Query
What dataset? (We assume we know the directory/filetype.)
What output format?
What geographic areas within the dataset – how do we create the filter?
What variables?
What post-processing in Excel will we have to do?

The sf32000 Datasets.html page
• Which dataset do we want?

We Want the moph Dataset Because…
The universe is Missouri, as needed.
It contains the P and H tables (not PCT or HCT).
It has "All SF3A levels" of geography, including census tract as required.
But now we need to see the details. Note the size of the dataset – 1.3 gigabytes!
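Before digging into moph's details, here is a small sketch of the table-cell naming convention described under "SF Tables and Variables" above, using the P5 example. It is illustrative only: "p5only" is a made-up output name, and in practice Dexter's Tables list does this selection for you.

    /* Illustrative only: keeping the 7 cells of table P5 plus a few of the       */
    /* archive's standard ID variables, using the <T#>i<cell#> naming convention. */
    data p5only;
       set sf32000.moph (keep=sumlev county areaname p5i1-p5i7);
       /* p5i1-p5i7 is a SAS numbered range list: p5i1, p5i2, ..., p5i7           */
    run;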
The sf32000.moph Details Page

What We Learn from the Details Page
From the Key variables reports for SumLev and County we know we want the 140 summary level for counties 29189 and 29510.
We get links to the data dictionary files with variable names & labels.
We get a Usage Note explaining the table-cell variable naming conventions.
There is a link to the Summary Level Sequence chart.

Sample of a Summary Level Sequence Chart (Partial)

Specify the Filter
The first row selects census tract level summaries. The second row selects the two counties of interest.

Choose Columns/Tables

Selecting Tables (instead of variables)
Available only for a small number of special filetypes, mostly SF filetypes.
You choose table H10 and the program translates this into selecting the columns (variables) named h10i1, h10i2, … h10i17.
Note the scrollbar at the right side of the Tables select list. You may have to scroll horizontally to see it. This feature was added late in 2004.

Waiting for Results
We get to see this for about a whole minute. It takes a while for Dexter to slog through all that data. (A good reason to avoid sf32000 datasets when sf32000x sets will do.) Wait for it to finish.

View Results: Summary Log
A brief summary of what you asked for and what you got: 286 rows (tracts) with 20 variables (columns).
Note the upcase functions in the filter. All character values entered are upcased and compared with upcased database values. Of course, when the characters are all digits it doesn't matter.

Ready to Access Real Output
Click on Delimited File to access the generated csv file. The (temporary) URL for the csv file is (for this example):
http://mcdc2.missouri.edu/tmpscratch/11JUL05_00021.dexter/xtract.csv
This temporary directory and file live for 2 days. You can copy and paste the URL into an e-mail note and send it to a colleague or client. This makes it easy to share queries.

Specify Variables by Typing Names
Not generally recommended because it is error-prone, but useful for short lists.
Useful in cases like this one, where you have to select an entire table but all you really want are a few cells.
You have to type the ID variables as well as the numerics. When Dexter detects that you typed something it ignores any selections from the select lists.

Entering Table Cell Variables
Nothing is selected from the Tables list, and it would not matter if it were.
You can only do this if you understand the table-cell naming conventions.
Instead of saving all 17 data cells in table H10, the program will now save only the 3 specified cells.
The selection of geocode on the Identifiers list is irrelevant.

Typical Result of Clicking on Delimited File

What Are "Minority" Households?
A household is "minority" if the head of the HH is in a minority category. Minority for 2000 means you are either:
– Hispanic or Latino, or
– not white (including multi-racial, even if one of those races is white).
So h10i1 – h10i3 is the formula to derive minority households. We do not need h10i10 to derive it.

End Summary File Access Section

End of Show
See related tutorials at:
http://mcdc2.missouri.edu/tutorials/dexter2.ppt
http://mcdc2.missouri.edu/tutorials/mcdc_data_archive.ppt