CS2304 Spring 2014 Project 3

Goal

The Bureau of Labor Statistics (BLS) maintains data sets on many different things, from workplace injuries to consumer spending habits, but what you most frequently hear about is employment. Conveniently, much of BLS’s data is available online and can be accessed using HTTP requests (GET/POST). For this project we’re going to write a program that will allow us to access some of those data sets. During this project you’ll obtain some experience with Python dictionaries and work with a few Python standard library packages.

Program Interface and Output

The program takes one command line parameter: an input file name. The input file starts with 2 lines of header text that can be discarded, followed by an arbitrary number of data lines. Each data line in the file represents a data series that may or may not exist and that can potentially be fetched. Here’s a short sample input file with three series:

Industry         SA   Data Type                               Start   End
-----------------------------------------------
Total nonfarm    U    ALL EMPLOYEES, THOUSANDS                1995    1995
Aircraft         U    WOMEN WORKERS, THOUSANDS                2003    2007
Millwork         S    AVERAGE HOURLY EARNINGS, 1982 DOLLARS   2003    2007
While it’s not clear from this example, the columns in the file are tab separated, so you can “split” each line on tab characters (“\t”). You may assume the input file is correctly structured. Each column contains a relevant piece of information needed to fetch a series. The Industry column provides the name of the industry we are examining, while the SA column tells whether we are requesting a data series that is Seasonally Adjusted (S) or Unadjusted (U). There are a few types of available information, which are specified by the Data Type. Finally, Start and End provide the starting and ending year for the data we are requesting.

The Industry and Data Type information come from two additional input files (described later) that are always opened and processed. These input files contain the mapping of human-readable names like “Total nonfarm” to a numerical code that can be used to create a series ID number.

To invoke the program:

[cmdprompt$] python3 blsrequest.py input.txt
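As a rough illustration of the input format described above, the sketch below skips the two header lines and splits each data line on tabs. The function name parse_input_lines and the tuple layout are hypothetical choices made for this example, not part of the assignment's required interface.

```python
def parse_input_lines(lines):
    """Return a list of (industry, sa, data_type, start, end) tuples.

    `lines` is any iterable of text lines, e.g. the result of
    open(filename).readlines(). Hypothetical helper for illustration.
    """
    series = []
    for line in list(lines)[2:]:          # discard the 2 header lines
        if not line.strip():              # ignore blank lines
            continue
        fields = [f.strip() for f in line.rstrip("\r\n").split("\t")]
        industry, sa, data_type, start, end = fields
        series.append((industry, sa, data_type, int(start), int(end)))
    return series


sample = [
    "Industry\tSA\tData Type\tStart\tEnd\n",
    "-----------------------------------------------\n",
    "Total nonfarm\tU\tALL EMPLOYEES, THOUSANDS\t1995\t1995\n",
    "Aircraft\tU\tWOMEN WORKERS, THOUSANDS\t2003\t2007\n",
    "Millwork\tS\tAVERAGE HOURLY EARNINGS, 1982 DOLLARS\t2003\t2007\n",
]

for row in parse_input_lines(sample):
    print(row)
```

Converting Start and End to integers here is a design choice that makes later year arithmetic easier; they need to become strings again when the request dictionary is built.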
The first and second series exist, so we print out the available data for each month of each year, starting with the most recent month/year. The third series doesn’t exist or doesn’t have data for those years, so we print a message letting the user know. Running our program with the input file above should produce the following output, printing each series ID, the human-readable information, and any data found:

Series EEU00000001
Total nonfarm, Unadjusted, ALL EMPLOYEES, THOUSANDS, 1995-1995
Data found:
December 1995 118918
November 1995 118917
October 1995 118665
September 1995 118083
August 1995 117180
July 1995 116926
June 1995 118138
May 1995 117409
April 1995 116674
March 1995 115849
February 1995 115093
January 1995 114435
Series EEU31372102
Aircraft, Unadjusted, WOMEN WORKERS, THOUSANDS, 2003-2007
All of the requested years are not available.
Data Found:
February 2003 41.4
January 2003 43.2
Series EES31243149
Millwork, Adjusted, AVERAGE HOURLY EARNINGS, 1982 DOLLARS, 2003-2007
The series doesn’t exist or have data for given years.
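The second line of each block in the output above can be assembled from the parsed input fields. The following is only a sketch: the names SA_NAMES and describe are hypothetical, and the wording "Adjusted"/"Unadjusted" is taken from the sample output.

```python
# Mapping from the SA column to the wording used in the sample output.
SA_NAMES = {"S": "Adjusted", "U": "Unadjusted"}


def describe(industry, sa, data_type, start, end):
    """Build the human-readable line printed under each series ID."""
    return "{}, {}, {}, {}-{}".format(
        industry, SA_NAMES[sa], data_type, start, end)


print(describe("Total nonfarm", "U", "ALL EMPLOYEES, THOUSANDS", 1995, 1995))
```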
BLS Information

The available data is broken down into series on the BLS website, and each series has a code (a series ID) that you will need to create before trying to make a request. While there are many types of data sets available, we are only going to focus on the “National Employment, Hours, and Earnings (SIC basis)” data series. For our purposes, each series ID looks like: EES10140001. Each piece of the ID has a different meaning, shown in the table below from the BLS website:

Series ID: EES10140001

Positions   Value    Field Name
1-2         EE       Prefix
3           S        Seasonal Adjustment Code
4-9         101400   Industry Code
10-11       01       Data Type Code
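The field layout in the table can be sketched as a simple string concatenation. In this illustration the two lookup dictionaries are hard-coded with codes that can be read off the series IDs shown elsewhere in this handout; in the real program they would be filled from the industry and data type files. The names INDUSTRY_CODES, DATATYPE_CODES, and build_series_id are hypothetical.

```python
# Hard-coded for illustration only; codes read off the sample series IDs.
INDUSTRY_CODES = {
    "Total nonfarm": "000000",
    "Nonmetallic minerals, except fuels": "101400",
    "Aircraft": "313721",
    "Millwork": "312431",
}
DATATYPE_CODES = {
    "ALL EMPLOYEES, THOUSANDS": "01",
    "WOMEN WORKERS, THOUSANDS": "02",
    "AVERAGE HOURLY EARNINGS, 1982 DOLLARS": "49",
}


def build_series_id(industry, sa, data_type):
    """Prefix (EE) + seasonal code (S/U) + 6-digit industry + 2-digit type."""
    return "EE" + sa + INDUSTRY_CODES[industry] + DATATYPE_CODES[data_type]


print(build_series_id("Nonmetallic minerals, except fuels", "S",
                      "ALL EMPLOYEES, THOUSANDS"))
```

Keeping the codes as strings rather than integers preserves leading zeros such as "000000" and "01".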
So, every series ID we’ll create starts with EE, is either seasonally adjusted (S, in this case) or unadjusted (U), has a 6 digit Industry Code, and has a 2 digit Data Type Code. The Industry codes and corresponding human-readable names can be found in ee.industry.txt, while the Data Type codes and human-readable names can be found in ee.datatype.txt. Both files can be found on the course website, or on the BLS website here:

http://download.bls.gov/pub/time.series/ee/ee.industry
http://download.bls.gov/pub/time.series/ee/ee.datatype

Like the input file, these are tab separated. You may assume these files always exist and will be present in the same directory as your Python code. There are only a couple of columns in each that you need to worry about. Looking at the table above and those files, I can say EES10140001 is seasonally adjusted, the industry is “Nonmetallic minerals, except fuels”, and the Data Type is “ALL EMPLOYEES, THOUSANDS”. Conversely, I could use the human-readable names in the files above (like in the input file) to derive the series ID. For more information, you may use these links as reference:

http://www.bls.gov/help/hlpforma.htm#EE
http://www.bls.gov/developers/api_signature.htm
http://www.bls.gov/developers/api_faqs.htm

JSON

Once we have created the series IDs, we need to send them to BLS. JSON is the data format we’ll be using to request and receive series. The series IDs (and other info) must be marshaled into a dictionary and then converted into a JSON string prior to making a request. What’s JSON? From json.org:

“JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999.”

So, JSON is an easy, language-independent way to store complex structures/values in strings and then transfer them between computers, where the information can be unpacked and used. Here’s some more from json.org:

“JSON is built on two structures:
• A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
• An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.”

Here is an example of the type of Python dictionary we need to convert to JSON before transmitting the request to BLS.

bls = {"seriesid": ["EEU10140001", "EES10140002"],
       "startyear": "2010",
       "endyear": "2012"
      }
So we have a dictionary with 3 name/value pairs. The value of “seriesid” is a list of strings (the series IDs), while “startyear” has the value “2010” and “endyear” has the value “2012”. We’ll convert this to a JSON string, which can be used to fetch the two series, EEU10140001 and EES10140002, for the years 2010-2012, assuming the data exists. Conveniently, Python provides functions to convert Python objects to and from the JSON string representation:

import json
# bls is a dictionary, which contains a list and the dates
# bls["seriesid"] is the list, so bls["seriesid"][0] == "EEU10140001"
# and bls["startyear"] is "2010"
#
# When you need to make an HTTP request you can
# turn the Python objects into a JSON string
jsonstr = json.dumps(bls)
# So that dictionary is represented as a string in "jsonstr"
print(jsonstr)
# Prints something like (wrapped here for readability):
# '{"seriesid": ["EEU10140001", "EES10140002"],
#   "startyear": "2010",
#   "endyear": "2012"}'
# And we can turn the string back into Python data types.
# This will be useful when receiving series data.
result = json.loads(jsonstr)
Making HTTP Requests

Now that we know what JSON is, we can look at using JSON to pass information back and forth between our computer and a web server. Remember, BLS’s servers have all of the information we want, and we are going to use HTTP requests (GET/POST) to access the appropriate data. Like before, Python has a package, urllib, to help us out. I’d suggest creating a Request object and using urlopen to transmit and receive the required information. An example is shown below with carefully chosen parameters.

import urllib.request
import json
payload = json.dumps(bls)
# You shouldn't need to change anything here.
# Just change the value of the payload.
r = urllib.request.Request(
    "http://api.bls.gov/publicAPI/v1/timeseries/data/",
    payload.encode("utf-8"),
    {"Content-Type": "application/json"})
# Get the response
result = urllib.request.urlopen(r)
# Get the data returned by the server.
# This is a JSON string with our series information.
resultstr = result.read().decode("utf-8")
The resultstr will be in JSON format, and you’ll be able to use json.loads() to turn the information from the server into Python data types. Once you’ve used json.loads() you’ll have a large nested dictionary and list structure, and you’ll have to experiment some to get to the right values. Take a look at the links in the BLS Information section for examples of the exact format.

Note: BLS limits you to a 10 year span in one HTTP request, so for time spans longer than 10 years (e.g., 1960-1985), you’ll need to make more than one JSON string and HTTP request.

Summary and Hints

This project has a lot of small pieces and some new topics to understand, but it’s actually not that many lines of code if you use the suggestions above. Below, I’ve also broken down what you need to do into different steps:

• Be able to read the information from ee.industry.txt and ee.datatype.txt. I’d put the codes and human-readable names in dictionaries. Then you can convert the human-readable names to the appropriate code and vice versa.
• Once you’ve done that, I’d use those dictionaries to build the required series IDs. You’ll want to put those into a list and add the list to a dictionary.
• Once you’ve created the dictionary with series ID(s) and a start and end year, you can convert that into a JSON string.
• Make the HTTP request using the JSON string. I’d suggest making one request per series for simplicity, but you can combine up to 25 series in a single request; they all share the same start and end date, though.
• Once you’ve received a response, you can convert the JSON payload into Python data structures and iterate through them to print out the required information. You may need to make more than one HTTP request to get all of the data.

Transient Errors

While testing I occasionally received a response that looked like this:

{"status":"REQUEST_FAILED","responseTime":0,"message":["Your request
has failed, please check your input parameters and try your request
again."],"Results":null}
Note that “Results” is null (or None once we convert it to Python data structures) rather than a list of dictionaries with empty “data” lists. The error seems to be transient: the same code could work one minute and give me the error later on. I’ve tried the code on a few computers and encountered the same error using Curl, so I am fairly certain it’s not a local issue or a code issue. If I had caught the error earlier on, I might have changed the project substantially.

Given this situation, since we’ll be testing the project live on the Curator and we don’t control BLS’s resources, we need to have a backup plan. If you receive the error above, or if the BLS servers become otherwise unreachable, you should print the series ID and information like before, and then print the dictionaries you converted to JSON and tried to send to the BLS website.

Series EEU00000001
Total nonfarm, Unadjusted, ALL EMPLOYEES, THOUSANDS, 1995-1995
{'seriesid': ['EEU00000001'], 'endyear': '1995', 'startyear': '1995'}
Series EEU31372102
Aircraft, Unadjusted, WOMEN WORKERS, THOUSANDS, 2003-2007
{'seriesid': ['EEU31372102'], 'endyear': '2007', 'startyear': '2003'}
Series EES31243149
Millwork, Adjusted, AVERAGE HOURLY EARNINGS, 1982 DOLLARS, 2003-2007
{'seriesid': ['EES31243149'], 'endyear': '2007', 'startyear': '2003'}
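One possible way to structure the backup plan is to wrap the request in a function that returns None whenever the server is unreachable or reports the failure shown above, so the caller can fall back to printing the request dictionary. This is only a sketch under stated assumptions: the name fetch_series and the injectable urlopen parameter (useful for testing without a network) are hypothetical, and the only thing checked about the response is whether "Results" is None.

```python
import json
import urllib.request

BLS_URL = "http://api.bls.gov/publicAPI/v1/timeseries/data/"


def fetch_series(request_dict, urlopen=urllib.request.urlopen):
    """Return the parsed response dictionary, or None if the request failed.

    `urlopen` is injectable (an assumption of this sketch) so the function
    can be exercised without contacting BLS.
    """
    req = urllib.request.Request(
        BLS_URL,
        json.dumps(request_dict).encode("utf-8"),
        {"Content-Type": "application/json"})
    try:
        with urlopen(req) as resp:
            parsed = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers urllib.error.URLError and socket failures;
        # ValueError covers a body that is not valid JSON.
        return None
    if not isinstance(parsed, dict) or parsed.get("Results") is None:
        return None                      # e.g. status REQUEST_FAILED
    return parsed
```

A caller could then write something like: response = fetch_series(request_dict); if response is None, print(request_dict) instead of the data.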
Here is an example with a timespan greater than 10 years being broken into multiple requests:

Series EEU00000001
Total nonfarm, Unadjusted, ALL EMPLOYEES, THOUSANDS, 1985-2010
{'seriesid': ['EEU00000001'], 'endyear': '2010', 'startyear': '2000'}
{'seriesid': ['EEU00000001'], 'endyear': '1999', 'startyear': '1989'}
{'seriesid': ['EEU00000001'], 'endyear': '1988', 'startyear': '1985'}
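The split shown above can be reproduced by walking backwards from the end year in steps where the end year minus the start year is at most 10. The sketch below is one way to do that; the names split_years and build_requests are hypothetical, and the chunk boundaries are inferred from the example output rather than from a stated rule.

```python
def split_years(start, end, span=10):
    """Split [start, end] into (lo, hi) chunks, most recent first,
    with hi - lo <= span in every chunk."""
    chunks = []
    hi = end
    while hi >= start:
        lo = max(start, hi - span)
        chunks.append((lo, hi))
        hi = lo - 1                      # next chunk ends just before this one
    return chunks


def build_requests(series_id, start, end):
    """One request dictionary per chunk, ready for json.dumps()."""
    return [{"seriesid": [series_id],
             "startyear": str(lo),
             "endyear": str(hi)}
            for lo, hi in split_years(start, end)]


for request in build_requests("EEU00000001", 1985, 2010):
    print(request)
```

Producing the most recent chunk first matches the required output order, which lists the most recent month/year first.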
Submitting Your Work

You will submit a single .py file, containing nothing but the implementation described above. Be sure to conform to any specified function interfaces. Your submission will be executed with a test driver and graded according to how many cases your solution handles correctly. This assignment will be graded automatically. You will be allowed up to ten submissions for this assignment. Test your program thoroughly before submitting it. Make sure that it produces correct results for every test case you can think of. The course policy is that the submission that yields the highest score will be checked. If several submissions are tied for the highest score, the latest of those will be checked. The link to the submit page is located here:

http://curator.cs.vt.edu:8080/2014SpringS04/index.jsp

Pledge: Each of your program submissions must be pledged to conform to the Honor Code requirements for this course. Specifically, you must include the following pledge statement in the submitted file:

# On my honor:
#
# - I have not discussed the Python code in my program with
#   anyone other than my instructor or the teaching assistants
#   assigned to this course.
#
# - I have not used Python code obtained from another student,
#   or any other unauthorized source, either modified or unmodified.
#
# - If any Python code or documentation used in my program
#   was obtained from another source, such as a text book or course
#   notes, that has been clearly noted with a proper citation in
#   the comments of my program.
#
# - I have not designed this program in such a way as to defeat or
#   interfere with the normal operation of the Curator System.
#
# <Student Name>

Failure to include this pledge in a submission is a violation of the Honor Code.