Effective programming practices for economists
12. Sensible project layouts
Hans-Martin von Gaudecker
Department of Economics, Universität Mannheim
Licensed under the Creative Commons Attribution License
At the end of this lecture you are able to . . .
- Reflect on organisational structures for research projects.
- Work with hierarchical builds in Waf.
- Differentiate between different “install locations”, handle
  svn:externals.
- Understand relative merits of absolute and relative paths.
- Construct and use Python packages.
- Pass options to code in different ways.
- Read and write JSON-formatted files.
- Work with the project template provided on the server.
Need some organisation for complex projects
- So far, files were mostly assumed to live in one directory.
- Regardless of type or purpose: Input, output, data, source
  code, results, build scripts . . .
- This does not scale up.
- Need to divide things up over various directories.
- No golden rules, but some underlying principles.
Principles for project organisation
- Group files with a common purpose in a directory.
- Encapsulation – related chunks should be stand-alone.
- Separate sources and output.
- Make it easy to find often-used outputs.
- Make it easy to understand for both humans and computers.
- DRY!
+-project_root
  +-waf
  +-wscript
  +-[bld]
    +-src
    +-out
      +-data
      +-analysis
      +-final
      +-figures
      +-tables
  +-[project documentation]
  +-[project documentation.pdf]
  +-[research_paper.pdf]
  +-[research_pres_30min.pdf]
  +-[research_pres_90min.pdf]
  +-src
    +-wscript
    +-documentation
      +-wscript
      +-conf.py
      +-index.rst
      +-introduction.rst
      +-original_data.rst
      +-data_management.rst
      +-analysis.rst
      +-final.rst
      +-paper.rst
      +-library.rst
      +-models.rst
    +-library
      +-wscript
      +-stata
        +-ado_ext
        +-ado_local
      +-python
      +-etc.
    +-manual_input
      +-wscript
      +-some_figure.pdf
      +-some_table.tex
      +-etc.
    +-models
      +-wscript
      +-baseline.json
      +-robust_unobs_het.json
      +-etc.
    +-original_data
      +-dataset_1
      +-dataset_2
      +-documentation
    +-data_management
      +-wscript
      +-clean_dataset_1.do
      +-clean_dataset_2.do
      +-etc.
    +-analysis
      +-wscript
      +-descriptives.do
      +-regressions_intuition.do
      +-serious_approach.py
      +-etc.
    +-final
      +-wscript
      +-create_tables.py
      +-simple_simulations.py
      +-etc.
    +-paper
      +-wscript
      +-bib (*)
      +-formulas
        +-utility_function.py
        +-budget_constraint.py
        +-etc.
      +-research_paper.tex
      +-research_pres_30min.tex
      +-research_pres_90min.tex
      +-all_tables.tex
      +-all_figures.tex
Project hierarchies: The highest level
+-project_root
  +-waf
  +-wscript
  +-[bld]
    +-src
    +-out
      +-data
      +-analysis
      +-final
      +-figures
      +-tables
  +-[project documentation]
  +-[project documentation.pdf]
  +-[research_paper.pdf]
  +-[research_pres_30min.pdf]
  +-[research_pres_90min.pdf]
  +-src
    +-wscript
    +-documentation
    +-library
    +-manual_input
    +-models
    +-original_data
    +-data_management
    +-analysis
    +-final
    +-paper
Hierarchical builds
- Writing all build instructions in a single wscript file would
  quickly become a mess.
- Add a wscript file to each directory further down the hierarchy;
  the only required content is:

      def build(ctx):
          ctx(features='...', ...)

- The parent directory’s wscript should contain something like:

      def build(ctx):
          ctx(features='...', ...)
          ctx.recurse('child_dir')
Hierarchical builds
- All target / source / deps file names are interpreted relative
  to the location of the wscript file.
- When executing tasks, Waf sets the working directory to bld
  (or whatever you specified in the main wscript for out).
- Some tools work differently (LaTeX).
- Set the launch directory in Eclipse / Stata / . . .
- Implies you have to be careful when loading data, running
  other files, etc.
Putting things together
Reminder of the structure
+-project_root
  +-waf
  +-wscript
  +-[bld]
  +-[project documentation]
  +-[project documentation.pdf]
  +-[research_paper.pdf]
  +-[research_pres_30min.pdf]
  +-[research_pres_90min.pdf]
  +-src
    +-wscript
    +-documentation
    +-library
    +-manual_input
    +-models
    +-original_data
    +-data_management
    +-analysis
    +-final
    +-paper
Putting things together
wscript

    def configure(ctx):
        ctx.load('biber')
        ctx.load('run_py_script')
        ctx.load('run_do_file')
        ctx.load('sphinx_build')
        ctx.load('write_project_headers')

    def build(ctx):
        ctx.env.PROJECT_PATHS = set_project_paths(ctx)
        ctx.path_to = path_to
        ctx.recurse('src')
Putting things together
src/wscript

    def build(ctx):
        ctx.recurse('library')
        ctx.recurse('models')
        ctx.recurse('manual_input')
        # Adding the build group is required because the project
        # paths have to be generated first. It should not entail
        # any performance penalties -- all tasks before this
        # point generally run very fast.
        ctx.add_group()
        ctx.recurse('data_management')
        ctx.recurse('analysis')
        ctx.recurse('final')
        # Safety measure -- else the LaTeX scanner might fail because a
        # '.tex'-source file has not been generated.
        ctx.add_group()
        ctx.recurse('paper')
        ctx.recurse('documentation')
Do the analysis step-by-step . . .
0. Library, models, manual figures / tables.
1. Data management.
2. Actual analysis (estimation / simulation).
3. Visualisation and results formatting.
4. Combine things into a paper and presentations.
Project hierarchies:
Aside: Manual figures and tables
Just copy everything to tables/figures directories.
+-src
  +-wscript
  +-documentation
  +-library
  +-manual_input
    +-wscript
    +-some_figure.pdf
    +-some_table.tex
    +-etc.
  +-models
  +-original_data
  +-data_management
  +-analysis
  +-final
  +-paper
Project hierarchies:
The code library
+-src
  +-wscript
  +-documentation
  +-library
    +-wscript
    +-stata
      +-ado_ext
      +-ado_local
    +-python
    +-etc.
  +-manual_input
  +-models
  +-original_data
  +-data_management
  +-analysis
  +-final
  +-paper
Absolute or relative paths?
- Absolute path:
      C:\Documents and Settings\me\my_project\data
      /Users/me/my_project/data
- Relative path:
      ..\..\data
      ../../data
- Relative paths are more portable across machines . . .
  - Slash / backslash not too relevant anymore (Stata, Python).
- . . . but paths are relative to the working directory of your
  interpreter / program, rather than to the location of the do-file.
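A minimal sketch of why the launch directory matters: the same relative path points to different places depending on where the program was started. The helper name resolve_from and the two start directories are made up for illustration, and POSIX-style paths are assumed throughout.

```python
import posixpath


def resolve_from(workdir, relative_path):
    """Return the location *relative_path* points to when the program's
    working directory is *workdir* (POSIX-style paths only)."""
    return posixpath.normpath(posixpath.join(workdir, relative_path))


relative_path = '../../data'

# Launched from the directory that contains the script (hypothetical).
from_script_dir = resolve_from('/Users/me/my_project/src/analysis', relative_path)
# Launched from the build directory (hypothetical).
from_bld = resolve_from('/Users/me/my_project/bld', relative_path)

print(from_script_dir)  # /Users/me/my_project/data
print(from_bld)         # /Users/me/data
```

Only the first start directory leads to the intended data directory; the second one silently ends up outside the project.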
Finding your way around with relative paths
- Programs / interpreters typically start in some default
  directory, unless you start them in a special way.
  - Start from the shell → usually the same directory
  - Open a file → often that directory
  - No guarantees.
- Remember Waf launches tasks in bld.
Finding your way around with relative paths
- Python:
  - Find out current path with os.getcwd()
  - Change directories with os.chdir("path/to/some/dir")
- Stata:
  - Find out current path with display c(pwd)
  - Change directories with cd "path/to/some/dir"
- Try to avoid changing the path within programs; rather
  prepend the path to load/save statements etc.
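As a rough illustration of prepending the path instead of changing directories, the sketch below writes a file below a base directory without ever calling os.chdir(). The names base_dir and save_results are hypothetical, and a temporary directory stands in for a real project directory.

```python
import os
import tempfile

# Hypothetical base directory; a temporary directory stands in for it here.
base_dir = tempfile.mkdtemp()


def save_results(text, *path_parts):
    """Write *text* to a file below base_dir and return the file's path.

    The working directory is never changed; the base path is prepended.
    """
    target = os.path.join(base_dir, *path_parts)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, 'w') as results_file:
        results_file.write(text)
    return target


cwd_before = os.getcwd()
written = save_results('This is a simple test.\n', 'analysis', 'results.txt')
print(written)
print(os.getcwd() == cwd_before)
```

Because the working directory stays untouched, code that runs afterwards does not need to know where save_results put its output.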
Use auto-generated absolute paths
- Specify all relevant paths as Waf nodes in the main wscript.
- Create tasks that generate headers with absolute paths for
  all required languages as part of the build.
- Best of both worlds:
  - Portable across machines.
  - No ambiguity with respect to launch directory.
  - No need to worry about changing directories in scripts.
  - No problems of including files at various hierarchical levels.
- Pre-implemented for Stata and Python in project template.
Stata search paths
- Stata looks for do-files in its current directory.
  - Run those with do filename / include filename (the
    latter keeps the local macros defined in filename.do in
    memory).
- Stata looks for programs (ado-files) in the directories
  specified in adopath.
  - For internal use (sysdir) or user-specified additions.
  - E.g. PLUS is used when you install stuff via ssc install,
    PERSONAL is self-explanatory.
- Change things so projects become more self-contained.
import os

# Build directory, as specified in the main wscript.
out = 'bld'


def set_project_paths(ctx):
    """Return a dictionary with project paths represented by Waf nodes."""
    pp = {}
    # The PROJECT_ROOT path will be appended to the PYTHONPATH environmental
    # variable. Do the same in the Eclipse project settings, if applicable.
    pp['PROJECT_ROOT'] = '.'
    pp['IN_DATASET_1'] = 'src/original_data/dataset_1'
    pp['IN_LIBRARY'] = 'src/library'
    pp['IN_MODELS'] = 'src/models'
    pp['OUT_DATA'] = '{}/out/data'.format(out)
    pp['OUT_ANALYSIS'] = '{}/out/analysis'.format(out)
    pp['OUT_FINAL'] = '{}/out/final'.format(out)
    pp['OUT_FIGURES'] = '{}/out/figures'.format(out)
    pp['OUT_TABLES'] = '{}/out/tables'.format(out)
    # Stata's adopaths get special treatment.
    lib = pp['IN_LIBRARY']
    pp['ADO'] = {}
    pp['ADO']['PERSONAL'] = os.path.join(lib, 'stata/ado_ext/personal')
    pp['ADO']['PLUS'] = os.path.join(lib, 'stata/ado_ext/plus')
    pp['ADO']['LOCAL'] = os.path.join(lib, 'stata')
    # Convert the directories into Waf nodes.
    for key, val in pp.items():
        if not key == 'ADO':
            pp[key] = ctx.path.make_node(val)
        else:
            for adokey, adoval in val.items():
                pp[key][adokey] = ctx.path.make_node(adoval)
    return pp
Specifying paths
I. The main wscript

    def path_to(ctx, pp_key, *args):
        """Return the relative path to os.path.join(*args*) in the directory
        PROJECT_PATHS[pp_key] as seen from ctx.path (i.e. the directory of the
        current wscript).

        Use this to get the relative path---as needed by Waf---to a file in one
        of the directory trees defined in the PROJECT_PATHS dictionary above.

        We always pretend everything is in the source directory tree, Waf takes
        care of the correct placing of targets and sources.

        """
        # Unpack args: os.path.join expects separate path components.
        path_in_tree = os.path.join(*args)
        node = ctx.env.PROJECT_PATHS[pp_key].find_or_declare(path_in_tree).get_src()
        return node.path_from(ctx.path)

    def build(ctx):
        ctx.env.PROJECT_PATHS = set_project_paths(ctx)
        ctx.path_to = path_to
        ctx.recurse('src')
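The idea of expressing a file in one of the project trees relative to the directory of the current wscript can be tried out without Waf. The following is only a plain-string analogue under stated assumptions: relative_path_to is a hypothetical helper, all directories are given relative to the project root, and none of Waf's node machinery is involved.

```python
import posixpath


def relative_path_to(wscript_dir, tree_dir, *args):
    """Return the path to posixpath.join(tree_dir, *args) as seen from
    *wscript_dir*. Both directories are given relative to the project root.
    """
    target = posixpath.join(tree_dir, *args)
    return posixpath.relpath(target, start=wscript_dir)


# Seen from src/analysis, where would a log file in out/analysis be?
log_path = relative_path_to('src/analysis', 'out/analysis',
                            'log', 'descriptives.log')
print(log_path)  # ../../out/analysis/log/descriptives.log
```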
Specifying paths
II. src/library/stata/wscript

    def build(ctx):
        ctx(features='write_project_paths', target='project_paths.do')

- Task generator automatically recognises output format for
  the header file based on the extension.
- Analogously in src/library/python/wscript.
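How a header format might be chosen from the target's extension can be sketched in a few lines. This is not the template's implementation: render_path_header and its output details are assumptions made for illustration, and only the two extensions mentioned above are handled.

```python
import os


def render_path_header(paths, target_name):
    """Return header text defining *paths*, in the language implied by the
    extension of *target_name* ('.do' for Stata, '.py' for Python)."""
    ext = os.path.splitext(target_name)[1]
    lines = []
    if ext == '.do':
        # Stata: one global macro per path.
        for key in sorted(paths):
            lines.append('global PATH_{} "{}/"'.format(key, paths[key]))
    elif ext == '.py':
        # Python: one dictionary entry per path.
        lines.append('project_paths = {}')
        for key in sorted(paths):
            lines.append("project_paths['{}'] = r'{}'".format(key, paths[key]))
    else:
        raise ValueError('No header format known for {!r}'.format(ext))
    return '\n'.join(lines) + '\n'


# Example paths, chosen for illustration.
paths = {
    'IN_MODELS': '/Users/project/src/models',
    'OUT_DATA': '/Users/project/bld/out/data',
}
stata_header = render_path_header(paths, 'project_paths.do')
python_header = render_path_header(paths, 'project_paths.py')
print(stata_header)
print(python_header)
```

Keeping the path definitions in one dictionary and only varying the rendering is what keeps the headers for different languages in sync.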
Specifying paths
III. Resulting Stata header file

    //
    // Header with path definitions for entire project.
    //
    // Automatically generated by Waf, do not change!
    //
    // If paths need adjustment, perform those in the root wscript file.
    //
    // Note that the paths are added to the top of the ado-path.
    //

    sysdir set PERSONAL "/Users/project/src/library/stata/ado_ext/personal/"
    adopath ++ "/Users/project/src/library/stata/"
    adopath ++ "/Users/project/bld/src/library/stata/"
    sysdir set PLUS "/Users/project/src/library/stata/ado_ext/plus/"

    global PATH_IN_DATASET_1 "/Users/project/src/original_data/dataset_1/"
    global PATH_IN_LIBRARY "/Users/project/src/library/"
    global PATH_IN_MODELS "/Users/project/src/models/"
    global PATH_OUT_ANALYSIS "/Users/project/bld/out/analysis/"
    global PATH_OUT_DATA "/Users/project/bld/out/data/"
    global PATH_OUT_FIGURES "/Users/project/bld/out/figures/"
    global PATH_OUT_FINAL "/Users/project/bld/out/final/"
    global PATH_OUT_TABLES "/Users/project/bld/out/tables/"
    global PATH_PROJECT_ROOT "/Users/project/"
Specifying paths
IV. Example Stata file

    // Header do-file with path definitions, those end up in global macros.
    include src/library/stata/project_paths
    log using `"${PATH_OUT_ANALYSIS}/log/`1'.log"', replace
    // Delete these lines -- just to check whether everything caught correctly.
    adopath
    macro list

- Note that run_do_file passes the do-file’s name without
  the .do extension as the first argument.
- This ends up in the local macro `1'.
Specifying paths
V. Task for running the Stata file

    def build(ctx):

        def out_analysis(*args):
            return ctx.path_to(ctx, 'OUT_ANALYSIS', *args)

        def out_data(*args):
            return ctx.path_to(ctx, 'OUT_DATA', *args)

        # Illustrate simple use of run_do_file.
        ctx(features='run_do_file',
            source='descriptives.do',
            target=out_analysis('log', 'descriptives.log'),
            deps=['../library/stata/project_paths.do',
                  out_data('streg_example_data.signature')],
            name='descriptives_example')
Aside: Guide to adding Stata packages
Changing the system directory PLUS causes previously made
system-wide additions to become unavailable. To add something:
1. Launch a Stata GUI session.
2. Copy the line starting with sysdir set PLUS from
   bld/src/library/stata/project_paths.do and paste it
   into the Stata command prompt.
3. Install your package, e.g. ssc install tabout.
4. Commit your changes to the scripts-library repository
   (you may have to ask me for write permissions, or you could
   set up your own project like that).
Python packages
- Remember Python looks for files on sys.path.
- Spread modules across different directories using packages.
- Add a (usually empty) file called __init__.py to
  subdirectories you want to be part of your package.
- Then import some_function defined in module.py from
  subdir with:

      from subdir.module import some_function

- Make sure the top directory is on sys.path (hierarchical).
- The structure nests (need __init__.py at every level).
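The package mechanics can be tried out in a throwaway directory. The sketch below builds the subdir/module.py layout from the import statement above in a temporary directory, puts the top directory on sys.path, and imports the function; the function body is made up for illustration.

```python
import importlib
import os
import sys
import tempfile

# Throwaway top directory that will hold the package.
top_dir = tempfile.mkdtemp()
package_dir = os.path.join(top_dir, 'subdir')
os.makedirs(package_dir)

# The (empty) __init__.py marks subdir as part of a package.
with open(os.path.join(package_dir, '__init__.py'), 'w'):
    pass

# A module with one illustrative function.
with open(os.path.join(package_dir, 'module.py'), 'w') as module_file:
    module_file.write(
        'def some_function():\n'
        '    return "hello from subdir.module"\n'
    )

# The top directory (not subdir itself) has to be on sys.path.
sys.path.insert(0, top_dir)
importlib.invalidate_caches()

from subdir.module import some_function

print(some_function())
```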
"""Define a dictionary *project_paths* with path
definitions for the entire project.
This module is automatically generated by Waf, never change it!
If paths need adjustment, change them in the root wscript file.
"""
import os
project_paths = {}
project_paths['IN_DATASET_1'] = r'/Users/project/src/original_data/dataset_1'
project_paths['IN_LIBRARY'] = r'/Users/project/src/library'
project_paths['IN_MODELS'] = r'/Users/project/src/models'
project_paths['OUT_ANALYSIS'] = r'/Users/project/bld/out/analysis'
project_paths['OUT_DATA'] = r'/Users/project/bld/out/data'
project_paths['OUT_FIGURES'] = r'/Users/project/bld/out/figures'
project_paths['OUT_FINAL'] = r'/Users/project/bld/out/final'
project_paths['OUT_TABLES'] = r'/Users/project/bld/out/tables'
project_paths['PROJECT_ROOT'] = r'/Users/project'
def project_paths_join(key, *args):
    """Given input of a *key* in the *project_paths* dictionary and a number
    of path arguments *args*, return the joined path constructed by::

        os.path.join(project_paths[key], *args)

    """
    return os.path.join(project_paths[key], *args)
Specifying paths
VII. Example Python file

    """Example file demonstrating how to import the project_paths_join
    convenience function.

    """

    from bld.src.library.python.project_paths import project_paths_join

    out_path = project_paths_join('OUT_ANALYSIS', 'simulation_results.txt')
    with open(out_path, 'w') as results_file:
        results_file.write('This is a simple test.\n')

- run_py_script adds the project root directory to
  PYTHONPATH.
Project hierarchies:
Model specifications
+-src
  +-wscript
  +-documentation
  +-library
  +-manual_input
  +-models
    +-wscript
    +-baseline.json
    +-robust_unobs_het.json
    +-etc.
  +-original_data
  +-data_management
  +-analysis
  +-final
  +-paper
Where to put model parameters?
And how to best store them?
- Essentially the same issue as with paths:
  - (Might) need them in multiple languages, . . .
  - . . . at least one language and in Waf.
- But it goes deeper than that . . .
Organising the workflow . . .
. . . by steps of the analysis?
Organising the workflow . . .
. . . by model?
Organising the workflow . . .
. . . what if not everything is used at every step?
Organising the workflow . . .
. . . how to minimise code duplication?
- Write code by step of the analysis; think of model
  specifications as libraries?
- Incorporate all models via a for-loop . . .

      forvalues m = 1 / 7 {
          include "`PATH_LIBRARY'/models/model`m'"
          regress `depvar' `exogvars'
      }

- But only if execution time is close to negligible.
- Else you don’t want to re-run all 3 (10? 20?) models if you
  change the assumptions of one of them.
- Difficult to avoid running all steps for all models.
Organising the workflow . . .
. . . how to minimise code duplication?
- Write code by model specification; think of the actual
  computations as libraries?

      local depvar = "ln_income"
      local exogvars = "education female"
      do ../data_management/data_management_main
      do ../analysis/analysis_main
      do ../final/final_main

- Problematic if upstream steps take very long.
- Difficult to avoid running all steps for all models.
Organising the workflow . . .
. . . how to reach Waf’s bliss point?
- Directory organisation by steps of the analysis.
- How to tell step-wise code which model specification to use?
- Run from Waf with command-line options?
  - Gets too involved for complex applications.
  - My task generators currently don’t allow for it.
  - Doesn’t solve the multiple-languages problem.
  - You don’t want to write your own parsers, e.g. in Stata.
Organising the workflow . . .
. . . how to reach Waf’s bliss point?
1. Write model specifications in JSON.
2. Waf tasks convert them to languages without JSON parser.
3. Waf tasks generate new files for every matrix element:
   - Run the main code as (like) a function.
   - This function takes model parameters as input.
4. Waf tasks run these files as usual.
Makes it easy to specify additional dependencies, output files, etc.
in an atomic fashion, either in wscript or in model_x.json.
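One way to picture the "matrix" is as the product of scripts and model specifications, with one task per combination. The sketch below only builds plain task descriptions; task_matrix, the dictionary keys and the naming scheme are hypothetical and are not the template's task generators.

```python
import itertools


def task_matrix(scripts, models):
    """Return one task description per (script, model) combination."""
    tasks = []
    for script, model in itertools.product(scripts, models):
        tasks.append({
            # Unique name, so each combination can be rebuilt on its own.
            'name': '{}_{}'.format(script, model),
            'source': '{}.py'.format(script),
            'model_spec': '{}.json'.format(model),
            'target': '{}_{}.log'.format(script, model),
        })
    return tasks


tasks = task_matrix(['serious_approach'], ['baseline', 'robust_unobs_het'])
for task in tasks:
    print(task['name'], '->', task['target'])
```

Because every combination has its own source, model specification and target, changing one model specification only invalidates the tasks that depend on it.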
JSON example

    {
        "EXPLANATORY_VARIABLES": "placebo",
        "DISTRIBUTION": "weibull",
        "OTHER_STREG_OPTIONS": ""
    }

- Similar syntax to Python: {}, [], integers, floats, . . .
- Stricter: Only double quotes delimit strings, no redundant
  commas, only strings as dictionary keys.
- http://www.json.org/
- An editor that provides JSON (or JavaScript) syntax
  highlighting helps.
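Reading and writing such a specification from Python takes only the standard json module. The sketch below writes the example specification to a temporary file, reads it back, and checks two of the strictness rules; the file location and the helper is_valid_json are made up for illustration.

```python
import json
import os
import tempfile

model = {
    "EXPLANATORY_VARIABLES": "placebo",
    "DISTRIBUTION": "weibull",
    "OTHER_STREG_OPTIONS": "",
}

# Write the specification to a (temporary) JSON file ...
json_path = os.path.join(tempfile.mkdtemp(), 'baseline.json')
with open(json_path, 'w') as json_file:
    json.dump(model, json_file, indent=4, sort_keys=True)

# ... and read it back in.
with open(json_path) as json_file:
    loaded = json.load(json_file)
print(loaded['DISTRIBUTION'])  # weibull


def is_valid_json(text):
    """Return True if *text* parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


# Valid Python dictionary literals that are not valid JSON:
print(is_valid_json("{'DISTRIBUTION': 'weibull'}"))   # single quotes: False
print(is_valid_json('{"DISTRIBUTION": "weibull",}'))  # trailing comma: False
```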
import json
import os

try:
    string_types = (str, unicode)  # Python 2
except NameError:
    string_types = (str,)  # Python 3


def convert_model_json_to_stata(task):
    """Convert a JSON model specification in ``source[0]`` to a Stata
    do-file, storing dictionary entries in globals.

    Require the JSON file to contain a single, non-nested, dictionary.
    Simply write its entries as Stata globals to the target file.

    """
    src_node = task.inputs[0]
    tgt_node = src_node.change_ext('.do')
    task.set_outputs(tgt_node)
    # Close the source file again right after reading it.
    with open(src_node.abspath()) as src_file:
        model_pars = json.load(src_file)
    model_name = os.path.splitext(src_node.name)[0]
    # STATA_MODEL_COMMENT: header template, defined elsewhere in the module.
    tgt_content = STATA_MODEL_COMMENT.format(model_name, src_node.abspath())
    tgt_content += 'global MODEL_NAME = "{}"\n\n'.format(model_name)
    for key, val in model_pars.items():
        # Adjust for Stata string notation
        if isinstance(val, string_types):
            val = '"{}"'.format(val)
        tgt_content += 'global {k} = {v}\n'.format(k=key, v=val)
    return tgt_node.write(tgt_content)
JSON example – Resulting do-file

    //
    // Header with configuration for model:
    //     baseline
    //
    // Automatically generated by Waf, do not change!
    //
    // If model parameters need adjustment, perform those in:
    //     /Users/project/src/models/baseline.json
    //

    global MODEL_NAME = "baseline"

    global DISTRIBUTION = "weibull"
    global EXPLANATORY_VARIABLES = "placebo"
    global OTHER_STREG_OPTIONS = ""
Project hierarchies:
Original data
+-src
  +-wscript
  +-documentation
  +-library
  +-manual_input
  +-models
  +-original_data
    +-dataset_1
    +-dataset_2
    +-documentation
  +-data_management
  +-analysis
  +-final
  +-paper
Project hierarchies:
Step 1: Data management
+-src
  +-wscript
  +-documentation
  +-library
  +-manual_input
  +-models
  +-original_data
  +-data_management
    +-wscript
    +-clean_dataset_1.do
    +-clean_dataset_2.do
    +-etc.
  +-analysis
  +-final
  +-paper
Project hierarchies:
Step 2: Model estimation / simulation
+-src
  +-wscript
  +-documentation
  +-library
  +-manual_input
  +-models
  +-original_data
  +-data_management
  +-analysis
    +-wscript
    +-descriptives.do
    +-regressions_intuition.do
    +-serious_approach.py
    +-etc.
  +-final
  +-paper
Project hierarchies:
Step 3: Visualisation and results formatting
+-src
  +-wscript
  +-documentation
  +-library
  +-manual_input
  +-models
  +-original_data
  +-data_management
  +-analysis
  +-final
    +-wscript
    +-create_tables.py
    +-simple_simulations.py
    +-etc.
  +-paper
Project hierarchies:
Step 4: Paper and presentations.
+-src
  +-wscript
  +-documentation
  +-library
  +-manual_input
  +-models
  +-original_data
  +-data_management
  +-analysis
  +-final
  +-paper
    +-wscript
    +-bib (*)
    +-formulas
      +-utility_function.py
      +-budget_constraint.py
      +-etc.
    +-research_paper.tex
    +-research_pres_30min.tex
    +-research_pres_90min.tex
    +-all_tables.tex
    +-all_figures.tex
Aside: Suggested layout for a “Literature project”
+-project_root
  +-bib
    +-your_latex_references.bib
  +-database
    +-Smith1776.pdf
    +-Keynes1936.pdf
    +-Hayek1944.pdf
  +-class_notes
    +-efficient_programming
      +-01_introduction.pdf
      +-[...]
    +-micro
    +-other
- Only pull project_root/bib as an svn:external into actual
  research projects.
- Manage project_root/database via JabRef / Bibdesk.
Link to the template
- Code says more than 1000 words . . .
      https://coll.gess.uni-mannheim.de/projects/scripts-library
- See directory:
      trunk/templates/project/
- Make sure you follow the instructions closely:
      trunk/templates/project/README.txt
- Feedback welcome!
At the end of this lecture you are able to . . .
- Reflect on organisational structures for research projects.
- Work with hierarchical builds in Waf.
- Differentiate between different “install locations”, handle
  svn:externals.
- Understand relative merits of absolute and relative paths.
- Construct and use Python packages.
- Pass options to code in different ways.
- Read and write JSON-formatted files.
- Work with the project template provided on the server.
Acknowledgements and revision number
- This course is designed after and borrows a lot from the
  Software Carpentry course designed by Greg Wilson for
  scientists and engineers.
- The Software Carpentry course material is made available
  under a Creative Commons Attribution License, as is this
  course’s material.
- Last changed revision: 551
- Last changed date: 2011-11-16 19:49:06 +0100 (Wed, 16 Nov 2011)
License for the course material
You are free:
- to Share: to copy, distribute and transmit the work
- to Remix: to adapt the work
Under the following conditions:
- Attribution: You must attribute the work in the manner specified by the
  author or licensor (but not in any way that suggests that they endorse you
  or your use of the work).
With the understanding that:
- Waiver: Any of the above conditions can be waived if you get permission
  from the copyright holder.
- Public Domain: Where the work or any of its elements is in the public
  domain under applicable law, that status is in no way affected by the license.
- Other Rights: In no way are any of the following rights affected by the
  license:
  - Your fair dealing or fair use rights, or other applicable copyright
    exceptions and limitations;
  - The author’s moral rights;
  - Rights other persons may have either in the work itself or in how the
    work is used, such as publicity or privacy rights.
- Notice: For any reuse or distribution, you must make clear to others the
  license terms of this work. The best way to do this is with a link to this
  web page.