Effective programming practices for economists
12. Sensible project layouts

Hans-Martin von Gaudecker
Department of Economics, Universität Mannheim

Licensed under the Creative Commons Attribution License

At the end of this lecture you are able to . . .

- Reflect on organisational structures for research projects.
- Work with hierarchical builds in Waf.
- Differentiate between different “install locations”, handle svn:externals.
- Understand relative merits of absolute and relative paths.
- Construct and use Python packages.
- Pass options to code in different ways.
- Read and write JSON-formatted files.
- Work with the project template provided on the server.

Need some organisation for complex projects

- So far, files were mostly assumed to live in one directory.
- Regardless of type or purpose: Input, output, data, source code, results, build scripts . . .
- This does not scale up.
- Need to divide things up over various directories.
- No golden rules, but some underlying principles.

Principles for project organisation

- Group files with a common purpose in a directory.
- Encapsulation – related chunks should be stand-alone.
- Separate sources and output.
- Make it easy to find often-used outputs.
- Make it easy to understand for both humans and computers.
- DRY!

+-project_root
  +-waf
  +-wscript
  +-[bld]
    +-src
    +-out
      +-data
      +-analysis
      +-final
      +-figures
      +-tables
  +-[project documentation]
  +-[project documentation.pdf]
  +-[research_paper.pdf]
  +-[research_pres_30min.pdf]
  +-[research_pres_90min.pdf]
  +-src
    +-wscript
    +-documentation
      +-wscript
      +-conf.py
      +-index.rst
      +-introduction.rst
      +-original_data.rst
      +-data_management.rst
      +-analysis.rst
      +-final.rst
      +-paper.rst
      +-library.rst
      +-models.rst
    +-library
      +-wscript
      +-stata
        +-ado_ext
        +-ado_local
      +-python
      +-etc.
    +-manual_input
      +-wscript
      +-some_figure.pdf
      +-some_table.tex
      +-etc.
    +-models
      +-wscript
      +-baseline.json
      +-robust_unobs_het.json
      +-etc.
    +-original_data
      +-dataset_1
      +-dataset_2
      +-documentation
    +-data_management
      +-wscript
      +-clean_dataset_1.do
      +-clean_dataset_2.do
      +-etc.
    +-analysis
      +-wscript
      +-descriptives.do
      +-regressions_intuition.do
      +-serious_approach.py
      +-etc.
    +-final
      +-wscript
      +-create_tables.py
      +-simple_simulations.py
      +-etc.
    +-paper
      +-wscript
      +-bib (*)
      +-formulas
        +-utility_function.py
        +-budget_constraint.py
        +-etc.
      +-research_paper.tex
      +-research_pres_30min.tex
      +-research_pres_90min.tex
      +-all_tables.tex
      +-all_figures.tex

Project hierarchies: The highest level

+-project_root
  +-waf
  +-wscript
  +-[bld]
    +-src
    +-out
      +-data
      +-analysis
      +-final
      +-figures
      +-tables
  +-[project documentation]
  +-[project documentation.pdf]
  +-[research_paper.pdf]
  +-[research_pres_30min.pdf]
  +-[research_pres_90min.pdf]
  +-src
    +-wscript
    +-documentation
    +-library
    +-manual_input
    +-models
    +-original_data
    +-data_management
    +-analysis
    +-final
    +-paper

Hierarchical builds

- Writing all build instructions in a single wscript file would quickly become a mess.
- Add a wscript file to a directory further down the hierarchy; the only required content is:

    def build(ctx):
        ctx(features='...', ...)

- The parent directory’s wscript should contain something like:

    def build(ctx):
        ctx(features='...', ...)
        ctx.recurse('child_dir')

Hierarchical builds

- All target / source / deps file names are interpreted relative to the location of the wscript file.
- When executing tasks, Waf sets the working directory to bld (or whatever you specified in the main wscript for out).
- Some tools work differently (LaTeX).
- Set the launch directory in Eclipse / Stata / . . .
- Implies you have to be careful when loading data, running other files, etc.
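Aside: why the launch directory matters

The last point can be illustrated with a minimal Python sketch. It is not taken from the project template: the directory names, the file x.csv and both helper functions are hypothetical, and a throwaway temporary directory stands in for a project. The same relative path is resolved once from the launch directory (as the interpreter would do it) and once from the directory that holds the script.

```python
import os
import tempfile


def resolve_from_cwd(relative_path):
    """Return the absolute location a program would open for *relative_path*."""
    return os.path.abspath(relative_path)


def resolve_from_anchor(anchor_dir, relative_path):
    """Return the absolute location of *relative_path* as seen from *anchor_dir*."""
    return os.path.normpath(os.path.join(anchor_dir, relative_path))


# Hypothetical layout, created in a throwaway directory:
#   <root>/src/analysis  holds the script,
#   <root>/bld           is where the build tool launches it.
root = os.path.realpath(tempfile.mkdtemp())
script_dir = os.path.join(root, "src", "analysis")
launch_dir = os.path.join(root, "bld")
os.makedirs(script_dir)
os.makedirs(launch_dir)

relative_path = os.path.join("..", "data", "x.csv")

original_cwd = os.getcwd()
try:
    # Mimic a task that is launched from the build directory.
    os.chdir(launch_dir)
    seen_from_cwd = resolve_from_cwd(relative_path)
finally:
    os.chdir(original_cwd)
seen_from_script = resolve_from_anchor(script_dir, relative_path)

print(seen_from_cwd)     # <root>/data/x.csv
print(seen_from_script)  # <root>/src/data/x.csv
```

The two results differ although the relative path is identical. Inside a real script, os.path.dirname(os.path.abspath(__file__)) could play the role of script_dir.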
Putting things together

Reminder of the structure

+-project_root
  +-waf
  +-wscript
  +-[bld]
  +-[project documentation]
  +-[project documentation.pdf]
  +-[research_paper.pdf]
  +-[research_pres_30min.pdf]
  +-[research_pres_90min.pdf]
  +-src
    +-wscript
    +-documentation
    +-library
    +-manual_input
    +-models
    +-original_data
    +-data_management
    +-analysis
    +-final
    +-paper

Putting things together

wscript

    def configure(ctx):
        ctx.load('biber')
        ctx.load('run_py_script')
        ctx.load('run_do_file')
        ctx.load('sphinx_build')
        ctx.load('write_project_headers')

    def build(ctx):
        ctx.env.PROJECT_PATHS = set_project_paths(ctx)
        ctx.path_to = path_to
        ctx.recurse('src')

Putting things together

src/wscript

    def build(ctx):
        ctx.recurse('library')
        ctx.recurse('models')
        ctx.recurse('manual_input')
        # Adding the build group is required because the project
        # paths have to be generated first. It should not entail
        # any performance penalties -- all tasks before this
        # point generally run very fast.
        ctx.add_group()
        ctx.recurse('data_management')
        ctx.recurse('analysis')
        ctx.recurse('final')
        # Safety measure -- else the LaTeX scanner might fail because a
        # '.tex'-source file has not been generated.
        ctx.add_group()
        ctx.recurse('paper')
        ctx.recurse('documentation')

Do the analysis step-by-step . . .

0. Library, models, manual figures / tables.
1. Data management.
2. Actual analysis (estimation / simulation).
3. Visualisation and results formatting.
4. Combine things into a paper and presentations.

Project hierarchies: Aside: Manual figures and tables

Just copy everything to tables/figures directories.

+-src
  +-manual_input
    +-wscript
    +-some_figure.pdf
    +-some_table.tex
    +-etc.

Project hierarchies: The code library

+-src
  +-library
    +-wscript
    +-stata
      +-ado_ext
      +-ado_local
    +-python
    +-etc.

Absolute or relative paths?

- Absolute path:
    C:\Documents and Settings\me\my_project\data
    /Users/me/my_project/data
- Relative path:
    ..\..\data
    ../../data
- Relative paths are more portable across machines . . .
  - Slash / backslash not too relevant anymore (Stata, Python).
- . . . but paths are relative to the location of your interpreter / program rather than the do-file.

Finding your way around with relative paths

- Programs / interpreters typically start in some default directory, unless you start them in a special way.
  - Start from the shell → usually the same directory
  - Open a file → often that directory
  - No guarantees.
- Remember Waf launches tasks in bld.

Finding your way around with relative paths

- Python:
  - Find out current path with os.getcwd()
  - Change directories with os.chdir("path/to/some/dir")
- Stata:
  - Find out current path with display c(pwd)
  - Change directories with cd "path/to/some/dir"
- Try to avoid changing the path within programs; rather, prepend the path to load/save statements etc.

Use auto-generated absolute paths

- Specify all relevant paths as Waf nodes in the main wscript.
- Create tasks that generate headers with absolute paths for all required languages as part of the build.
- Best of both worlds:
  - Portable across machines.
  - No ambiguity with respect to launch directory.
  - No need to worry about changing directories in scripts.
  - No problems of including files at various hierarchical levels.
- Pre-implemented for Stata and Python in the project template.

Stata search paths

- Stata looks for do-files in its current directory.
  - Run those with do filename / include filename (the latter keeps the local macros defined in filename.do in memory).
- Stata looks for programs (ado-files) in the directories specified in adopath.
  - For internal use (sysdir) or user-specified additions.
  - E.g. PLUS is used when you install stuff via ssc install, PERSONAL is self-explanatory.
- Change things so projects become more self-contained.

    import os

    # 'out' is the build directory set at the top of the main wscript (bld here).

    def set_project_paths(ctx):
        """Return a dictionary with project paths represented by Waf nodes."""

        pp = {}
        # The PROJECT_ROOT path will be appended to the PYTHONPATH environment variable.
        # Do the same in the Eclipse project settings, if applicable.
        pp['PROJECT_ROOT'] = '.'
        pp['IN_DATASET_1'] = 'src/original_data/dataset_1'
        pp['IN_LIBRARY'] = 'src/library'
        pp['IN_MODELS'] = 'src/models'
        pp['OUT_DATA'] = '{}/out/data'.format(out)
        pp['OUT_ANALYSIS'] = '{}/out/analysis'.format(out)
        pp['OUT_FINAL'] = '{}/out/final'.format(out)
        pp['OUT_FIGURES'] = '{}/out/figures'.format(out)
        pp['OUT_TABLES'] = '{}/out/tables'.format(out)
        # Stata's adopaths get special treatment.
        lib = pp['IN_LIBRARY']
        pp['ADO'] = {}
        pp['ADO']['PERSONAL'] = os.path.join(lib, 'stata/ado_ext/personal')
        pp['ADO']['PLUS'] = os.path.join(lib, 'stata/ado_ext/plus')
        pp['ADO']['LOCAL'] = os.path.join(lib, 'stata')
        # Convert the directories into Waf nodes.
        for key, val in pp.items():
            if key != 'ADO':
                pp[key] = ctx.path.make_node(val)
            else:
                for adokey, adoval in val.items():
                    pp[key][adokey] = ctx.path.make_node(adoval)
        return pp

Specifying paths I. The main wscript

    def path_to(ctx, pp_key, *args):
        """Return the relative path to os.path.join(*args*) in the directory
        PROJECT_PATHS[pp_key] as seen from ctx.path (i.e. the directory of the
        current wscript).

        Use this to get the relative path---as needed by Waf---to a file in one
        of the directory trees defined in the PROJECT_PATHS dictionary above.

        We always pretend everything is in the source directory tree, Waf takes
        care of the correct placing of targets and sources.

        """
        # Unpack args: os.path.join expects the path components one by one.
        path_in_tree = os.path.join(*args)
        node = ctx.env.PROJECT_PATHS[pp_key].find_or_declare(path_in_tree).get_src()
        return node.path_from(ctx.path)

    def build(ctx):
        ctx.env.PROJECT_PATHS = set_project_paths(ctx)
        ctx.path_to = path_to
        ctx.recurse('src')

Specifying paths II. src/library/stata/wscript

    def build(ctx):
        ctx(features='write_project_paths', target='project_paths.do')

- Task generator automatically recognises output format for the header file based on the extension.
- Analogously in src/library/python/wscript.

Specifying paths III. Resulting Stata header file

    //
    // Header with path definitions for entire project.
    //
    // Automatically generated by Waf, do not change!
    //
    // If paths need adjustment, perform those in the root wscript file.
    //
    // Note that the paths are added to the top of the ado-path.
    //
    sysdir set PERSONAL "/Users/project/src/library/stata/ado_ext/personal/"
    adopath ++ "/Users/project/src/library/stata/"
    adopath ++ "/Users/project/bld/src/library/stata/"
    sysdir set PLUS "/Users/project/src/library/stata/ado_ext/plus/"

    global PATH_IN_DATASET_1 "/Users/project/src/original_data/dataset_1/"
    global PATH_IN_LIBRARY "/Users/project/src/library/"
    global PATH_IN_MODELS "/Users/project/src/models/"
    global PATH_OUT_ANALYSIS "/Users/project/bld/out/analysis/"
    global PATH_OUT_DATA "/Users/project/bld/out/data/"
    global PATH_OUT_FIGURES "/Users/project/bld/out/figures/"
    global PATH_OUT_FINAL "/Users/project/bld/out/final/"
    global PATH_OUT_TABLES "/Users/project/bld/out/tables/"
    global PATH_PROJECT_ROOT "/Users/project/"

Specifying paths IV. Example Stata file

    // Header do-file with path definitions, those end up in global macros.
    include src/library/stata/project_paths
    log using `"${PATH_OUT_ANALYSIS}/log/`1'.log"', replace

    // Delete these lines -- just to check whether everything caught correctly.
    adopath
    macro list

- Note that run_do_file passes the do-file’s name without the .do extension as the first argument.
- This ends up in the local macro `1'.

Specifying paths V. Task for running the Stata file

    def build(ctx):

        def out_analysis(*args):
            return ctx.path_to(ctx, 'OUT_ANALYSIS', *args)

        def out_data(*args):
            return ctx.path_to(ctx, 'OUT_DATA', *args)

        # Illustrate simple use of run_do_file.
        ctx(
            features='run_do_file',
            source='descriptives.do',
            target=out_analysis('log', 'descriptives.log'),
            deps=[
                '../library/stata/project_paths.do',
                out_data('streg_example_data.signature')
            ],
            name='descriptives_example'
        )

Aside: Guide to adding Stata packages

Changing the system directory PLUS causes previously made system-wide additions to become unavailable. To add something:

1. Launch a Stata GUI session.
2. Copy the line starting with sysdir set PLUS from bld/src/library/stata/project_paths.do and paste it into the Stata command prompt.
3. Install your package, e.g. ssc install tabout.
4. Commit your changes to the scripts-library repository (you may have to ask me for write permissions, or you could set up your own project like that).

Python packages

- Remember Python looks for files on sys.path.
- Spread modules across different directories using packages.
  - Add a (usually empty) file called __init__.py to subdirectories you want to be part of your package.
- Then import some_function defined in module.py from subdir with:

    from subdir.module import some_function

- Make sure the top directory is on sys.path (hierarchical).
- The structure nests (need __init__.py at every level).

    """Define a dictionary *project_paths* with path definitions for the entire
    project.

    This module is automatically generated by Waf, never change it!

    If paths need adjustment, change them in the root wscript file.

    """

    import os

    project_paths = {}

    project_paths['IN_DATASET_1'] = r'/Users/project/src/original_data/dataset_1'
    project_paths['IN_LIBRARY'] = r'/Users/project/src/library'
    project_paths['IN_MODELS'] = r'/Users/project/src/models'
    project_paths['OUT_ANALYSIS'] = r'/Users/project/bld/out/analysis'
    project_paths['OUT_DATA'] = r'/Users/project/bld/out/data'
    project_paths['OUT_FIGURES'] = r'/Users/project/bld/out/figures'
    project_paths['OUT_FINAL'] = r'/Users/project/bld/out/final'
    project_paths['OUT_TABLES'] = r'/Users/project/bld/out/tables'
    project_paths['PROJECT_ROOT'] = r'/Users/project'

    def project_paths_join(key, *args):
        """Given input of a *key* in the *project_paths* dictionary and a number
        of path arguments *args*, return the joined path constructed by::

            os.path.join(project_paths[key], *args)

        """
        return os.path.join(project_paths[key], *args)

Specifying paths VII. Example Python file

    """Example file demonstrating how to import the project_paths_join
    convenience function.

    """

    from bld.src.library.python.project_paths import project_paths_join

    out_path = project_paths_join('OUT_ANALYSIS', 'simulation_results.txt')
    with open(out_path, 'w') as results_file:
        results_file.write('This is a simple test.\n')

- run_py_script adds the project root directory to PYTHONPATH.

Project hierarchies: Model specifications

+-src
  +-models
    +-wscript
    +-baseline.json
    +-robust_unobs_het.json
    +-etc.

Where to put model parameters? And how to best store them?

- Essentially the same issue as with paths:
  - (Might) need them in multiple languages, . . .
  - . . . at least one language and in Waf.
- But it goes deeper than that . . .

Organising the workflow . . .
. . . by steps of the analysis?
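Aside: model parameters in a language-neutral file

Picking up the question of where to put model parameters: one option, which the later slides develop, is to keep each specification in a JSON file that any language with a JSON parser can read. The following is a minimal sketch for the Python side only. It assumes a flat specification with the keys of the baseline example shown further on; the temporary directory and the two helper functions are hypothetical and not part of the project template.

```python
import json
import os
import tempfile

# Keys as in the baseline specification shown later in these slides.
baseline = {
    "EXPLANATORY_VARIABLES": "placebo",
    "DISTRIBUTION": "weibull",
    "OTHER_STREG_OPTIONS": "",
}


def write_model_spec(path, spec):
    """Write the flat dictionary *spec* to *path* as JSON."""
    with open(path, "w") as spec_file:
        json.dump(spec, spec_file, indent=4, sort_keys=True)


def read_model_spec(path):
    """Return the dictionary stored in the JSON file at *path*."""
    with open(path) as spec_file:
        return json.load(spec_file)


# Throwaway location, purely for illustration.
spec_path = os.path.join(tempfile.mkdtemp(), "baseline.json")
write_model_spec(spec_path, baseline)
model = read_model_spec(spec_path)

print(model["DISTRIBUTION"])
```

Reading the file back returns an ordinary Python dictionary, so step-wise code can look up its settings by key instead of hard-coding them.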
Organising the workflow . . .
. . . by model?

Organising the workflow . . .
. . . what if not everything is used at every step?

Organising the workflow . . .
. . . how to minimise code duplication?

- Write code by step of the analysis; think of model specifications as libraries?
- Incorporate all models via a for-loop . . .

    forvalues m = 1 / 7 {
        include "`PATH_LIBRARY'/models/model`m'"
        regress `depvar' `exogvars'
    }

- But only if execution time is close to negligible.
- Else you don’t want to re-run all 3 (10? 20?) models if you change the assumptions of one of them.
- Difficult to avoid running all steps for all models.

Organising the workflow . . .
. . . how to minimise code duplication?

- Write code by model specification; think of the actual computations as libraries?

    local depvar = "ln_income"
    local exogvars = "education female"
    do ../data_management/data_management_main
    do ../analysis/analysis_main
    do ../final/final_main

- Problematic if upstream steps take very long.
- Difficult to avoid running all steps for all models.

Organising the workflow . . .
. . . how to reach Waf’s bliss point?

Organising the workflow . . .
. . . how to reach Waf’s bliss point?

- Directory organisation by steps of the analysis.
- How to tell step-wise code which model specification to use?
- Run from Waf with command-line options?
- Gets too involved for complex applications.
- My task generators currently don’t allow for it.
- Doesn’t solve the multiple-languages problem.
- You don’t want to write your own parsers, e.g. in Stata.

Organising the workflow . . .
. . . how to reach Waf’s bliss point?

1. Write model specifications in JSON.
2. Waf tasks convert them to languages without a JSON parser.
3. Waf tasks generate new files for every matrix element:
   - Run the main code as (like) a function.
   - This function takes model parameters as input.
4. Waf tasks run these files as usual.

Makes it easy to specify additional dependencies, output files, etc. in an atomic fashion, either in wscript or in model_x.json.

JSON example

    {
        "EXPLANATORY_VARIABLES": "placebo",
        "DISTRIBUTION": "weibull",
        "OTHER_STREG_OPTIONS": ""
    }

- Similar syntax to Python: {}, [], integers, floats, . . .
- Stricter: Only double quotes delimit strings, no trailing commas, only strings as dictionary keys.
- http://www.json.org/
- An editor that provides JSON (or Javascript) syntax highlighting helps.

    import json
    import os

    def convert_model_json_to_stata(task):
        """Convert a JSON model specification in ``source[0]`` to a Stata do-file,
        storing dictionary entries in globals.

        Require the JSON file to contain a single, non-nested, dictionary. Simply
        write its entries as Stata globals to the target file.

        """
        src_node = task.inputs[0]
        tgt_node = src_node.change_ext('.do')
        task.set_outputs(tgt_node)
        # Close the file handle once the JSON has been parsed.
        with open(src_node.abspath()) as src_file:
            model_pars = json.load(src_file)
        model_name = os.path.splitext(src_node.name)[0]
        # STATA_MODEL_COMMENT is the header template; it is defined elsewhere
        # (not shown on this slide).
        tgt_content = STATA_MODEL_COMMENT.format(model_name, src_node.abspath())
        tgt_content += 'global MODEL_NAME = "{}"\n\n'.format(model_name)
        for key, val in model_pars.items():
            # Adjust for Stata string notation (on Python 2, also test for unicode).
            if isinstance(val, str):
                val = '"{}"'.format(val)
            tgt_content += 'global {k} = {v}\n'.format(k=key, v=val)
        return tgt_node.write(tgt_content)

JSON example – Resulting do-file

    //
    // Header with configuration for model:
    //     baseline
    //
    // Automatically generated by Waf, do not change!
    //
    // If model parameters need adjustment, perform those in:
    //     /Users/project/src/models/baseline.json
    //
    global MODEL_NAME = "baseline"

    global DISTRIBUTION = "weibull"
    global EXPLANATORY_VARIABLES = "placebo"
    global OTHER_STREG_OPTIONS = ""

Project hierarchies: Original data

+-src
  +-original_data
    +-dataset_1
    +-dataset_2
    +-documentation

Project hierarchies: Step 1: Data management

+-src
  +-data_management
    +-wscript
    +-clean_dataset_1.do
    +-clean_dataset_2.do
    +-etc.

Project hierarchies: Step 2: Model estimation / simulation

+-src
  +-analysis
    +-wscript
    +-descriptives.do
    +-regressions_intuition.do
    +-serious_approach.py
    +-etc.

Project hierarchies: Step 3: Visualisation and results formatting

+-src
  +-final
    +-wscript
    +-create_tables.py
    +-simple_simulations.py
    +-etc.

Project hierarchies: Step 4: Paper and presentations

+-src
  +-paper
    +-wscript
    +-bib (*)
    +-formulas
      +-utility_function.py
      +-budget_constraint.py
      +-etc.
    +-research_paper.tex
    +-research_pres_30min.tex
    +-research_pres_90min.tex
    +-all_tables.tex
    +-all_figures.tex

Aside: Suggested layout for a “Literature project”

+-project_root
  +-bib
    +-your_latex_references.bib
  +-database
    +-Smith1776.pdf
    +-Keynes1936.pdf
    +-Hayek1944.pdf
  +-class_notes
    +-efficient_programming
      +-01_introduction.pdf
      +-[...]
    +-micro
    +-other

- Only pull project_root/bib as an svn:external into actual research projects.
- Manage project_root/database via JabRef / Bibdesk.

Link to the template

- Code says more than 1000 words . . .
  https://coll.gess.uni-mannheim.de/projects/scripts-library
- See directory: trunk/templates/project/
- Make sure you follow the instructions closely: trunk/templates/project/README.txt
- Feedback welcome!

At the end of this lecture you are able to . . .

- Reflect on organisational structures for research projects.
- Work with hierarchical builds in Waf.
- Differentiate between different “install locations”, handle svn:externals.
- Understand relative merits of absolute and relative paths.
- Construct and use Python packages.
- Pass options to code in different ways.
- Read and write JSON-formatted files.
- Work with the project template provided on the server.

Acknowledgements and revision number

- This course is designed after and borrows a lot from the Software Carpentry course designed by Greg Wilson for scientists and engineers.
- The Software Carpentry course material is made available under a Creative Commons Attribution License, as is this course’s material.
- Last changed revision: 551
- Last changed date: 2011-11-16 19:49:06 +0100 (Wed, 16 Nov 2011)

License for the course material

[Links to the full legal text and the source text for this page.]
You are free:

- to Share: to copy, distribute and transmit the work
- to Remix: to adapt the work

Under the following conditions:

- Attribution: You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

With the understanding that:

- Waiver: Any of the above conditions can be waived if you get permission from the copyright holder.
- Public Domain: Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
- Other Rights: In no way are any of the following rights affected by the license:
  - Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
  - The author’s moral rights;
  - Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.

Notice: For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to this web page.