Global Conflict Risk Index – Code Doc
Author: Daniel Frederik Mandrella
Contact: Tom De Groeve (tom.de-groeve@jrc.ec.europa.eu)
23 February 2015
Contents
Introduction
File explanations
    Code
        constants.r
        functions.r
        models.r
        running_models.r
        imputation.r
        preprocessing.r
        experimental_code.r
    Data
        historical data v3.0.csv
        historical data with YB and Unemployment.csv
        unprocessed_current_data.csv
        preprocessed_current_data.csv
        preprocessed_input_data_unimputed.csv
        preprocessed_input_data_unimputed_new_vars.csv
        historical_data_imputed_by_R.csv
        historical_data_imputed_by_R_new_vars.csv
How to use existing code
    1. Preprocessing
    2. Imputation
    3. Running the models
How to add new models
Function hierarchy and relationship diagram
Helpful websites/links
Glossary
Documentation current as of 17-02-2015.
Introduction
The purpose of this documentation is to explain the structure and purpose of the existing code and
related tools for the Global Conflict Risk Index (GCRI), how to modify the code, and how to extend it.
As the code is a work in progress, so is this documentation, and the two might diverge at some points.
In such cases, critical evaluation is always required in order to find out which of the two (the code
or the documentation) is wrong or outdated.
The next section of this documentation briefly explains which code is in which files, and what to do
with it. After that, an explanation is provided of how to use the existing code to run models, how to
apply them, what to look out for, and so on. Then follows a run-down of how to modify the code, e.g.
to add new models, and which modifications are necessary in which places. A collection of helpful
links and websites can be found in the subsequent section; these are mostly for finding answers to
technical or statistical questions, as well as general explanations of the approaches used so far. The
final section is a brief glossary explaining some of the vocabulary used in the code, variable names,
code comments, etc.
The current iteration and most updated version of the code can be found in the folder:
G:\PROJECTS_CRITECH\DG_2014_EEAS_CONFLICTRISKINDEX\METHODOLOGY\MODEL\MODEL 2015 FEBRUARY
File explanations
Code
constants.r
This file contains all constants that are used throughout the other code files. This is also
where the kClassifierThreshold variable can be found and changed if needed. A minimal sketch is shown below.
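A minimal, hypothetical sketch of what constants.r looks like; the value shown for kClassifierThreshold is illustrative, not the value actually used in the project.

    # Threshold used to combine probability and intensity models (illustrative value)
    kClassifierThreshold <- 0.5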
functions.r
This file contains all R-functions for running the models, applying them to both historical and most
recent data, etc. It does not execute any code and is just a “repository” of sorts for all functions (except
the models) that are used throughout the other files. If you find any kind of function anywhere that is
not a statistical model, look for it in this file. For more details on each function, see directly in their
definitions in the file. Every function is extensively commented with expected inputs, outputs, general
descriptions of their functionalities, etc.
models.r
This file contains all statistical models. Every model is realized as a function, and every function/model
basically executes the same steps, and accepts the same arguments. For a more detailed explanation
of the arguments, see in the file directly.
running_models.r
This is the central file that contains the scripts to actually use the functions from functions.r and
the models from models.r. If any kind of output other than imputation or preprocessing is to be
produced, do it with the code from this file.
imputation.r
This file contains the scripts to impute all data sets with missing values, i.e. the historical dataset.
The functions to do this are in functions.r, so this file only sources and executes them, and additionally
does some data munging/arranging. Imputation is done AFTER preprocessing (see below).
preprocessing.r
File containing the scripts to do various smaller preprocessing tasks, including computing whether any
given country-year is a ‘peace’, ‘war’, or ‘onset’ year. Also adds a new column containing the country’s
region according to EEAS classification. This is needed for the models incorporating region as a random
factor.
experimental_code.r
As the name suggests: some first tries at Bayesian regression analysis and related approaches. Not
used to produce any kind of output.
Data
historical data v3.0.csv
As the name suggests, this is the csv file holding the historical data from 1989 to 2009 for all
independent variables, conflict data, and country-years. It does not include war/peace/onset columns,
and uses the more verbose column names (i.e. GCRI.POL.REG_P2 vs. non-verbose REG_P2). This file is
read in by preprocessing.r for preprocessing purposes.
historical data with YB and Unemployment.csv
This data set is identical to “historical data v3.0.csv” except that it also includes a few new independent
variables, notably CORRUPT, YOUTHBMALE, YOUTHBBOTH, and INEQ_SWIID. YOUTHBMALE and
YOUTHBBOTH refer to the youth bulge calculated for males only and for both genders, respectively.
unprocessed_current_data.csv
This is the csv file containing the most recent/current data, e.g. only the data for all countries in 2014.
It is supposed to be used to apply the models and come up with predictions.
preprocessed_current_data.csv
As the name suggests, this is the preprocessed version of the above.
preprocessed_input_data_unimputed.csv
This is the historical data without the new variables after it has gone through preprocessing, but before
it has been imputed.
preprocessed_input_data_unimputed_new_vars.csv
This is the historical data with new variables after it has gone through preprocessing, but before it has
been imputed.
historical_data_imputed_by_R.csv
This is the historical data without new variables after it has gone through both preprocessing and
imputation. It is now ready to be used for running the models.
historical_data_imputed_by_R_new_vars.csv
This is the historical data with new variables after it has gone through both preprocessing and
imputation. It is now ready to be used for running the models.
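As a minimal, hypothetical usage example, one of the imputed files described above can be loaded for modelling like this (the file is assumed to be in the working directory):

    # Load the imputed historical data (without the new variables)
    historical.data <- read.csv("historical_data_imputed_by_R.csv",
                                stringsAsFactors = FALSE)
    str(historical.data)   # inspect the columns (REG_P2, war/peace/onset, region, ...)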
How to use existing code
Data processing workflow (reconstructed from the original diagram): the scripts are run in the order
1. preprocessing.r, 2. imputation.r, 3. running_models.r; each of the three source()s functions.r,
models.r, and constants.r.
1. Preprocessing
The first step is to preprocess the data sets. The current iteration of the preprocessing script does
the following:
- It reads in the historical data
- It orders the historical data first by ISO code, then by year, such that:
    o AFG 2009
    o AFG 2008
    o …
    o ZWE 1990
    o ZWE 1989
- It then calculates Boolean vectors for single-year Violent and Highly Violent conflicts, for each
  conflict dimension, and attaches these new vectors as columns to the historical data set
- It then switches the column names from verbose to less verbose (i.e. from GCRI.POL.REG_P2
  to just REG_P2), computes whether a given country-year is in ‘war’, ‘peace’, or ‘onset’ for
  every conflict dimension and intensity level (i.e. four new columns), and adds a new column
  denoting every country’s region according to the EEAS classification
Note that the script currently does this twice: once for the data set without the new variables (YOUTHB
and UNEMP at the moment) and once for the data set that includes them. The script also preprocesses
the most recent data by switching column names and converting the relevant columns to numeric. A
minimal sketch of two of these steps follows.
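Below is an illustrative sketch of two of the preprocessing steps; the column names ISO and YEAR are assumptions, not necessarily the names used in preprocessing.r.

    # Read the historical data
    historical <- read.csv("historical data v3.0.csv", stringsAsFactors = FALSE)

    # Order by ISO code ascending, then by year descending (AFG 2009, AFG 2008, ...)
    historical <- historical[order(historical$ISO, -historical$YEAR), ]

    # Switch verbose column names to the short form, e.g. GCRI.POL.REG_P2 -> REG_P2
    names(historical) <- sub("^GCRI\\.[A-Z]+\\.", "", names(historical))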
2. Imputation
The imputation script first loads the preprocessed data (make sure that the name of the output
file from preprocessing.r matches the name of the input file in imputation.r). It splits the data
into several data frames because only the main variables need to be imputed, and the other variables
should not be used to impute them. The MICE (Multiple Imputation by Chained Equations) package
then imputes the data 10 times with the mice() function. The complete() function then creates a
complete data set by combining one of the imputations with the originally existing data (these
completed data sets are the variables d0, d1, d2, …, etc.). The imputations are then simply averaged,
the full data set is reconstructed (reversing the split from the beginning), and the result is saved
to a new csv file.
Similarly to the preprocessing.r script, the imputation is done twice: once for the data set without the
new variables, and once for the data set with them. Execute whichever one is needed; the two are
independent of each other, each loading the required packages, setting the random number
generator seed, etc. A condensed sketch of the core imputation logic follows.
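The following is a condensed, illustrative sketch of the imputation logic, assuming a data frame main.vars that holds only the (numeric) main variables to be imputed; the real script stores the completed data sets in the variables d0, d1, d2, etc., whereas this sketch uses a loop.

    library(mice)
    set.seed(12345)                  # the actual seed value is set in the script

    imp <- mice(main.vars, m = 10)   # 10 imputations via chained equations

    # Average the 10 completed data sets (element-wise, numeric columns only)
    completed <- lapply(1:10, function(i) complete(imp, i))
    averaged  <- Reduce(`+`, completed) / 10

    write.csv(averaged, "historical_data_imputed_by_R.csv", row.names = FALSE)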
3. Running the models
The running_models.r script is the centrepiece of the code base, from which all actual execution
of models is done. The file is partitioned into a few sections. The first one, called “Setup”,
loads the required packages, sources the functions.r, models.r, and constants.r files, and sets the
seed for the random number generator. It also sets the working directory: this needs to be changed
according to where the other files actually are. If the next iteration of the overall model is, for
example, Model 2015 March, then the path needs to be changed accordingly; otherwise the script might
accidentally execute models and code from a previous iteration. A sketch of such a “Setup” section
is shown below.
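A minimal sketch of a typical “Setup” section as described above; the package list and seed value are illustrative.

    # Working directory: adjust to the current model iteration
    setwd("G:/PROJECTS_CRITECH/DG_2014_EEAS_CONFLICTRISKINDEX/METHODOLOGY/MODEL/MODEL 2015 FEBRUARY")

    # Load required packages (illustrative; see the actual script for the full list)
    library(mice)

    # Source the helper files
    source("constants.r")
    source("functions.r")
    source("models.r")

    # Set the random number generator seed (value is illustrative)
    set.seed(12345)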
Running one or more models works like this (a consolidated sketch follows the list):
- A list object simply called “models” should contain named (!) elements corresponding to the
  model functions from models.r that should be run, i.e. an element called BaseModel(), an
  element called AllInteractions(), etc.
- Model functions can be added to the “models” list simply by assigning them like this:
    o models[['ModelName']] <- ModelName
    o ModelName has to be a model function specified in models.r
- The string in the double brackets is the name the element is given in the list, and the list can
  be indexed by these names (in other programming languages, e.g. Python, this is akin to a
  dictionary where the keys are strings and the values are functions)
- The named list of models is then passed to the runModels() function, along with the data with
  which the models should be trained and evaluated
    o A variety of additional arguments can be passed to runModels(), including the
      threshold, whether cross validation should be used, etc. – see the function specification in
      functions.r for more details
    o Note that the data passed to runModels() should be the historical data; this function
      is only supposed to run a variety of models and compare their performance – if the
      models should be applied, see the applyModels() function instead
- Since the output of runModels() is difficult to read if many models have been run, the
  function compareModels() summarizes the results in a comparison matrix where the different
  metrics are in the columns and the model names are in the rows; this function can also save
  the comparison matrix to a .csv file
- The compareModels() function also accepts an additional string via the custom.name argument
  that is added to the file name in case the matrix is saved to csv. This is useful if the same
  models are run with different thresholds, with or without cross validation, etc., so that in
  the end the different comparison matrices do not overwrite each other
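Below is a consolidated, illustrative sketch of this workflow. BaseModel and AllInteractions are the example model functions named in this document; the exact arguments of runModels() and compareModels() (other than custom.name) are assumptions, so check functions.r for the real signatures.

    # Build the named list of models
    models <- list()
    models[['BaseModel']]       <- BaseModel
    models[['AllInteractions']] <- AllInteractions

    # Run and compare the models on the historical data
    results <- runModels(models, historical.data)

    # Summarize the results in a comparison matrix and save it to csv
    compareModels(results, custom.name = 'with-cross-validation')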
Applying the models to the most recent data works similarly to running the models to compare
performances. The list of models that should be applied is passed to the applyModels() function, along
with the training data (“train.data” argument) and the most recent data (“apply.data” argument).
Additional arguments exist, see the function specification for details. Note that the applyModels()
function does not depend on having previously executed the models via runModels() or similar.
Instead, it runs the models on the training data (without cross validation), and then predicts new
intensity values for the most recent data via predict().
The reasoning behind this setup is as follows: with the runModels() function, the best model in terms
of performance can be found. Preferably, this is done via cross validation to reduce the risk of
overfitting when using the whole data set for training and testing. If it is done via cross validation, in
each iteration (i.e. for each fold), only a subset of the data is used to calculate the coefficients.
However, when applying the model to the most recent data, all the data should be used to train the
model since this choice of parameters (the variables) and statistical model (logistic, zero-inflated
negative binomial, etc.) has been found to perform best with cross validation. So instead of using a
subset of the historical data, all historical data is used in applyModels().
The output from applyModels() is a named list where each element consists of the most recent data
plus six columns representing that specific model’s predicted probabilities, intensities, and overall
predictions for both the national power and subnational conflict dimensions. Using the
saveAppliedModelsOutput() function, these lists can be saved to .csv files where the file name
corresponds to the respective model. As an example, if applyModels() produces a named list of length
2 where the first element is the most recent data plus the forecasts from BaseModel, and the second
element the same with the forecasts from AllInteractions, then saveAppliedModelsOutput() will
produce two output csv files named “2015-02-12__AllInteractions__applied-to-historical.csv” and
“2015-02-12__BaseModel__applied-to-historical.csv” (note: “applied to historical” refers to the fact
that the output is the historical data set with predictions from the model for the historical data; the
other option is “applied-to-most-recent” indicating the forecasts for the most recent data). The results
can then further be analysed and visualised in Excel.
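A short sketch of applying the models, as described above; the argument names train.data and apply.data are taken from this document, while the object names and the exact call to saveAppliedModelsOutput() are simplified assumptions.

    # Apply the models: train on all historical data, predict for the most recent data
    applied <- applyModels(models,
                           train.data = historical.data,
                           apply.data = current.data)

    # Save one csv per model, e.g. "2015-02-12__BaseModel__applied-to-most-recent.csv"
    saveAppliedModelsOutput(applied)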
How to add new models
In order to add new models to the existing infrastructure, three things have to be done.
1. In the models.r file, a function has to be added that executes the statistical model(s),
calculates their performance metrics, and creates the predictions from the model. The overall
structure of this function is the same for all statistical models, and most of the code can be
copied from existing ones. Based on the input for the conflict.dimension argument, the
function should be usable for both national power and subnational predictions. For this
reason, the dependent variables for these two types should not be hardcoded into the model
specifications but instead be assigned depending on the input. The formula for the model
specification is currently created by the combination of the paste() and as.formula() functions.
See ?paste and ?as.formula for more information on how these work. Overall, keep as close as
possible to the structure of the existing model functions (a simplified skeleton of all three
steps follows this list).
2. When the new model function has been specified, a new function to calculate the predictions
for this new model has to be written in the functions.r file. Currently existing prediction
functions are for example computePredictionsGLMandLM() for the combination of a
generalized linear model (logistic model) with a standard linear model; another is
computePredictionsZEROINFLorHURDLE(), for zero-inflated negative binomial and hurdle
models respectively. These functions are usually just wrappers around predict() functions with
a few auxiliary tasks added (e.g. limiting the predictions to a maximum of 10 or minimum of
0), and in the case of two combined models, the combination of the two via a static or dynamic
threshold (e.g. if the first is a logistic model predicting probabilities, and the second is a linear
model predicting intensities, then the combination of the two means to assign a final
prediction value equal to the predicted intensity if and only if the predicted probability for
that country-year is above a certain threshold – and zero otherwise).
3. When the above two functions have been written, the last thing to do is to modify the
computePredictions() function to call the appropriate sub-functions for the particular types of
models passed to computePredictions(). To illustrate, computePredictions() checks the
classes/types of the statistical models passed to it, and if those are for example ‘glm’ and ‘lm’,
it calls computePredictionsGLMandLM(); if the first model instead is ‘zeroinfl’ and the second
model is ‘NULL’ (i.e. there was none), then it calls computePredictionsZEROINFLorHURDLE();
etc.
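The following is a simplified, illustrative skeleton of the three steps above. Function bodies, formula terms, and argument names are assumptions; the real functions in models.r and functions.r contain considerably more logic (metrics calculation, cross validation, etc.), and the computePredictionsGLMandLM() and computePredictions() shown here are stripped-down stand-ins for the existing functions, not their actual implementations.

    # 1. A new model function in models.r: build the formula from the conflict
    #    dimension instead of hardcoding the dependent variable
    NewModel <- function(data, conflict.dimension, threshold) {
      dep.prob <- paste0("Onset_", conflict.dimension)       # assumed naming
      dep.int  <- paste0("Intensity_", conflict.dimension)   # assumed naming
      f.prob <- as.formula(paste(dep.prob, "~ REG_P2 + YOUTHBBOTH"))
      f.int  <- as.formula(paste(dep.int,  "~ REG_P2 + YOUTHBBOTH"))
      m1 <- glm(f.prob, data = data, family = binomial)      # probability model
      m2 <- lm(f.int, data = data)                           # intensity model
      computePredictions(m1, m2, data = data, threshold = threshold)
    }

    # 2. A prediction function in functions.r: wrap predict() and combine the
    #    two models via the threshold
    computePredictionsGLMandLM <- function(model1, model2, data, threshold) {
      prob      <- predict(model1, newdata = data, type = "response")
      intensity <- predict(model2, newdata = data)
      intensity <- pmin(pmax(intensity, 0), 10)   # limit predictions to [0, 10]
      ifelse(prob > threshold, intensity, 0)      # intensity only above threshold
    }

    # 3. Dispatch in computePredictions() based on the classes of the models
    computePredictions <- function(model1, model2, data, threshold) {
      if (inherits(model1, "glm") && inherits(model2, "lm")) {
        computePredictionsGLMandLM(model1, model2, data, threshold)
      } else if (inherits(model1, "zeroinfl") && is.null(model2)) {
        computePredictionsZEROINFLorHURDLE(model1, NULL, data, threshold)
      } else {
        stop("No prediction function for this combination of models")
      }
    }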
See the reconstructed diagram in the next section for an illustration of the function hierarchy.
Function hierarchy and relationship diagram
(Reconstructed as text from the original diagram:)
1. ModelFunction() calls computePredictions().
2. computePredictions() calls one of the prediction functions, for example
   computePredictionsTwoLasso(), computePredictionsZEROINFLorHURDLE(), or
   computePredictionsGLMandLM().
2.5 If threshold == ‘dynamic’, determineThresholdDynamically() is called.
3. The predictions are passed as input to calculateMetrics().
4. calculateMetrics() calls determineOutcome().
5. determineOutcome() returns its results to calculateMetrics().
6. calculateMetrics() returns the results to ModelFunction().
Helpful websites/links
For a code style guide, I have mostly followed Google’s R Style Guide, found here:
https://google-styleguide.googlecode.com/svn/trunk/Rguide.xml
I did make some modifications, however. For example, instead of using UpperCamelCase I used
lowerCamelCase, and the model functions are not named according to the guide.
For statistics-related questions, Cross Validated (http://stats.stackexchange.com/) is a great resource.
For programming-related problems, Stack Overflow (http://stackoverflow.com/) is the go-to address.
Glossary