The Landscape Vegetation Inventory (LVI) Nearest Neighbour Analysis Process [1]

[1] Prepared by: Ian Moss PhD, Tesera Systems Inc., 619 Goldie Avenue, Victoria, BC V9B 6C1. Phone: 250.391.0975. Email: ian.moss@tesera.com. April 10, 2014.
1. Overview
These instructions were developed with reference to completing nearest neighbour assignments for the purpose of estimating missing Y-variables in a Target Dataset (which contains X- but not Y-variables). X-variables are selected from a Reference Dataset (which contains both X- and Y-variables) to best estimate combinations of Y-variables. In this particular version of the process, combinations of Y-variables are grouped into classes. The next step is to explore different numbers and combinations of variables for the purpose of explaining differences in classifications (based on the Y-variable set). Certain metrics are produced to help the analyst determine the best number and combination of variables to use, and what weights to use (if any). Having selected a preferred option, the chosen variables (and weights) are used to identify k = 1 to n nearest neighbours (and associated distances) for each of the observations in the Target Data (where the Y-variables are absent or missing). A Y-variable dataset is then generated for each observation in the Target dataset, using the k = 1 to n nearest neighbours and the associated Y-variables (in the Reference dataset).
The entire process may also be applied to the Reference Dataset and further analysis (Bias; Signed Relative Bias or SRB; Root Mean Squared Prediction Error or RMSPE) carried out to compare different numbers and combinations of variables; distance metrics (Euclidean, Mahalanobis, and a slightly modified Most Similar Neighbour or MSN); with or without distance weighting; with or without variable weighting; with or without variable transformations; with or without normalizing the data; and including or excluding the (k = 1) nearest neighbours.
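To make the final imputation step concrete, here is a minimal sketch in R, assuming refX and refY are the Reference X- and Y-variable matrices and tgtX is the Target X-variable matrix (same variables, same normalization); the function and object names are illustrative, not part of the toolset:

# k-nearest-neighbour imputation of Y-variables (illustrative sketch).
# refX: n_ref x p reference X-variables; refY: n_ref x q reference Y-variables;
# tgtX: n_tgt x p target X-variables.
imputeY <- function(refX, refY, tgtX, k = 5) {
  t(apply(tgtX, 1, function(x) {
    d  <- sqrt(colSums((t(refX) - x)^2))             # Euclidean distance to every reference obs
    nn <- order(d)[1:k]                              # the k nearest neighbours
    w  <- 1 / (d[nn] + 1e-6)                         # inverse-distance weights (optional)
    colSums(refY[nn, , drop = FALSE] * w) / sum(w)   # weighted mean of neighbour Y-values
  }))
}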
There is room for improvement in this process including:
a) Developing testing procedures that make use of an independent Training and Test data
set, wrapped up within a Monte Carlo process for extracting different subsets from the
Reference Data to represent Training and Test datasets. This process can be
implemented on a small scale by manually creating Training and Test datasets and
running these through the process.
b) Extending the process to make use of ensemble modeling, i.e. using multiple models to determine nearest neighbours, with heuristics to select the k nearest neighbours based on distance or by some kind of voting procedure (much like Random Forests).
c) Integrating other procedures like Random Forests into the overall framework.
d) There is some legacy code that requires refactoring to be fully compliant with the design
standards (discussed briefly below) for this system.
One of the key strengths of this process is the programming framework. To begin with, it makes use of both R and Python. R has tremendous depth in terms of the statistical packages available, but its data handling routines are not as flexible as Python's. Within the Python environment, extensive use is made of NumPy, which has been developed to manage matrix algebra and provides significant advantages in terms of processing speed. The integration of these two coding environments has been achieved by launching Python code from within R (which is relatively simple to do) and by communicating through input and output files using standard naming conventions. The modular approach to the development of this code, along with a few basic coding standards (use of configuration files; internal (program) and external (application) documentation; separation of routines designed to manage workflows from routines designed to carry out particular functions or analyses; use of a few standard data management files such as the data dictionary and the files used to manage variable selection), allows individuals to add to the framework. Individuals can do so by making use of existing code with a minimum of additional new code, and new code should be developed along similar lines to further enhance the toolset, continue the spirit of sharing, and save collective time (how many times does one have to write a read.csv statement in R and attach variable names?). The concept here is that the framework could continue to be built and shared with others in an open source environment. Of course there is a need for one (or more) gatekeeper(s) to ensure that the code remains reasonably well organized and that any new code does indeed meet the requirements. Ultimately the design of this system has also been established with the idea that it can be web-enabled.
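In its simplest form, the R-to-Python integration pattern looks like the following (the module and output file names here are illustrative of the convention only):

# R launches a Python module ...
system("E:\\Anaconda\\python.exe E:\\Rwd\\Python\\someModule.py")
# ... and then picks up the module's output file under the agreed naming convention.
result <- read.csv("E:\\Rwd\\SOMEOUTPUT.csv")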
2. The Process
2.1. Data preparation Y-Variable Dataset (Reference Data)
2.1.1. Natural log and log odds transformations
2.2. Data preparation X-Variable Dataset (Reference & Target Data)
2.2.1. Remote sensing data
2.2.2. Ancillary variables/datasets
2.2.2.1. Stage Terrain Variables (Slope, Aspect, Elevation):
/Rwd/Python/Config/stageTerrainConfig.txt
/Rwd/Python/ComputeStageAspectElevVariables.py
To join variables to original dataset:
/Rwd/Python/Config/innerJoinConfig.txt
/Rwd/Python/innerJoinLargeDataset.py
2.2.2.2. ClimateWNA Variables:
http://climatewna.com/
[Accessed April 3 2014]
2.2.2.3. Biogeoclimatic-Ecosystem Classification (BEC):
Terrestrial Ecosystem Mapping (TEM)
Predictive Ecosystem Mapping (PEM)
http://www.env.gov.bc.ca/ecology/tem/
[Accessed April 3 2014]
2.2.2.4. Existing Vegetation Resource Inventory (VRI) Data
2.2.2.5. UTM Coordinates and/or latitude and longitude
2.2.2.6. Terrain Indices [2]
[2] Wilson, J.P. 2012. Digital terrain modeling. Geomorphology 137:107-121; Hijmans, R.J. and van Etten, J. 2013. Package raster. http://cran.at.r-project.org/web/packages/raster/raster.pdf [Accessed April 3 2014]
2.2.3. About Config Files
2.2.3.1. Detailed descriptions for each data entry contained in:
/Rwd/Python/…ConfigInfo.txt
2.2.4. Natural log and log odds transformations:
2.2.4.1. Natural Log Transformations (A is an untransformed variable):
Recommended for HT (height, m), AGE (age, y), G (basal area, m2 ha-1), N (number of trees per hectare, ha-1), and V (volume, m3 ha-1).
A′ = ln(A + 1)
This transformation works as long as A is non-negative.
2.2.4.2. Log odds transformations (for proportions between 0 and 1):
Recommended for CC (crown closure expressed as a proportion rather than a percent).
if p = 0 then:
lodds = ln(0.01/0.99)
else:
if p = 1 then:
lodds = ln(0.99/0.01)
else:
lodds = ln(p/(1-p))
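These transformations are trivial to express in R; the following sketch is vectorized over numeric vectors A and p:

lnTransform <- function(A) log(A + 1)   # natural log transform; valid for A >= 0

logOdds <- function(p) {
  # log odds, with the substitutions for p = 0 and p = 1 given above
  ifelse(p == 0, log(0.01 / 0.99),
  ifelse(p == 1, log(0.99 / 0.01),
         log(p / (1 - p))))
}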
2.2.5. Data dictionaries standard format (*mandatory):
(May be required for quality control/error detection)
TVID*
TABLENAME*
VARNAME*
NEWVARNAME
VARTYPE*
VARDEFAULT
SELECT
DESCRIPTION
2.3. Fuzzy C-Means Classification
When applying this process for the purpose of selecting an X-Variable set to best explain differences in selected Y-Variables, it is recommended (as a general guideline) that 4 systems of classification be developed, each with 5 (4-6) classes. The next step (after 2.4 and 2.5) involves using these classifications in Multiple Discriminant Analysis (MDA) to select an X-Variable set.
2.3.1. Classifications:
SPECIES (proportions)
HEIGHT-AGE (log transformed and normalized)
STRUCTURE (live trees: G, N, CC, …; log transformed and normalized; log odds for CC)
DEAD (dead trees: G, N; log transformed and normalized)
Note that a system of site productivity classification (e.g. site index) could also be mentioned as a potential Y-variable representation, but in the LVI case it is included as part of the X-variable dataset.
2.3.2. Variable Selection:
2.3.2.1. XVARSELV1.csv:
TVID
VARNAME
VARSEL (Enter X, Y or N or …)
2.3.3. Fuzzy C-Means Algorithm:
/Rwd/Python/Config/cmeansConfig.txt
/Rwd/Python/lviClassification_v2.py
2.3.4. Outputs
xCLASS.csv - Assignment of classes to each observation
xCENTROID.csv - Centroid statistics
xMEMBERSHIP.csv - Class memberships for each observation
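For orientation, an equivalent fuzzy c-means run can be sketched in R with the e1071 package (the production implementation is the Python routine above; refData and the variable names are hypothetical):

library(e1071)                                # provides cmeans()

yMat <- scale(refData[, c("LNHT", "LNAGE")])  # hypothetical normalized Y-variable set
fcm  <- cmeans(yMat, centers = 5, m = 2)      # 5 classes, fuzzifier m = 2

head(fcm$cluster)                             # class assignments (cf. xCLASS.csv)
fcm$centers                                   # centroid statistics (cf. xCENTROID.csv)
head(fcm$membership)                          # class memberships (cf. xMEMBERSHIP.csv)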
2.4. Join Classes (xCLASS.csv) to (Reference) dataset and produce box plots:
2.4.1. Join datasets:
/Rwd/Python/Config/innerJoinConfig.txt
/Rwd/Python/innerJoinLargeDataset.py
2.4.2. Produce box plots
2.4.2.1. Indicate (Y) variables to be displayed with reference to (each) classification:
2.4.2.1.1. Fill out XVARSELV1.csv (for the Reference Dataset)
2.4.2.1.2. Run the R Boxplot Routine:
/Rwd/RScript/XCreateXorYVariableClassificationBoxPlots.R
Complete the definition of the user inputs at the top of the routine.
Run the script.
2.5. Variable Selection
The objective of this process is to identify the X-Variables that best explain the Y-Variables, where the Y-Variables have been recast as classifications in accordance with 2.3.
2.5.1. Select the X-Variables eligible for use (in both the Reference and Target data) in
MDA using the XVARSELV1.csv file format: Enter an X in the VARSEL column if
eligible; otherwise enter Y or N.
2.5.2. Make a copy of XVARSELV1.csv; the next R script will make changes to it.
2.5.3. Using the Reference dataset, run the variable selection procedure:
/Rwd/RScript/XIterativeVarSelCorVarElimination.R
Complete review and entry of user inputs.
Run the script.
See Appendix A for details regarding this procedure.
2.6. Run linear discriminant analysis for multiple X-variable datasets.
/Rwd/RScript/XRunLinearDiscriminantAnalysisForMultipleXVariableSets.R
The results of this process are used by the analyst to select a (set of) discriminant function(s) with n variables to determine similarities and differences amongst the observations with respect to one (or more) systems of classification. In most cases the number of variables to be included in a given set of discriminant functions (for a given classification) will be between 5 and 10. The goal is to find the largest KHAT statistic (Appendix B, Step 12) associated with a relatively small number of variables (the latter designed to reduce the potential for overfitting). In addition, when looking at the variances (BTWTWCR4 in BWRATIO.csv; Appendix B, Step 11) explained in association with each classification, a better solution is one where the proportion of variance explained is spread across a range of discriminant functions, rather than heavily weighted toward the first axis, and where the total variance explained across all of the functions is greater.
Lastly it is useful to also review the variables included in the equations, and to look at
the selected parameter relationships to distinguish amongst the classes. This can be
obtained as follows:
2.6.1. Create X- or Y-variable classification box plots
/Rwd/RScript/XCreateXorYVariableClassificationBoxPlots.R
2.6.1.1. Input file name (lviFileName); e.g.
lviFileName <- "E:\\Rwd\\QNREFFIN10.csv"
2.6.1.2. Input excludeRowValue and excludeRowVarName (in the lviFileName); e.g.
excludeRowValue = -1
excludeRowVarName <- 'SORTGRP'
2.6.1.3. Input the variable (in the lviFileName) containing the classes assigned to each observation; e.g.
classVariableName <- 'CLASS5'
2.6.1.4. Input the name of the file containing the X or Y variable name selections; e.g.
xVarSelectFileName <- "E:\\Rwd\\XVARSELV1_BoxPlot.csv"
Note that this is the standard format XVARSELV1.csv with a filename suffix
(_BoxPlot or some other user defined suffix) before the .csv to indicate that this
is to be used for this purpose only.
2.6.1.5. Enter the variable selection criteria (“X” or “Y” or some other criteria); e.g.
xVarSelectCriteria <- "Y"
The routine will produce box plots for viewing only, where the box plots describe the range of data associated with each of the classes listed under the classVariableName. This can help the analyst understand the influence of each indicator in explaining similarities and differences amongst the classes and, as a result, provides a more intuitive understanding of what the discriminant functions are doing.
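At its core the routine reduces to one boxplot() call per selected variable; dat, CLASS5, and LNHT below are hypothetical stand-ins:

boxplot(LNHT ~ CLASS5, data = dat,
        xlab = "Class", ylab = "LNHT",
        main = "LNHT by class")   # one panel per selected variable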
2.7. Select preferred discriminant functions (VARSET) for each classification (including or
excluding dead tree attributes).
Appendix A. The Variable Selection Process
/Rwd/RScript/XIterativeVarSelCorVarElimination.R
This routine requires some explanation because it has three layers with the top layer initializing and
managing an iterative routine, the second layer managing the main routine including running of
both Python and R scripts, and the bottom layer containing the individual scripts that are to be
activated as directed within the second layer.
There are a few restrictions on the current code that should be specified in advance:
1. Certain file names are identified as being flexible in naming and location convention, but as a result of prior decisions it is suggested that these names not be changed.
2. Similarly, all files that are held in common between the R and Python routines should keep the same names and the same location, i.e. in the top of the "\\Rwd" directory.
3. Flexibility has been built into the process in anticipation of future developments.
The following is a more detailed description of what this routine is doing.
TOP LAYER
User inputs
a) The name of the file containing the data:
lviFileName <- "E:\\Rwd\\fileName.csv"

- The user enters information to the right of <- (an equals sign, "=", in R).
- Always include the drive letter followed by a colon (best practice).
- Always use double quotes when specifying the locations of files.
- Always use double backslashes to indicate a change in directory.
- Filenames include the full path of where the file is located plus the file extension.
- The file must be comma delimited.
b) User may choose to exclude certain rows (observations) according to a criterion:
excludeRowValue <- 1
excludeRowVarName <- 'SORTGRP'

- The user enters information to the right of <- (this won't be repeated again).
- If the file does not have a column labeled with the excludeRowVarName, or there is no value equal to the excludeRowValue, then nothing will be excluded.
- This program only allows simple exclusions (according to one variable name and one value); more complicated exclusions should be done in preprocessing of the data if users are going to use this script.
c) Indicate the classification (Y) variable to be used in the Discriminant Analysis
classVariableName <- 'aClassification'

- The classes are assigned to each observation according to those listed under a column named 'aClassification'.
- The classes may be numeric or character. R requires that numeric assignments be declared as factors; this declaration is always applied to the variable name entered here.
d) Identify the file containing the (X) variables to be selected for inclusion in the analysis.
xVarSelectFileName <- "E:\\Rwd\\XVARSELV1.csv"
xVarSelectCriteria <- "X"

- This is a standard variable selection file used in many different analyses and workflows.
- This document refers to the file as XVARSELV1.csv; although the file name can in principle be changed, for this specific routine it should not be.
- The file must contain the following column names: TVID, VARNAME, VARSEL.
- TVID is a unique id assigned to each row; it must be an integer.
- VARNAME must be a variable name that is also contained in the data file.
- Not all of the variable names in the data file must be included in XVARSELV1.csv, but any variable to be selected for analysis must be included here.
- Under VARSEL, use an X to indicate variables selected as X-Variables and a Y to indicate variables selected as Y-Variables (generally true, but not applicable in this case); otherwise use an N to indicate that the variable is not selected.
e) Set the minimum and maximum number of variables to select.
minNvar <- 1
maxNVar <- 20

- The variable selection routine will explore different numbers of variables ranging from the minimum, minNvar, to the maximum, maxNVar, as defined by the user (both must be integers).
- It is generally recommended that minNvar always be set to 1, since this provides a benchmark for the single most important variable for explaining most of the differences within a group.
- Generally speaking, 20 variables will be too many, in part because as the number of variables increases there is a danger of overfitting (i.e. variables appear to add some explanatory power when in reality they add nothing and may in fact produce worse results; see http://en.wikipedia.org/wiki/Overfitting).

f) Set the number of solutions to be generated for each number of variables.
nSolutions <- 10

- For each number of variables between minNvar and maxNVar, generate a number of variable selections equal to nSolutions.
- The routine uses the ldaHmat routine in the R subselect package. This routine selects a specified number of variables at random and then starts substituting other variables, also at random, for those that already exist. If a substitution indicates a gain according to a given criterion (see below), the new variable replaces the old one; the old one is then precluded from further inclusion thereafter.
- With only 1 variable the tendency is of course to produce the same result for all nSolutions, but as the number of eligible variables to be included in the model increases, so too does the number of different combinations of variables.
g) Set the criterion by which a given variable will be identified as being preferred over another, and therefore determine when there is no further gain to be found.
criteria <- "xi2"

- There are 4 different criteria by which variables may be selected. The current process only allows 1 criterion to be included at a time. Each criterion is identified in quotation marks, as indicated above. The chi-squared criterion was adopted as the preferred choice after some early testing, but this could be further explored; there would be nothing wrong with running this routine repeatedly, starting with different criteria, and then combining the results. The different criteria are as follows:
o "ccr12": maximize Roy's first root statistic (i.e. the largest eigenvalue of HE^(-1), where H is the effects matrix and E the error (residual) matrix).
o "wilks": minimize Wilks' lambda, where lambda = det(E)/det(T), E is the error matrix, and T is the total variance matrix.
o "xi2": maximize the chi-squared (xi2) index based on the Bartlett-Pillai trace test statistic.
o "zeta2": maximize the zeta2 coefficient, where V = trace(HE^(-1)) and zeta2 = V/(V+r), and where r is the rank.
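For reference, the following sketch shows how these inputs plausibly map onto the subselect calls; xMat is the selected X-variable matrix and classVar the factor of class assignments, and the exact arguments used inside the Z-scripts may differ:

library(subselect)

hm  <- ldaHmat(xMat, classVar)                 # effects (H) and total matrices for LDA
sol <- improve(hm$mat, kmin = minNvar, kmax = maxNVar,
               nsol = nSolutions, criterion = criteria,
               H = hm$H, r = hm$r)             # random-start improvement search
sol$subsets                                    # the selected variable subsets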
h) Enter the unique variable selection file name and path.
uniqueVarPath <- "E:\\Rwd\\UNIQUEVAR.csv"

- This file path and name should not be changed; it is included here only in anticipation that this restriction will someday be lifted.
- The file is used to identify the unique variable names amongst all of the variable combinations, and it is one of those files (like XVARSELV1.csv) that is referred to in both the Python and R environments.

i) Enter the name of the correlation matrix print file.
printFileName <- "UCORCOEF.csv"

- This is the file that contains the Pearson correlation coefficients amongst all pairs of variables contained within UNIQUEVAR.csv.
- This file name should not be changed.
- Note that the file path is not included; the file will be written to the current working directory (see below).

j) Enter the file containing a count of the number of eligible X-variables in XVARSELV1.csv.
xVarCountFileName <- "E:\\Rwd\\XVARSELV1_XCOUNT.csv"

- The use of this file is explained below. The file will not exist at the start of the run; it is initiated as part of the process, and the file name should not change.
k) Set working directory.
wd <- "E:\\Rwd"

- The working directory is where R expects to read files (unless otherwise directed) and to write files. Note that there are no double backslashes, "\\", at the end in this case. Where files are used by both Python and R, they should generally be in the top of the Rwd directory; files that are only read into R, such as the data file, can be located anywhere. The reason for this goes back to the original design of the process integrating R and Python code, which started with the idea that all input and output files would be placed in a common directory and that interactions between the Python and R code would be based on a standard set of filenames. Recent developments have been moving toward a more flexible input-output naming and location convention, but this has not been fully implemented, so the recommendation is to maintain the older standard and design rules until such time as there is a change.
Begin Processing
Step 1. Initialize the XVARSELV1_XCOUNT.csv file
This simply counts the number of X-variables in the XVARSELV1.csv file. The count is needed to stop the routine when there are no more variables to be removed as a result of high correlations. The file consists of only 1 number. A Python module (COUNT_XVAR_IN_XVARSELV1.py) is called to do this.
Step 2. Read XVARSELV1_XCOUNT.csv
This step involves reading XVARSELV1_XCOUNT.csv to initialize the count of X-variables within the R programming environment.
Step 3. (A) Initialize the variable selection process
This process calls another set of R scripts (Middle Layer):
/Rwd/RScript/ZCompleteVariableSelectionPlusRemoveCorrelationVariables.r
This routine is described in more detail below. It runs through the entire variable selection
process exploring solutions with a number of variables ranging from minNvar to maxNvar, and
with nSolutions generated for each number of variables. The criteria used to select the best
variable set is also defined under the User Input section. Once the task is completed the results
are written to a file: XVARSELV.csv; a unique list of variables is extracted and written to
UNIQUEVAR.csv (or as defined by users), the variables are ranked according to their relative
importance and written to VARRANK.csv; the Pearson correlation coefficients are calculated for
each pair of variables in UNIQUEVAR.csv; where the correlations are ≥ 0.8 or ≤ -0.8, the variable within each pair that is of least importance (VARRANK.csv) is deselected from XVARSELV1.csv as being eligible for selection the next time around; and XVARSELV1_XCOUNT.csv is updated with the new X-variable count.
Step 3. (B) Bring the newly updated X-variable count into the R environment.
Step 4. Iterate steps 3(A) and 3(B).
Stop when there are no more changes in the number of selected X-variables indicated in XVARSELV1_XCOUNT.csv; at that point the correlations amongst the unique set of variables are all < 0.8 and > -0.8.
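In outline, the top layer's iteration logic amounts to the following (file handling simplified; the real scripts are listed above):

system("E:\\Anaconda\\python.exe E:\\Rwd\\Python\\COUNT_XVAR_IN_XVARSELV1.py")
xCount <- scan("E:\\Rwd\\XVARSELV1_XCOUNT.csv")   # the file holds a single number

repeat {
  source("E:\\Rwd\\RScript\\ZCompleteVariableSelectionPlusRemoveCorrelationVariables.r")
  newCount <- scan("E:\\Rwd\\XVARSELV1_XCOUNT.csv")
  if (newCount == xCount) break                   # no further correlated variables removed
  xCount <- newCount
}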
MIDDLE LAYER
/Rwd/RScript/ZCompleteVariableSelectionPlusRemoveCorrelationVariables.R
This is a layer that the user does not generally have to see unless there are certain errors that are hard to detect. All of the user inputs to this layer are set in the top layer. However, if the process fails to run to completion, it can be helpful to open and run this script in order to reveal more about the probable source of the error. By running the top layer, the user inputs will already have been initiated; beyond that point, errors are more easily isolated to specific steps by (re)running the middle layer. Diagnostics will then indicate where the failure occurred according to each of the steps highlighted below, and if necessary the individual R scripts or Python modules can be run in an interpreter to further diagnose the cause of the error. Generally the error will be associated with an issue relating to the data or user-defined inputs rather than the programming, as this code has been tested and used on many different machines (mostly in a Windows 7 operating environment).
As an additional point, routines prefaced by Z are scripts that are called by other scripts; they should not be altered unless they are also going to be renamed and incorporated into a new workflow process (outside of the routine currently under discussion).
Step 1. Load dataset (as assigned to lviFileName under User Inputs identified above)
source("E:\\Rwd\\RScript\\ZLoadDatasetAndAttachVariableNames.r")
This middle layer is formulated almost entirely of source (or system) commands that call specific R scripts (or Python modules, in the case of the system command). This script loads the dataset into R and "attaches" the variable names in the first row to the corresponding columns. It is generally recommended that the first column consist of unique (primary) identification numbers or names assigned to each row (observation).
Step 2. Exclude rows with excludeRowValue associated with excludeRowVarName
source("E:\\Rwd\\RScript\\ZExcludeRowsWithCertainVariableValues.r")
Step 3. Declare classification variable as Factor (as assigned to classVariableName under User Inputs)
source("E:\\Rwd\\RScript\\ZDeclareClassificationVariableAsFactor.r")
Step 4. Complete variable selection (xVarSelectFileName; XVARSELV1.csv) according to criteria (xVarSelectCriteria)
source("E:\\Rwd\\RScript\\ZSelectXVariableSubset_v1.r")
Step 5. Remove variables with 0 standard deviation
source("E:\\Rwd\\RScript\\ZRemoveVariablesWithZeroStandardDeviation.r")
Step 6. Load the subselect R package
source("E:\\Rwd\\RScript\\Loadsubselect-R-Package.r")
This is the R package containing the key routines used to do the variable selection. There are a
number of routines available to do variable selection in this package, all of which are loaded into
memory when the package is called as in this case.
Step 7. Run the subselect linear discriminant analysis
source("E:\\Rwd\\RScript\\RunLinearDiscriminantAnalysis_subselect_ldaHmat.r")
This routine initializes linear discriminant analysis using the selected X-Variable data from lviFileName and the class assignments made according to the classVariableName. The class assignments form the Y-variable in the multiple discriminant analysis.
Step 8. Run the subselect linear discriminant analysis variable improvement routine
source("E:\\Rwd\\RScript\\ZRun-ldaHmat-VariableSelection-Improve.r")
This routine implements the variable selection procedure.
Step 9. Extract the variable name subsets from the variable selection process
source("E:\\Rwd\\RScript\\ExtractVariableNameSubsets.r")
This script creates the file for writing to VARSELECT.csv in the next step.
Step 10. Overwrite VARSELECT.csv
source("E:\\Rwd\\RScript\\ZWriteVariableSelectFileToCsvFile.r")
Note that if VARSELECT.csv already exists in the R working directory (/Rwd/) then it will be
overwritten. The VARSELECT.csv contains the results of the variable selection as follows:
UID - A unique id associated with each row.
MODELID - The model number represents a particular solution, defined in terms of a certain number and kind of variables selected. A particular solution in terms of the number and kinds of variables may occur more than once and therefore be represented under more than 1 MODELID.
SOLTYPE - The number of variables associated with each MODELID.
SOLNUM - Runs from 1 to nSolutions (user defined; see Top Layer f) for each solution type to which a unique MODELID has been assigned.
KVAR - The variable number within a given MODELID.
VARNUM - The variable number identified as being in the equation by the subselect R package; it relates to the location of the variable in the dataframe.
VARNAME - The variable name associated with VARNUM.
Step 11. Run python code EXTRACT_RVARIABLE_COMBOS_v2.py
system("E:\\Anaconda\\python.exe E:\\Rwd\\Python\\EXTRACT_RVARIABLE_COMBOS_v2.py")
This module identifies the unique variable sets and organizes them into a new file, VARSELV.csv; the purpose of this file is to allow all of the unique variable selection models to be processed using discriminant analysis for comparison purposes. Another file, UNIQUEVAR.csv, provides a list of all of the variables referred to in VARSELV.csv.
Note that this is the first example of the use of a system command to call a program written in another language. It is critical that the full path of the interpreter be identified at the start of the statement, followed by the full path of the Python (or other kind of) routine. Note the use of double backslashes and of double quotation marks, not single quotation marks. These rules can be violated and may still work on any given machine, but it has been found that they are needed to ensure that the program works properly on most (if not all) machines.
Step 12. Run python code RANKVAR.py
system("E:\\Anaconda\\python.exe E:\\Rwd\\Python\\RANKVAR.py")
VARSELECT.csv (output from Step 10) is used as input to this module. The module ranks variables based on their position, i.e. where they first appear in the process, starting from the user-defined (Top Layer e) minNvar and moving toward variable selections with an increasing number of variables up to maxNVar. The results are printed to VARRANK.csv. Note that if the file already exists in the working directory (/Rwd/) then it will be overwritten. VARRANK.csv contains the following fields:
VARNAME - The name of each unique variable identified in all of the model runs. Note that this is somewhat redundant with the same listing of variables in UNIQUEVAR.csv; this routine was developed at a later stage and performs a different function.
RANK - The rank given by the number of variables in the model in which the variable first appears amongst all of the models. The lowest number of variables is recognized as the highest rank, e.g. 1 is ranked higher than 2.
p - The number of models that contain a given variable divided by the total number of variables.
IMPORTANCE - Equal to SQRT(1/RANK * SQRT(p)), ranging between 0 and 1, with a high number indicating a high importance.
The highest ranked variables also tend to occur more frequently and are therefore of greatest importance, but as the number of variables increases this relationship starts to break down. This breakdown is related to the notion of overfitting, reinforcing the idea that a smaller number of X-variables may provide the most consistent and reliable indicators of the class to which a given observation should belong.
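As a rough sketch of what RANKVAR.py computes (the Python module is authoritative; RANK is read here as the smallest model size in which a variable first appears, and the denominator of p follows the definition above):

vs <- read.csv("E:\\Rwd\\VARSELECT.csv")

rank <- tapply(vs$SOLTYPE, vs$VARNAME, min)       # model size at first appearance
nMod <- tapply(vs$MODELID, vs$VARNAME,
               function(m) length(unique(m)))     # models containing each variable
p    <- nMod / length(unique(vs$VARNAME))         # per the definition of p above
importance <- sqrt((1 / rank) * sqrt(p))          # IMPORTANCE as defined above

varrank <- data.frame(VARNAME = names(rank), RANK = rank, p = p, IMPORTANCE = importance)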
Step 13. Reload LVI dataset
source("E:\\Rwd\\RScript\\ZLoadDatasetAndAttachVariableNames.r")
This is a repeat of step 1 in the Middle Layer. This may not be necessary but is designed to
ensure that all of the data is available to (re)select the unique variables listed in UNIQUEVAR.csv
(see steps below).
Step 14. Exclude rows with excludeRowValue associated with excludeRowVarName
source("E:\\Rwd\\RScript\\ZExcludeRowsWithCertainVariableValues.r")
Repeat of step 2 above.
Step 15. Select the variable set as identified in UNIQUEVAR.csv
source("E:\\Rwd\\RScript\\ZSelectUniqueXVariableSubset.r")
This is a new step using UNIQUEVAR.csv instead of XVARSELV1.csv to do the variable selection.
Step 16. Generate Pearson correlations amongst each and every pair of variables
source("E:\\Rwd\\RScript\\CompileUniqueXVariableCorrelationMatrixSubset.r")
Step 17. Create a correlation matrix for printing
source("E:\\Rwd\\RScript\\CreateUniqueVarCorrelationMatrixFileForPrinting.r")
Step 18. Write correlation matrix (to current working directory)
source("E:\\Rwd\\RScript\\ZWriteUniqueVarCorrelationMatrix.r")
This routine creates a new file UCORCOEF.csv. The following fields are created:
VARNAME1 - The first variable name.
VARNAME2 - The second variable name. Note that the first variable name can be the same as the second variable name. When eliminating variables with high correlations (≥ 0.8 or ≤ -0.8; implemented in the steps below), the cases where VARNAME1 and VARNAME2 are the same are ignored.
CORCOEF - The Pearson correlation coefficient.
When the program has finished iterating to arrive at a final solution, there should be no correlation coefficients ≥ 0.8 (except those where VARNAME1 = VARNAME2) and no correlation coefficients ≤ -0.8. As a result, one way to verify whether or not the process has run to completion is to check this file to ensure that this condition has been met. If the condition has not been met, then some problem has been encountered that stopped the process from completing.
Step 19. Update XVARSELV1.csv to remove least important variables from variable pairs that exceed
the correlation coefficient limits.
system("E:\\Anaconda\\python.exe
E:\\Rwd\\Python\\REMOVE_HIGHCORVAR_FROM_XVARSELV.py")
This routine uses UCORCOEF.csv, VARRANK.csv, and XVARSELV1.csv as inputs. XVARSELV1.csv is modified and then replaced by a new file with the same name. The process involves checking for the condition where correlation coefficients are ≥ 0.8 (excluding the cases where VARNAME1 and VARNAME2 are the same) or ≤ -0.8; where one of these conditions holds true, it then checks the IMPORTANCE of each variable and changes the least important variable's assignment from an X to an N, effectively deselecting the variable.
Step 20. Count the number of X-Variables remaining in XVARSELV1.csv
system("E:\\Anaconda\\python.exe E:\\Rwd\\Python\\COUNT_XVAR_IN_XVARSELV1.py")
This routine counts the number of variables with an X assigned and writes the number to XVARSELV1_XCOUNT.csv. In the step that follows, this number is picked up to determine whether there has been any change from the previous run, as a result of having found (once again, more) X-variables that were highly correlated at the end of the current run. If there is no change between the previous and current runs, then the iterations are stopped in the TOP LAYER.
BOTTOM LAYER
The bottom layer is described in sufficient detail under the middle layer. If you want to know more, please refer to the individual pieces of underlying code that make up each of the steps in the middle and top layers.
Appendix B. Discriminant Analysis for Multiple Variable Sets
/Rwd/RScript/XRunLinearDiscriminantAnalysisForMultipleXVariableSets.R
TOP LAYER
User inputs
a) Identify the reference dataset:
lviFileName <- "E:\\Rwd\\interp_c2.csv"
b) Exclude certain rows from the dataset:
excludeRowValue = -1
excludeRowVarName <- 'SORTGRP'

- If there is a variable name that corresponds with the excludeRowVarName ('SORTGRP') and there are 1 or more rows where that variable is equal to the excludeRowValue (-1), then those rows will be excluded from the analysis.
c) Enter the name of the classification variable containing the class assigned to each observation:
classVariableName <- 'CDEADL5'
d) Enter 'UNIFORM' or 'SAMPLE' to indicate the type of prior distribution:
priorDistribution <- 'SAMPLE'

- Note that the 'SAMPLE' distribution is usually indicated when the data represents a random sample from the population. A 'UNIFORM' distribution is used to indicate that the distribution of responses within the population is unknown. The prior distribution is used to calculate a weighted covariance matrix. Another way to think of this is that the prior distribution determines the relative weight assigned to each class in the process of developing the discriminant functions.
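The two options amount to the following (dat and the class column are hypothetical stand-ins):

k <- nlevels(dat$CLASS5)
priorVec <- if (priorDistribution == 'SAMPLE') {
  as.numeric(table(dat$CLASS5)) / nrow(dat)   # 'SAMPLE': class frequencies in the data
} else {
  rep(1 / k, k)                               # 'UNIFORM': equal weight for every class
}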
e) Indicate the file name used to select the various sets of X-Variables.
xVarSelectFileName <- "E:\\Rwd\\XVARSELV.csv"
This file was an output from 2.5.3.
f) Set working directory.
wd <- "E:\\Rwd"
Begin Processing
Step 0. Set working directory
setwd("E:\\Rwd")
Step 1. Load dataset called lvinew
source("E:\\Rwd\\RScript\\ZLoadDatasetAndAttachVariableNames.r")
Step 2. Exclude rows with excludeRowValue associated with excludeRowVarName
source("E:\\Rwd\\RScript\\ZExcludeRowsWithCertainVariableValues.r")
Step 3. Declare classification variable as Factor (i.e. identify variable as classification variable)
source("E:\\Rwd\\RScript\\ZDeclareClassificationVariableAsFactor.r")
Step 4. Get prior distribution to be used in Discriminant Analysis
source("E:\\Rwd\\RScript\\ZComputePriorClassProbabilityDistribution.r")
Step 5. Write prior distribution to file
source("E:\\Rwd\\RScript\\WritePriorDistributionToFile.r")
Step 6. Load MASS package
source("E:\\Rwd\\RScript\\LoadMASS-R-Package.r")
Step 7. Get the multiple variable subsets XVARSELV.csv
source("E:\\Rwd\\RScript\\ZSelectXVariableSubset_v2.1.r")
Step 8. Run Multiple Discriminant Analysis - Take One - Leave One
source("E:\\Rwd\\RScript\\RunMultipleLinearDiscriminantAnalysis_MASS_lda_TakeOneLeaveO
ne.r")
Step 9. Write two new files to the working directory - CTABULATION.csv - POSTERIOR.csv
source("E:\\Rwd\\RScript\\WriteMultipleLinearDiscriminantAnalysis_MASS_lda_TOLO_to_File.r
")
CTABULATION.csv is a cross tabulation of the reference data class versus the predicted class, produced as a result of a take-one-leave-one analysis.

VARSET - Refers to VARSET in XVARSELV.csv.
REFCLASS - The class as originally identified in the reference data.
PREDCLASS - The predicted class.
CTAB - The number of observations associated with the REFCLASS-PREDCLASS combination.
POSTERIOR.csv reports the UERROR associated with the take-one-leave-one routine.

VARSET - Refers to VARSET in XVARSELV.csv.
NVAR - The number of X-variables associated with the variable set.
UERROR - The posterior error of estimation, calculated following the procedures of Hora and Wilcox (1982 [12]; Equation 9):

\hat{e} = 1 - N^{-1} \sum_{i=1}^{N} \max\left[ P(Y \mid X_i) \right]    (Eq. 1)

where \hat{e} is the estimated (posterior) error, N is the total number of observations, and P(Y | X_i) is the probability of class Y, where Y is equal to 1 to m classes, given a set of variables X_i, where i equals 1 to n observations.
Hora and Wilcox (1982, pp. 57-58) explain the calculation of this error as follows:
"… A superior alternative and one that is most frequently mentioned in the marketing literature (Crask and Perreault 1977; Dillon 1979) is the U-method or hold-one-out method. The ERE (error rate estimator) with the U-method is calculated by reserving one observation from the training sample, calculating the classification rule without the reserved observation, and then classifying the reserved observation and noting whether the classification is correct. This process is repeated for each available observation. A very small amount of bias is introduced because the classification rule is calculated with n-1 rather than n observations. For large samples the ERE is nearly unbiased."
"The posterior probability ERE for the jth population is simply 1 minus the average posterior probabilities of all observations assigned to the jth population by the discriminant function."
Note that:

\hat{e}_j = 1 - (N p_j)^{-1} \sum_{i=1}^{N} Y_{ij}    (Eq. 2)

where p_j is the a priori probability of belonging to population j, and Y_{ij} equals the probability that the observation belongs to class j given X_i.
[12] Hora, S.C. and Wilcox, W.B. 1982. Estimation of error rates in several-population discriminant analysis. Journal of Marketing Research 19(1):57-61.
And:

\hat{e} = \sum_{j=1}^{m} p_j \hat{e}_j    (Eq. 3)
Hora and Wilcox (1982) point out that while the formula does not explicitly include consideration of the a priori probabilities (p_j), this is implied by equations 2 and 3, which were used to derive equation 1. It is assumed that Equation 1 is implemented using the normal distribution and the variances associated with each class, based on the class assignments derived from the take-one-leave-one process. In general, as the number of variables increases the UERROR decreases; this is the opposite of Cohen's (1960) Coefficient of Agreement (also referred to as the KHAT statistic, described in more detail below).
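In MASS this corresponds to running lda() with CV = TRUE (leave-one-out) and applying Eq. 1 to the resulting posteriors; the formula and data below are hypothetical stand-ins:

library(MASS)

fit <- lda(CLASS5 ~ LNHT + LNAGE + LNG, data = dat,
           prior = priorVec, CV = TRUE)            # take-one-leave-one posteriors
uerror <- 1 - mean(apply(fit$posterior, 1, max))   # Eq. 1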
Step 10. Run Multiple Discriminant Analysis - All Data
source("E:\\Rwd\\RScript\\RunMultipleLinearDiscriminantAnalysis_MASS_lda.r")
Step 11. Write files - CTABALL.csv VARMEANS.csv DFUNCT.csv BWRATIO.csv
source("E:\\Rwd\\RScript\\WriteMultipleLinearDiscriminantAnalysis_MASS_lda.r")
CTABALL contains the contingency table data from which Cohen's (1960) Coefficient of Agreement can be calculated.

VARSET - Refers to the variable sets in XVARSELV.
REFCLASS - Refers to the reference or actual class.
PREDCLASS - Contains the predicted class.
CTAB - Provides a cross tabulation indicating the number of observations in the associated combination of REFCLASS and PREDCLASS.
VARMEANS contains the mean values for each variable by class.
VARSET2 - Refers to VARSET in XVARSELV.
CLASS2 - The class as originally assigned (not predicted).
VARNAMES2 - Refers to the variable names in the original dataset, as selected for inclusion in the X-variable set.
MEANS2 - The mean value associated with each variable (VARNAMES2) and class (CLASS2) combination.
Note that the same variable may occur in several variable sets – producing redundancies.
DFUNCT contains the discriminant functions for each axis and combination of variables.
VARSET3 - Refers to VARSET in XVARSELV.
VARNAMES3 - Refers to the variable names in the original dataset, as selected for inclusion in the X-variable set.
FUNCLABEL3 - Refers to the discriminant functions in order of priority (ranked from highest to lowest eigenvalue): LN1, LN2, …, up to n-1 functions, where n is equal to the number of classes or the number of variables selected for inclusion in an equation, whichever is the lesser of the two.
BWRATIO contains the between-to-within variance ratios of the differences in class Z-statistics associated with each discriminant function.

VARSET4 - Refers to VARSET in XVARSELV.
FUNCLABEL4 - Identical to FUNCLABEL3 in DFUNCT.
BTWTWCR4 - The ratios of between- to within-group variance explained by each discriminant function (FUNCLABEL4). These can be converted to proportions by dividing each ratio by the sum of all ratios for each variable set [13]. They are similar to, but not the same as, eigenvectors [14].
Step 12. Generate Cohen's (1960) Coefficient of Agreement (KHAT statistic [15, 16])
system("E:\\Anaconda\\python.exe E:\\Rwd\\Python\\COHENS_KHAT_R.py")

\mathrm{KHAT} = \frac{N_{..} \sum_{i=1}^{n} N_{ij=i} - \sum_{i=1}^{n} R_i C_{j=i}}{N_{..}^{2} - \sum_{i=1}^{n} R_i C_{j=i}}    (Eq. 4)

where:
N.. is the total number of observations across all categories.
Nij is the total number of observations assigned to (user) class i and (producer) class j.
Nij=i is the number of observations assigned to the same class (j = i) by both the user and the producer, i.e. along the diagonal.
Ri is the total number of observations assigned to (user) class i (in rows), i = 1 to n.
Cj is the total number of observations assigned to (producer) class j (in columns), j = 1 to m.
Note that the user in this case refers to the original class assigned to a given observation, and the producer refers to the class estimated by way of discriminant analysis using a take-one-leave-one type of process.
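Equation 4 can be checked directly from a REFCLASS-by-PREDCLASS contingency table, for example one rebuilt from CTABULATION.csv with xtabs(); the helper below is illustrative:

khat <- function(ctab) {
  # ctab: square contingency table; rows = user (reference), cols = producer (predicted)
  N  <- sum(ctab)
  Ri <- rowSums(ctab)
  Cj <- colSums(ctab)
  (N * sum(diag(ctab)) - sum(Ri * Cj)) / (N^2 - sum(Ri * Cj))
}

# e.g. khat(xtabs(CTAB ~ REFCLASS + PREDCLASS, data = read.csv("CTABULATION.csv")))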
CTABSUM.csv contains the following:

VARSET - Refers to VARSET in XVARSELV.
OA - Overall accuracy (the proportion of observations where users and producers agree on the classification).
KHAT - Overall accuracy minus the proportion of observations that could have been agreed to by chance.
MINPA - Minimum producer accuracy.
MAXPA - Maximum producer accuracy.
MINUA - Minimum user accuracy.
MAXUA - Maximum user accuracy.

[13] http://www.r-bloggers.com/computing-and-visualizing-lda-in-r/ [Accessed April 23 2014]
[14] http://en.wikipedia.org/wiki/Singular_value_decomposition [Accessed April 23 2014]
[15] Cohen, J. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20(1):37-46.
[16] Paine, D.P. and Kiser, J.D. 2003. Aerial Photography and Image Interpretation. Second Edition. John Wiley and Sons, Hoboken, NJ, US. Pp. 465-480.
Step 13. Combine evaluation datasets into one file: ASSESS.csv
system("E:\\Anaconda\\python.exe
E:\\Rwd\\Python\\COMBINE_EVALUATION_DATASETS_R.py")
ASSESS.csv contains information assembled from CTABSUM.csv; POSTERIOR.csv; and
XVARSELV.csv