Read Me

TPRTI_A and TPRTI_B

A-GENERAL DESCRIPTION

TPRTI_A and TPRTI_B are currently implemented as Microsoft C# Windows Forms applications.

This version was implemented without real concern for speed, which is why the program interfaces with R. R is used to sort the dataset and to fit a linear regression to each node; the results are passed back to C# via temporary files. While the application runs, temporary files are stored in various folders, which are described in Section B. The following tutorial explains how to interface C# with R: http://joachimvandenbogaert.wordpress.com/2009/03/26/r-and-c-onwindows/

A node is represented by a class called Node.

A mean point is represented by a “struct” data structure.

A turning point is a mean point whose angle represents a “sharp turn”. Sharp turns are detected by computing the cosine of the angle between two consecutive vectors. Vectors are formed by consecutive mean points.
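The cosine test described above can be sketched as follows. This is only an illustration, not the program's code: the MeanPoint struct, the TurningPoints class and their members are hypothetical names, and the sketch assumes two-dimensional mean points (one input attribute and the output). A mean point is flagged when the cosine at that point falls below the threshold.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the program's mean point struct.
public struct MeanPoint
{
    public double X;
    public double Y;
    public MeanPoint(double x, double y) { X = x; Y = y; }
}

public static class TurningPoints
{
    // Cosine of the angle between the vectors (a -> b) and (b -> c).
    public static double CosineAt(MeanPoint a, MeanPoint b, MeanPoint c)
    {
        double ux = b.X - a.X, uy = b.Y - a.Y;
        double vx = c.X - b.X, vy = c.Y - b.Y;
        double norms = Math.Sqrt(ux * ux + uy * uy) * Math.Sqrt(vx * vx + vy * vy);
        if (norms == 0.0) return 1.0; // degenerate vector: treated as "no turn"
        return (ux * vx + uy * vy) / norms;
    }

    // Returns the interior mean points whose cosine is below the threshold.
    public static List<MeanPoint> Find(IList<MeanPoint> points, double cosineThreshold)
    {
        var result = new List<MeanPoint>();
        for (int i = 1; i + 1 < points.Count; i++)
        {
            if (CosineAt(points[i - 1], points[i], points[i + 1]) < cosineThreshold)
                result.Add(points[i]);
        }
        return result;
    }
}
```

A cosine close to 1 means the two vectors point in nearly the same direction, so a lower threshold keeps only sharper turns.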

B-FOLDERS

A temp folder is needed. It contains several sub-folders, listed below.

The folders are hard-coded in the program. Users may modify the program to use different folders.

C:\\Temp\\R_wd\\input\\ => This folder hosts the training and the test dataset files.

C:\\Temp\\R_wd\\output\\ => When the program runs, this folder hosts various temporary files used by R and C# to pass parameters.

C:\\Temp\\R_wd\\input\\DataDescription\\ => The program needs to know the data type of each attribute, that is, whether the attribute column holds discrete or continuous values. This information is put into a separate file called the “description file”. The description file must have EXACTLY the same filename as the training dataset file.

C:\\Temp\\R_wd\\output\\temp\\split_node\\TPRTI\\ => When the node dataset is split, the associated turning points are split as well. This folder contains several turning point files, each associated with a node.

C:\\Temp\\R_wd\\output\\temp\\trace\\ => For debugging purposes. Contains the temporary split dataset of each node.

C:\\Temp\\R_wd\\input\\test => This folder is used when a test file is uploaded.
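Since the folders must exist before the program runs, a small helper can create the whole layout in one call. The sketch below is not part of TPRTI; the WorkFolders class and its EnsureAll method are hypothetical names, and the root folder is passed in as a parameter rather than hard-coded.

```csharp
using System.IO;

public static class WorkFolders
{
    // Sub-folders listed in this section, relative to the working root (R_wd).
    public static readonly string[] Relative =
    {
        "input",
        Path.Combine("input", "DataDescription"),
        Path.Combine("input", "test"),
        "output",
        Path.Combine("output", "temp", "split_node", "TPRTI"),
        Path.Combine("output", "temp", "trace"),
    };

    // Creates every folder under root; folders that already exist are left untouched.
    public static void EnsureAll(string root)
    {
        foreach (string rel in Relative)
            Directory.CreateDirectory(Path.Combine(root, rel));
    }
}
```

For the default layout, the call would be WorkFolders.EnsureAll(@"C:\Temp\R_wd").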

C-DATASET

Two different files are needed to induce a model: a training dataset and a description file. Both are .csv files. Both have exactly the same name but reside in two different folders.

Training dataset file

The training dataset is used to train the model. The output variable is always located in the first column of the training dataset. TPRTI accepts real-valued variables, either discrete or continuous.

The dataset should be placed in the folder C:\\Temp\\R_wd\\input\\.

Description file

The description file tells whether each column is a continuous variable or a discrete variable. It is a one-column file in which each row describes an attribute. The first row always describes the output variable and is always set as continuous. The second row describes the first input attribute, the third row describes the second input attribute, and so on. The picture below depicts the content of a description file. In this example, the description file shows that all attributes are continuous.

The description file is placed in the folder C:\\Temp\\R_wd\\input\\DataDescription\\.

Another example of a description file, where all attributes are discrete except the output variable (which is always in the first column), is shown below.
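Because the two files must share a name and the description file needs one row per dataset column, a quick consistency check before uploading can save a failed run. The sketch below is an illustration only: DescriptionCheck and Validate are hypothetical names, and it assumes plain comma-separated files with no quoted commas. It does not inspect the keywords used inside the description file.

```csharp
using System;
using System.IO;
using System.Linq;

public static class DescriptionCheck
{
    // Returns an empty string when the pair looks consistent,
    // otherwise a short message describing the problem.
    public static string Validate(string inputFolder, string fileName)
    {
        string dataPath = Path.Combine(inputFolder, fileName);
        string descPath = Path.Combine(inputFolder, "DataDescription", fileName);
        if (!File.Exists(dataPath)) return "training file not found: " + dataPath;
        if (!File.Exists(descPath)) return "description file not found: " + descPath;

        // Assumption: plain comma-separated values with no quoted commas.
        string firstLine = File.ReadLines(dataPath).FirstOrDefault() ?? "";
        int columns = firstLine.Split(',').Length;

        // One non-empty description row is expected per dataset column.
        int rows = File.ReadLines(descPath).Count(line => line.Trim().Length > 0);

        return rows == columns
            ? ""
            : "description has " + rows + " rows but the dataset has " + columns + " columns";
    }
}
```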

D-HOW DOES IT RUN

R must be installed, and the R server must be running.

To install both R and the R server, please follow the tutorial at http://joachimvandenbogaert.wordpress.com/2009/03/26/r-and-c-onwindows/

Below is the form for TPRTI-B.

First, use the “Upload a File..” button to upload the training data, which must already be in the C:\\Temp\\R_wd\\input\\ folder. Next, set the input parameters and the stopping parameters. The “interval Set Size” is the size of the subsets. This parameter is critical since it influences the general trend. The “Cosine Threshold” is the cosine value below which an angle is classified as a “turning point angle”.

The stopping criterion “Set Size(%)” is the minimum number of examples required in a node, expressed as a percentage of the initial training dataset.

The “Improv Vari less than (%)” is the minimum improvement of the weighted RSS from a parent node to its child nodes. In other words, if splitting the current node does not improve the RSS by at least the provided percentage, the program does not split. The MSE and total RSS are provided for every node, along with the overall MSE and total RSS over the leaf nodes.
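The two stopping criteria can be sketched as a single check. This is an illustration only: StoppingRule and ShouldSplit are hypothetical names, and the sketch assumes that the improvement is the relative drop from the parent RSS to the combined RSS of the children. The exact weighting used by the program is not spelled out here, so treat the formula as an assumption.

```csharp
public static class StoppingRule
{
    // childrenRss is assumed to be the combined RSS of the candidate child nodes.
    public static bool ShouldSplit(int nodeSize, int trainingSize, double minSetSizePercent,
                                   double parentRss, double childrenRss, double minImprovementPercent)
    {
        // Criterion 1: the node must hold at least the minimum share of the training set.
        double sizePercent = 100.0 * nodeSize / trainingSize;
        if (sizePercent < minSetSizePercent) return false;

        // Criterion 2: the split must reduce the RSS by at least the requested percentage.
        if (parentRss <= 0.0) return false; // perfect fit: nothing left to improve
        double improvementPercent = 100.0 * (parentRss - childrenRss) / parentRss;
        return improvementPercent >= minImprovementPercent;
    }
}
```

For example, a node holding 100 of 1000 training examples whose split lowers the RSS from 50 to 40 passes a 5% size limit and a 10% improvement limit.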

The model of each node is given in the column “Node Coeff” as the y-intercept constant followed by the coefficients of the input variables.

For each node, the model is provided as $y = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n$, where $n$ is the number of input variables. The display is done in this manner: $c_0$<space>$c_1$<space>..<space>$c_n$

If a variable does not participate in the equation, its coefficient appears as “NA” instead of 0. Therefore, before the coefficients are used for anything else, NA must be replaced by 0.

The largest distance to the node model is displayed in the column “Distance”. Finally, the green areas of the form are used for testing.
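To reuse a node model outside the form, the “Node Coeff” text can be parsed and evaluated as in the sketch below. NodeModel, ParseCoefficients and Predict are hypothetical names, not part of the program. The sketch assumes the coefficients are separated by spaces and written with a decimal point.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class NodeModel
{
    // Parses "c0 c1 ... cn"; "NA" entries become 0.
    public static double[] ParseCoefficients(string nodeCoeff)
    {
        return nodeCoeff
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => t == "NA" ? 0.0 : double.Parse(t, CultureInfo.InvariantCulture))
            .ToArray();
    }

    // Evaluates y = c0 + c1*x1 + ... + cn*xn.
    public static double Predict(double[] coefficients, double[] inputs)
    {
        if (inputs.Length != coefficients.Length - 1)
            throw new ArgumentException("expected " + (coefficients.Length - 1) + " inputs");
        double y = coefficients[0];
        for (int i = 0; i < inputs.Length; i++)
            y += coefficients[i + 1] * inputs[i];
        return y;
    }
}
```

For instance, the text "1.5 NA 2" describes a model with two input variables in which the first one does not participate.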

TPRTI-A uses a similar form and hence is not described here.

E-THE PROGRAM

The program itself is written in a straightforward manner. The class Node contains everything about a node. Node datasets are stored in temporary files; only the name of the file is stored in the node object.

Method public List<Node> dev_tree(Node node, List<Node> Node_List)

Takes in a node list and returns the final node list, which is the final tree. It first creates the set of mean points, then computes the set of turning points. Both sets are stored in files. It then calls the grow_tree2 method to develop the tree.

Method public List<Node> grow_tree2(Node node, List<Node> Node_List, List<mean_point>[] arr_tp_list, string[] ATTRIBUTE_NAMES, STATCONNECTORSRVLib.StatConnector sc1)

Node node: the current node, which needs to be evaluated for a possible split.

List<Node> Node_List: the tree being grown.

List<mean_point>[] arr_tp_list: an array of turning point lists. Recall that turning points are nothing but a subset of the mean points set.

string[] ATTRIBUTE_NAMES: the list of the attribute names.

STATCONNECTORSRVLib.StatConnector sc1: the connector needed to launch R.

The method grow_tree2 needs to evaluate each node. To do that, it calls the method Find_Split_PT2, which returns the split point.
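The overall recursion can be pictured with the simplified sketch below. It is not the program's code: SketchNode and TreeSketch are hypothetical names, the node data is kept in memory instead of temporary files, no call to R is made, and the findSplit delegate merely plays the role of Find_Split_PT2. The handling of turning point lists is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for the program's Node class.
public class SketchNode
{
    public List<double[]> Rows = new List<double[]>(); // column 0 = output variable
    public bool IsLeaf = true;
}

public static class TreeSketch
{
    // findSplit returns (attribute index, split value), or null when the node
    // should not be split.
    public static List<SketchNode> Grow(
        SketchNode node,
        List<SketchNode> nodeList,
        Func<SketchNode, Tuple<int, double>> findSplit)
    {
        nodeList.Add(node);
        Tuple<int, double> split = findSplit(node);
        if (split == null) return nodeList; // the node stays a leaf

        var left = new SketchNode
        {
            Rows = node.Rows.Where(r => r[split.Item1] <= split.Item2).ToList()
        };
        var right = new SketchNode
        {
            Rows = node.Rows.Where(r => r[split.Item1] > split.Item2).ToList()
        };
        if (left.Rows.Count == 0 || right.Rows.Count == 0) return nodeList; // degenerate split

        node.IsLeaf = false;
        Grow(left, nodeList, findSplit);
        Grow(right, nodeList, findSplit);
        return nodeList;
    }
}
```

The returned list holds every node of the tree, with the leaves marked by IsLeaf.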
