Rapid Miner Theory
NOTE: Always use "." (dot) as the decimal separator. Some Excel files use commas, which must be changed to dots.
NOTE: Excel files must be saved in the Excel 97-2003 format.
Interface
Important sections (check figure below):
Operators (where we can select the operators) and Repositories (where we can select the data and processes built previously)
Main process (where we put the building blocks of our method, using drag-and-drop from the Operators window)
Run (to run the process once it is designed in the main process window)
Build (to switch to the main process window) and Results (to show the results)
On the right we have the parameters of each operator, and at the bottom right a help window with definitions.
Note: when reading a help window, press F3 to freeze it. It's easier to read.
OPEN FILES
1. by creating a file in the repository
1.1. Files can be imported from Excel (97-2003 format only) or CSV (exported from Excel), as well as other types:
1.2. If the data is imported, in step 3 we must define what exactly the first lines are (Name, Comment, Unit). If there is just one line with the names of the variables, it is defined as Name.
1.3. Then, in step 4, the column types have to be defined. One can select which variables to use by checking the boxes at the top. They can be nominal, binominal or polynominal (for "text" variables with multiple choices, e.g. groups A, B or C). Variable WEATHER is polynominal because it has more than 2 possible values (sunny, overcast and raining). Variables LIGHT, GROUND and UMPIRE are binominal, because they only have 2. The variables could also be numeric (numeric, integer, real), text, or date. Column SAMPLE is text because it contains the names of the samples.
1.4. Select whether a column is an id (such as SAMPLE, which holds the sample names), an attribute (a normal "X" variable), or a label (a "Y" variable), among other possibilities. Labels define the classes, i.e. the "Y" variables.
1.5. Then save this new data table in one of the existing repositories. We can create folders inside the repositories to keep different projects separate.
1.6. Finally, one can drag and drop the data set "data decision trees" from the Repositories window into the main process, where a box called "Retrieve" will appear. Notice that data and processes have different icons in the Repositories window. The data is now ready for analysis:
2. by opening a file inside the main process
2.1. First search for the "Read Excel" (or "Read CSV") operator by typing "excel" in the search box on the left. Files with the .csv extension allow one to work with more than 256 columns; in that case use "Read CSV".
2.2. Drag and drop the operator on the left into the main process in the center
2.3. Then click on the “Import configuration wizard” on the right and follow the steps as in 1.2 to 1.4.
3. Our first simple model (decision tree on nominal data)
3.1. Let's build a model using a decision tree (on the data from the input section below) to tell us if, under certain weather conditions, a cricket match will be allowed to PLAY or not to PLAY.
3.2. Once the data is loaded as described in the previous sections, we need to include the decision tree model, which we add by drag-and-drop from the Operators menu.
3.3. We have to connect the ports to each other, then click on PLAY (at the top).
3.4. Because we only connected the decision tree "mod"el port to the "res"ult port, we only receive the graphical model (on the left) and the results in text format (on the right):
3.5. If we had also connected the "exa"mple port to the "res"ults, we would get an extra results window with the data for this example:
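For reference, a minimal sketch of the same idea outside RapidMiner, in Python with scikit-learn; the data values below are hypothetical stand-ins for the cricket table:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # hypothetical stand-in for the cricket data (nominal attributes, label PLAY)
    data = pd.DataFrame({
        "WEATHER": ["sunny", "overcast", "raining", "sunny", "overcast", "raining"],
        "LIGHT": ["good", "good", "bad", "bad", "good", "good"],
        "PLAY": ["yes", "yes", "no", "no", "yes", "no"],
    })

    # scikit-learn trees need numeric input, so the nominal attributes are dummy-coded
    X = pd.get_dummies(data[["WEATHER", "LIGHT"]])
    y = data["PLAY"]

    tree = DecisionTreeClassifier().fit(X, y)
    # text view of the tree, comparable to RapidMiner's text-format result
    print(export_text(tree, feature_names=list(X.columns)))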
4. A predictive model (linear discrimination with numerical values)
4.1. First let's define the data in Excel (below we have Excel's calibration set on the left and the prediction set on the right). Notice that the calibration set has a column "group" defining the group each sample belongs to. That column does not exist in the prediction set: it is what we want to predict.
NOTE: The decimals are separated by dots, not commas
4.2. When we import the data we can define the "id" as text or polynominal (and the "label" likewise, if it exists). The attributes with numerical values are automatically recognized as "real". Below is the calibration set during import:
4.3. Then we have to design the main process:
- Read the calibration file (Read CSV) and build a model (LDA).
- Read the prediction file (Read CSV (2)) and predict it by connecting it to the "unl"abelled port of "Apply Model".
- Notice that the model learner "LDA" is connected to "Apply Model" through the "mod"el ports.
4.4. The results are presented for the prediction data; the "prediction" column contains the results for the prediction set.
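As a rough Python/scikit-learn equivalent of this process (the numbers and group names below are hypothetical):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # calibration set: numerical attributes plus the "group" label
    X_cal = np.array([[1.0, 2.1], [0.9, 1.8], [1.1, 2.3],
                      [3.2, 4.0], [3.0, 4.2], [2.8, 3.9]])
    y_cal = ["g1", "g1", "g1", "g2", "g2", "g2"]

    # prediction set: same attributes, but no "group" column
    X_pred = np.array([[1.0, 2.0], [3.1, 4.1]])

    lda = LinearDiscriminantAnalysis().fit(X_cal, y_cal)  # the "LDA" learner
    print(lda.predict(X_pred))                            # fills the "prediction" column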
5. Evaluate the performance of a model using an external set
5.1. We are going to build a k-NN model with nominal variables, evaluate it using cross-validation (in RapidMiner it is called "X-Validation") and predict an external set. We build the model on the left, then double-click on the blue icon in the "Validation" operator. Another window appears, which represents the inner part of the validation (right).
NOTE: Both datasets must have labels!
5.2. Several performance criteria can be requested in the "Performance" operator. To recover them we have to connect the "ave"rage port of the "Validation" operator to the results. It gives us the confusion matrix.
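In scikit-learn terms, X-Validation with a k-NN learner and a confusion matrix would look roughly like this (hypothetical data):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
    y = ["no", "no", "no", "yes", "yes", "yes"]

    knn = KNeighborsClassifier(n_neighbors=1)
    # each sample is predicted by a model trained on the other folds
    y_hat = cross_val_predict(knn, X, y, cv=3)
    print(confusion_matrix(y, y_hat))  # what the "ave" port reports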
6. Evaluate the performance of a model with internal cross-validation
6.1. We are going to build a support vector machine classification model (two classes, binominal) and evaluate its performance using cross-validation. We had to use Select Attributes (to delete one attribute that was not needed), Set Role (to define the "label"), and "X-Validation", as well as "Performance (Binominal Classification)". Notice that the only "Validation" output port connected is the "ave"rage; connecting the other ports deletes the results we want.
The results to compute can be selected in the "Performance" operator. Among others, it gives a confusion matrix.
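A comparable sketch for the binominal SVM case, requesting several performance criteria at once (hypothetical data; accuracy, precision and recall stand in for the criteria selectable in the Performance operator):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_validate

    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
    y = [0, 0, 0, 1, 1, 1]  # two classes, i.e. a binominal label

    scores = cross_validate(SVC(kernel="linear"), X, y, cv=3,
                            scoring=["accuracy", "precision", "recall"])
    # averaged criteria, analogous to connecting only the "ave" port
    print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})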
Operators:
PROCESS CONTROL
Multiply – in case we want to calculate two methods at the same time on the same data, this operator copies its input to several outputs
IMPORT
Read model – to apply a model that has been designed before (see also “example source” and “model applier”)
DATA TRANSFORMATION (important)
Name and role modification
Rename – to simply rename a variable (there are several types of “rename”)
Set role – to change the role of one of the variables (for example, change UMPIRE DECISION from "label" to "id"). The operator "Exchange roles" swaps the roles of two variables
Type conversion
Discretization – puts numerical values from a variable into bins (there are several types of discretization)
Nominal to numerical – example: variable weather, containing 3 possible values (overcast, raining, sunny), becomes 3 weather variables containing dummy values (0 or 1), one for each of the three possibilities (see the sketch after this list).
Parse numbers – to change nominal values of variables to numerical.
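A quick sketch of the dummy-coding idea in Python with pandas (hypothetical values):

    import pandas as pd

    df = pd.DataFrame({"weather": ["overcast", "raining", "sunny", "raining"]})
    # one 0/1 column per possible value, as "Nominal to numerical" does
    print(pd.get_dummies(df, columns=["weather"]))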
Attribute set reduction and transformation
Generation
Generate ID – creates sample names (a vector from 1 to n). It deletes the former ID if one exists.
Generate attributes (or Generate empty attributes) – to generate new variables (squares, etc.)
Transformation
Principal component analysis – PCA implementation. The port "exa" gives the scores and the port "pre" gives the eigenvalues and loadings (see the sketch after this list)
Independent component analysis – ICA implementation. The port “exa” gives the scores and the port “pre” gives
eigenvalues and loadings
Singular value decomposition – SVD implementation. The port “exa” gives the scores and the port “pre” gives
eigenvalues and loadings
Self-organizing map – creates a self-organizing map
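For PCA, a small Python/scikit-learn sketch of the scores/eigenvalues/loadings split mentioned above (the data values are hypothetical):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.array([[1.0, 2.0, 3.0], [2.0, 1.5, 3.5],
                  [0.5, 2.5, 2.0], [1.5, 1.0, 4.0]])

    pca = PCA(n_components=2)
    scores = pca.fit_transform(X)    # what the "exa" port carries
    print(scores)
    print(pca.components_)           # the loadings
    print(pca.explained_variance_)   # the eigenvalues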
Selection
Select attributes – to select attributes (variables) from the data set. Note: invert selection works too
Value modification
Numerical value modification
Normalize – performs UV-scaling on the attributes, i.e. subtracts the mean and divides by the standard deviation of each attribute (select z-transformation). There is also the operator Denormalize.
Scale by weights – allows one to multiply the attributes by pre-defined weights (if a weight is zero, the attribute is effectively deselected from the dataset; if it is 1, it does not change at all). See the sketch below.
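A sketch of the z-transformation and the weighting in Python (hypothetical data and weights):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

    Xz = StandardScaler().fit_transform(X)  # z-transformation: (x - mean) / std per attribute
    w = np.array([1.0, 0.0])                # weight 0 effectively removes an attribute
    print(Xz * w)                           # "Scale by weights"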
Data cleansing
Outlier detection
Detect outlier (several possibilities) – flags samples that deviate strongly from the rest of the data
Filtering
Sampling
Sample (several types) – select random samples from the dataset
Sorting
Sort – sorts the samples according to the specifications
Rotation
Transpose – transposes the data set
Set operations
Append – to concatenate tables (vertically, to add lines to a table)
MODELING
Classification and regression
Lazy modeling
Default model (supervised classification) – a baseline that always predicts a default value (such as the most frequent class), regardless of the attributes
k-NN (supervised classification) (numerical and/or nominal variables) – basically there is no calibration step, but there is a calibration set in which the calibration samples have defined classes. In the prediction step, we measure the distance (or another measure) to all the calibration samples. Then we select the "k" smallest distances (e.g. 5 or 10) and see how many of these selected samples are from each class. The most represented class in this group wins. There are numerical, nominal and mixed distances available.
The only type of result one gets from it is below. Confidence tells us how many of the k selected samples belonged to each class. In this next example (table of results below), with 3 classes and k=4, for the class prediction of sample NR05 (which should be class c1) we obtain a correct prediction. In the confidence columns we can see that 0.75 (i.e. 3) of the 4 samples with the smallest distance to this sample were from class 1 and 0.25 (i.e. 1) was from class 2, so it decided for class 1.
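The confidence computation can be reproduced in scikit-learn; with the hypothetical points below and k=4, the nearest neighbours of the query are three c1 samples and one c2 sample, giving confidences 0.75 and 0.25 as in the table above:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X_cal = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
    y_cal = ["c1", "c1", "c1", "c2", "c2", "c2"]

    knn = KNeighborsClassifier(n_neighbors=4).fit(X_cal, y_cal)
    x_new = np.array([[0.5, 0.5]])
    print(knn.predict(x_new))        # the winning class among the 4 nearest neighbours
    print(knn.predict_proba(x_new))  # the confidences: [0.75, 0.25]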
Bayesian modeling
Naïve Bayes (supervised classification) (numerical and/or nominal variables) – for the calibration step the variables are considered separately and the distribution of each variable within each group is estimated. The probability of a sample belonging to a class is then proportional to the class prior multiplied by the probabilities of the sample's values across all the variables (the variables are treated as independent, hence "naïve").
Alternatively, a region of a certain size is selected around the point to predict, and the number of samples of each class inside that region gives the probability of belonging to that class (as in the following link):
http://www.statsoft.com/textbook/naive-bayes-classifier/
Use the Laplace correction to prevent zero probabilities from wiping out the whole product.
As an example, the distribution of the classes in variable w410 is shown graphically (left) and the results for some of the variables (average and standard deviation) are presented in a table (right).
The main process used was (notice that "Apply Model" connects "mod"el to "res"ults, to get the distribution plots):
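A minimal sketch of the numerical case with Gaussian naïve Bayes in scikit-learn (hypothetical one-variable data standing in for w410):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    X_cal = np.array([[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]])
    y_cal = ["c1", "c1", "c1", "c2", "c2", "c2"]

    nb = GaussianNB().fit(X_cal, y_cal)
    print(nb.theta_)                  # per-class averages, as in the table of results
    print(nb.predict_proba([[1.1]]))  # priors combined with per-variable likelihoods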
Naïve Bayes kernel – similar to Naïve Bayes, but estimates the class distributions with kernels and has more parameters. Probably more accurate.
Neural Networks
Perceptron (supervised classification) – only allows two classes
Neural net (supervised classification) – allows one to use a calibration set to create the neural network and then predict a new set of samples. Many different parameters must be selected (below, right).
The type of results for the predictions is similar to other classification methods (below, left). It also gives the nodes' values (below, right). See the sketch after this subsection.
AutoMLP (optimization of neural network parameters) – a simple algorithm that adjusts both the learning rate and the network size during training.
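For the Neural net learner, a comparable Python sketch with scikit-learn's MLPClassifier (hypothetical one-variable data; the hidden layer size and iteration limit are arbitrary choices):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    X_cal = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
    y_cal = ["no", "no", "no", "yes", "yes", "yes"]

    net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000,
                        random_state=1).fit(X_cal, y_cal)
    print(net.predict([[0.15], [0.85]]))  # predictions for new samples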
Support vector modelling
Support vector machine (supervised classification) – Uses support vector machines to separate two classes.
http://www.statsoft.com/textbook/support-vector-machines/
NOTE: the label has to be a binominal variable (SVM can only separate two groups at a time).
The type of results is similar to other classification methods (probability for the class and a prediction table).
Some of the SVM models available may also present as results the performance, the weights of the variables (below, left) and the support vectors (below, right).
The basic flow is presented below. The optional parameters vary according to which SVM is selected.
Fast large margin is an SVM for large datasets.
Hyper hyper is a minimal SVM model, built with only one sample from each class, intended for use with boosting methods.
Discriminant analysis
Linear discriminant analysis (supervised classification) (numerical variables, only for two-class problems) – a linear discriminant function for binominal labels and numerical attributes. Use binominal values for the label.
Note: all the other methods in discriminant analysis work the same way as the linear one.
Polynominal by binominal classification (SVM-based supervised classification) (numerical variables) – sequentially performs, for every class, an SVM between that class and another class (or all the other classes). This way one can separate multiple classes.
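In scikit-learn this corresponds to wrapping a binary SVM in a one-vs-one (or one-vs-rest) scheme (hypothetical data):

    import numpy as np
    from sklearn.multiclass import OneVsOneClassifier
    from sklearn.svm import SVC

    X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])
    y = ["A", "A", "B", "B", "C", "C"]

    # one binary SVM per pair of classes; the votes decide the final class
    clf = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
    print(clf.predict([[0.2, 0.5], [9.5, 0.5]]))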
Clustering and segmentation
k-means (unsupervised clustering) (numerical values) – given a fixed number k of (desired or hypothesized) clusters, assigns observations to those clusters so that the means across clusters (for all variables) are as different from each other as possible.
http://www.statsoft.com/textbook/cluster-analysis/#vfold
One only needs to supply the number of groups (k) expected. The data should contain no "label" variable.
Pay attention to the connections of the "Clustering" operator; there may be a bug there.
The results are the folder view (left), graph view (middle left) and centroid table (middle right) for building the model. The predicted results are also presented (right).
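A short scikit-learn sketch of the same workflow (hypothetical points; one supplies only k, no label):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])

    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print(km.labels_)                # cluster assignment of each sample
    print(km.cluster_centers_)       # the centroid table
    print(km.predict([[0.5, 0.5]]))  # assigning new samples to the found centroids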
k-means cluster - I don’t know how to predict new samples with this one.
k-means fast – the same as k-means, but faster for larger datasets and numbers of groups. Needs more memory.
X-means – like k-means, but it optimizes the number of clusters. We have to supply a minimum and a maximum number of clusters.
Agglomerative clustering (unsupervised method) – dendrogram of the "calibration samples"; there is no prediction of an external set.
Flatten cluster (unsupervised method) – uses a previously built hierarchical clustering to apply the clustering to a prediction set.
Extract cluster prototypes – shows the averages (or whatever characterizes each cluster); needs a clustering method before it
Correlation and dependency computation
Correlation matrix – calculates the correlation matrix and a vector of weights based on correlation
Covariance matrix – calculates the covariance matrix
Mutual information matrix – calculates the mutual information between every pair of variables. The result is a matrix.
Cross-distances – calculates the distance from each object in one set to the closest object in another set. It can also calculate this for the closest n objects, or the distance to the farthest object(s). See the sketch below.
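A sketch of the cross-distances computation with SciPy (hypothetical point sets):

    import numpy as np
    from scipy.spatial.distance import cdist

    A = np.array([[0, 0], [5, 5]])
    B = np.array([[1, 0], [4, 5], [9, 9]])

    D = cdist(A, B)       # all pairwise distances between the two sets
    print(D.min(axis=1))  # distance from each object in A to the closest object in B
    print(D.max(axis=1))  # ... or to the farthest object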
Model application
Apply model – applies a model to a prediction dataset (useful for predictions)
Evaluation
To load a process and run it (text and figure to be changed)
This getting-started process demonstrates how to load (retrieve) a model from the repository and apply it to a data set. The result is a data set (at the "lab" output, for "labeled data") which has a new "prediction" attribute that indicates the prediction for each example (i.e. row/record).
You will need to adjust the path of the Retrieve operator to the actual location where the model was stored by a previous execution of the "1. Getting Started: Learn and Store a Model" process (or execute the disabled operator at the top of the process).