Data Mining Process Overview

Data mining process: business understanding, data understanding, data preparation,
modeling, evaluation, and deployment.
Two types of categorical variables are often distinguished – nominal (unordered categories) and ordinal (ordered categories).
Two types of quantitative variables are often distinguished – continuous (measured on a continuum) and discrete (countable values).
In a pandas data frame, rows and columns are identified by integer or string labels. The set
of row labels is called the index and the set of column labels is called columns.
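
As a small illustration of row and column labels, the sketch below builds a toy pandas data frame and reads its index and columns. The data, the labels "a", "b", "c", and the column names are made up for the example.

```python
import pandas as pd

# Hypothetical toy data; names and values are for illustration only.
df = pd.DataFrame(
    {"age": [34, 51, 27], "city": ["Oslo", "Lima", "Pune"]},
    index=["a", "b", "c"],  # string row labels
)

row_labels = list(df.index)       # the index: row labels
column_labels = list(df.columns)  # the columns: column labels

# Label-based lookup of one cell: row "b", column "age".
value = df.loc["b", "age"]
```
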
Imputation is the estimation of missing values with descriptive statistics or predicted values.
Simpler methods of imputation that use the feature's mean, median, or mode are appropriate only
when the values are missing at random.
The data_frame.dropna() function removes rows with missing data from a data frame.
The data_frame.fillna(value, method) function replaces missing values with either a specified
value or a value produced by a fill method (for example, forward fill).
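
The calls above can be tried on a toy data frame, as in the sketch below. The column names and values are invented. The method argument is left out on purpose, because recent pandas releases deprecate it in favor of ffill() and bfill(); mean imputation is done here by passing the column means as the fill value.

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({"height": [1.6, None, 1.8], "weight": [60.0, 72.0, None]})

dropped = df.dropna()               # keeps only rows with no missing values
filled_const = df.fillna(0)         # replaces every missing value with 0
filled_mean = df.fillna(df.mean())  # mean imputation, column by column
```

Here dropna() keeps only the first row, while fillna(df.mean()) keeps all three rows and fills each gap with its column's mean.
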
Standardization brings a feature's values to a small range centered near 0 by computing:
(observation - mean) / (standard deviation); the result is called a z-score.
Normalization is rescaling a feature's values to the range [0,1] by computing:
(observation - min) / (max - min).
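
A minimal sketch of both rescalings using only the standard library. The values are made up, and the population standard deviation (pstdev) is an assumption; the notes do not say whether the population or sample version is meant.

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature values

# Standardization: (observation - mean) / standard deviation
mean = statistics.mean(values)
std = statistics.pstdev(values)  # assumption: population standard deviation
z_scores = [(x - mean) / std for x in values]

# Normalization: (observation - min) / (max - min)
lo, hi = min(values), max(values)
normalized = [(x - lo) / (hi - lo) for x in values]
```

The z-scores sum to zero, and the normalized values run from 0 for the minimum to 1 for the maximum.
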
The goal of supervised learning is to predict a particular feature's value based on other
features' values – for example, k-nearest neighbors (KNN) and logistic regression.
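
To make the idea concrete, here is a rough sketch of a k-nearest-neighbors classifier in plain Python. The function name knn_predict, the Euclidean distance, the majority vote, and the toy points are all assumptions chosen for illustration, not a reference implementation.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: two features per observation, one label to predict.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["small", "small", "small", "large", "large", "large"]

prediction = knn_predict(train_X, train_y, (1.1, 1.0))
```
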
Unsupervised learning methods do not attempt to predict an output value but instead detect
and identify patterns and relationships in data – for example, cluster analysis and association rules.
Partitioning is the process of splitting the data into training, validation, and test sets.
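
One possible way to partition is to shuffle the rows and slice them, as sketched below. The 60/20/20 proportions, the fixed seed, and the function name partition are assumptions for the example; the notes do not prescribe them.

```python
import random

def partition(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_train = int(len(rows) * train_frac)
    n_valid = int(len(rows) * valid_frac)
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]  # whatever remains goes to the test set
    return train, valid, test

train, valid, test = partition(range(10))
```

Every row lands in exactly one of the three sets.
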
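The Gini index of a partition A is I(A) = 1 - sum(Pk^2), where Pk is the proportion of
observations in A that belong to class k. The overall Gini index of a split is the average of
the partitions' indices, weighted by partition size.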
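
The Gini computation can be sketched as follows. The helper names and the "yes"/"no" labels are hypothetical, and each partition is assumed to be non-empty.

```python
from collections import Counter

def gini(labels):
    """Gini index 1 - sum(Pk^2) for a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(partitions):
    """Overall Gini index: partition indices weighted by partition size."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions)

# Hypothetical split of six observations into two partitions.
left = ["yes", "yes", "yes", "no"]
right = ["no", "no"]
overall = weighted_gini([left, right])
```

For this split, the left partition has index 1 - (0.75^2 + 0.25^2) = 0.375, the right partition is pure (index 0), and the overall index is (4/6) * 0.375 = 0.25.
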
Entropy = -sum(Pk * log2(Pk))
Information gain is defined as the entropy of the data before splitting on an attribute minus
the weighted average entropy of the partitions produced by that split.
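
A minimal sketch of entropy and information gain, taking the gain of a split to be the entropy of the unsplit labels minus the size-weighted entropy of the resulting partitions. Function names and labels are made up; partitions are assumed to be non-empty.

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy -sum(Pk * log2(Pk)) for a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, partitions):
    """Entropy of the parent minus the size-weighted entropy of its partitions."""
    total = len(parent)
    weighted = sum(len(p) / total * entropy(p) for p in partitions)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
# A split that separates the classes perfectly removes all uncertainty.
gain = information_gain(parent, [["yes", "yes"], ["no", "no"]])
```

A split that leaves both partitions as mixed as the parent, such as [["yes", "no"], ["yes", "no"]], has a gain of 0.
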
Accuracy – (TP+TN)/ALL
Precision – TP/(TP+FP)
Recall – TP/(TP+FN)
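
The three metrics can be computed from confusion-matrix counts as sketched below, with recall taken as TP/(TP+FN) and ALL as TP+TN+FP+FN. The function name and the counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn  # ALL observations
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # share of predicted positives that are correct
        "recall": tp / (tp + fn),     # share of actual positives that are found
    }

# Hypothetical counts for a binary classifier evaluated on 100 observations.
metrics = classification_metrics(tp=40, tn=45, fp=10, fn=5)
```
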
Bootstrapping is the process of generating simulated samples by repeatedly drawing with
replacement from an existing sample.
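
As an illustration, the sketch below bootstraps the mean of a small sample. The sample values, the number of resamples, the seed, and the choice of the mean as the statistic are all assumptions for the example.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `sample`."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    means = []
    for _ in range(n_resamples):
        # Draw len(sample) observations with replacement from the existing sample.
        resample = rng.choices(sample, k=len(sample))
        means.append(statistics.mean(resample))
    return means

sample = [3.1, 4.7, 5.0, 5.6, 6.2, 7.9]  # hypothetical observations
means = bootstrap_means(sample)
```

The spread of the resulting means gives a rough picture of how much the sample mean would vary from sample to sample.
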