Vishnu

Scikit Learn
Installation:
It can be easily installed using Anaconda Python (https://www.continuum.io/downloads). If you
have already installed Python and only want to add Scikit Learn, you can follow the instructions
on the scikit-learn website (http://scikit-learn.org). The easiest way is to use Anaconda. If
you want to install without using Anaconda, use the pip command: pip install -U scikit-learn.
I faced a few issues installing scikit-learn on CentOS 6.6; it was unable to install the
GTKAgg, GTK3Agg, GTK, and GTKCairo backends. The comprehensive list of backends supported by
matplotlib is available on its website (http://matplotlib.org/faq/usage_faq.html).
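Once the installation finishes, a quick sanity check is to import the package from a Python prompt and print its version:
import sklearn
print(sklearn.__version__)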
Scikit-learn supports both supervised and unsupervised learning algorithms.
Among supervised learning techniques it supports Generalized Linear Models, Ridge Regression,
Lasso, Orthogonal Matching Pursuit, Bayesian Regression, Logistic Regression, Stochastic
Gradient Descent, Support Vector Machines, etc. A comprehensive list of techniques and their
variations can be found in the scikit-learn documentation (http://scikit-learn.org/stable/tutorial/index.html).
Scikit-learn is mostly used for the following tasks:
1) Classification
2) Clustering
3) Dimensionality Reduction
4) Cross Validation
5) Preprocessing.
Classification:
Classification is a general process related to categorization, the process in which ideas and
objects are recognized, differentiated, and understood. A classification system is an approach
to accomplishing classification.
In the terminology of machine learning, classification is considered an instance of supervised
learning, i.e. learning where a training set of correctly identified observations is available. The
corresponding unsupervised procedure is known as clustering, and involves grouping data into
categories based on some measure of inherent similarity or distance.
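As a minimal sketch of supervised classification (here using the bundled iris dataset and a k-nearest-neighbours classifier chosen purely for illustration), a model is fit on correctly labelled observations and then asked to label a new one:
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
iris = datasets.load_iris()
# Train on the labelled observations (features plus known class labels)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(iris.data, iris.target)
# Predict the class of a new, previously unseen measurement
print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))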
Clustering: Clustering is the process of grouping objects in such a way that objects within
the same group are similar and objects in different groups are dissimilar. It is generally
used for samples that have no labels associated with them, when we still want to divide the
given data into a set of categories. The number of clusters into which a given dataset can be
divided is debatable and depends on the type of problem we want to solve.
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be
achieved by various algorithms that differ significantly in their notion of what constitutes a
cluster and how to efficiently find them.
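For example, K-Means, one of the simplest clustering algorithms, needs the number of clusters up front, unlike the MeanShift example shown later. A minimal sketch on synthetic data (the parameters below are illustrative):
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Synthetic, unlabelled data drawn around three centers
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
km = KMeans(n_clusters=3)  # the number of clusters has to be chosen by us
km.fit(X)
print(km.cluster_centers_)  # coordinates of the learned cluster centers
print(km.labels_[:10])      # cluster assignment of the first ten samples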
Dimensionality Reduction:
In machine learning and statistics, dimensionality reduction or dimension reduction is the process
of reducing the number of random variables under consideration, and can be divided into feature
selection and feature extraction.
Feature extraction transforms the data in the high-dimensional space to a space of
fewer dimensions. The data transformation may be linear, as in principal component
analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. For
multidimensional data, tensor representation can be used in dimensionality reduction through
multilinear subspace learning.
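The sample code later in this article demonstrates feature extraction with PCA and LDA. Feature selection, the other family mentioned above, instead keeps a subset of the original variables; a minimal sketch using SelectKBest on the iris data (chosen purely for illustration):
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_classif
iris = datasets.load_iris()
print(iris.data.shape)  # (150, 4): four original features
# Keep the two features that score highest on an ANOVA F-test against the labels
selector = SelectKBest(f_classif, k=2)
X_reduced = selector.fit_transform(iris.data, iris.target)
print(X_reduced.shape)  # (150, 2)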
Cross-Validation:
Cross-validation, sometimes called rotation estimation, is a model validation technique for
assessing how the results of a statistical analysis will generalize to an independent data set. It is
mainly used in settings where the goal is prediction, and one wants to estimate how accurately a
predictive model will perform in practice. In a prediction problem, a model is usually given a
dataset of known data on which training is run (training dataset), and a dataset of unknown
data (or first-seen data) against which the model is tested (testing dataset). The goal of
cross-validation is to define a dataset to "test" the model in the training phase (i.e., the
validation dataset), in order to limit problems like overfitting and to give an insight into
how the model will generalize to an independent dataset.
One round of cross-validation involves partitioning a sample of data into complementary subsets,
performing the analysis on one subset (called the training set), and validating the analysis on the
other subset (called the validation set or testing set). To reduce variability, multiple rounds of
cross-validation are performed using different partitions, and the validation results are averaged
over the rounds.
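Scikit-learn wraps this train/validate/average loop in cross_val_score, which returns one score per round. A minimal sketch on the iris data (the estimator and fold count are illustrative):
from sklearn import datasets, svm
from sklearn.cross_validation import cross_val_score
# (in newer scikit-learn versions: from sklearn.model_selection import cross_val_score)
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
# Five rounds: train on four folds, score on the held-out fold, repeat
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)          # one accuracy value per round
print(scores.mean())   # the averaged validation result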
Preprocessing:
The sklearn.preprocessing package provides several common utility functions and transformer
classes to change raw feature vectors into a representation that is more suitable for the
downstream estimators.
Standardization of datasets is a common requirement for many machine learning
estimators implemented in scikit-learn. They might behave badly if the individual features do not
more or less look like standard normally distributed data: Gaussian with zero mean and unit
variance.
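For example, StandardScaler from sklearn.preprocessing performs exactly this standardization, learning the per-feature mean and standard deviation from the training data. A minimal sketch on a small made-up matrix:
import numpy as np
from sklearn.preprocessing import StandardScaler
X_train = np.array([[1.0, -1.0,  2.0],
                    [2.0,  0.0,  0.0],
                    [0.0,  1.0, -1.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # zero mean and unit variance per column
print(X_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_scaled.std(axis=0))   # 1 for every feature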
Sample Code for Regression:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test,
                                           diabetes_y_test))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Output: (the script prints the coefficient, mean squared error, and variance score, and shows a scatter plot of the test samples with the fitted regression line)
Sample Code for Clustering:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
###############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)
###############################################################################
# Compute clustering with MeanShift
# The bandwidth can be estimated automatically with estimate_bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : %d" % n_clusters_)
###############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Output: (the script prints the number of estimated clusters, typically 3 for this data, and shows a scatter plot of the clusters with their centers)
Sample Code for Dimensionality Reduction:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.lda import LDA
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
lda = LDA(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
# Percentage of variance explained by each of the components
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))
plt.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('PCA of IRIS dataset')
plt.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('LDA of IRIS dataset')
plt.show()
Output: (the explained variance ratio of the first two principal components, followed by the PCA and LDA scatter plots of the iris dataset)
Sample Code for Cross Validation:
K-Fold:
import numpy as np
from sklearn.cross_validation import KFold
kf = KFold(4, n_folds=2)
for train, test in kf:
    print("%s %s" % (train, test))
Stratified K-Fold:
from sklearn.cross_validation import StratifiedKFold
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
skf = StratifiedKFold(labels, n_folds=2)
for train, test in skf:
    print("%s %s" % (train, test))
Sample Code Using SVM that Predicts Digits in an Image:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
digits = datasets.load_digits()
print(digits.data)
print(digits.target)
print(digits.images[0])
clf = svm.SVC(gamma=0.001, C=100)
print(len(digits.data))
x, y = digits.data[:-1], digits.target[:-1]
clf.fit(x, y)
print("Prediction:", clf.predict(digits.data[-1:]))
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation="nearest")
plt.show()
Output:
('Prediction:', array([8]))