Scikit Learn Installation: Scikit-learn can be installed easily using Anaconda Python (https://www.continuum.io/downloads). If you have already installed Python and want to add scikit-learn, you can follow the instructions on the scikit-learn website (http://scikit-learn.org/stable/install.html). The easiest way is to use Anaconda. If you want to install without Anaconda, use the pip command: pip install -U scikit-learn. I faced a few issues installing scikit-learn on CentOS 6.6: it was unable to install the backends GTKAgg, GTK3Agg, GTK, and GTKCairo. A comprehensive list of the backends supported by matplotlib is available on its website (http://matplotlib.org/faq/usage_faq.html).

Scikit-learn supports both supervised and unsupervised learning algorithms. Among supervised learning techniques it supports Generalized Linear Models, Ridge Regression, Lasso, Orthogonal Matching Pursuit, Bayesian Regression, Logistic Regression, Stochastic Gradient Descent, Support Vector Machines, etc. A comprehensive list of techniques and their variations can be found in the scikit-learn documentation (http://scikit-learn.org/stable/tutorial/index.html). Scikit-learn is mostly used for the following tasks: 1) Classification 2) Clustering 3) Dimensionality Reduction 4) Cross Validation 5) Preprocessing.

Classification: Classification is a general process related to categorization, the process by which ideas and objects are recognized, differentiated, and understood. A classification system is an approach to accomplishing classification. In machine learning terminology, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, which involves grouping data into categories based on some measure of inherent similarity or distance.

Clustering: Clustering is the process of grouping objects in such a way that objects within the same group are similar and objects in different groups are dissimilar. It is generally used for samples that have no labels associated with them but still need to be divided into a set of categories. The number of clusters into which a given dataset should be divided is often debatable and depends on the type of problem we want to solve. Cluster analysis itself is not one specific algorithm but a general task to be solved, and it can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently.

Dimensionality Reduction: In machine learning and statistics, dimensionality reduction (or dimension reduction) is the process of reducing the number of random variables under consideration, and it can be divided into feature selection and feature extraction. Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. For multidimensional data, a tensor representation can be used for dimensionality reduction through multilinear subspace learning. (A short feature-selection sketch is included after the examples below.)

Cross-Validation: Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice.
In a prediction problem, a model is usually given a dataset of known data on which training is run (the training dataset) and a dataset of unknown (or first-seen) data against which the model is tested (the testing dataset). The goal of cross-validation is to define a dataset to "test" the model during the training phase (i.e., the validation dataset), in order to limit problems like overfitting and to give an insight into how the model will generalize to an independent dataset. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set) and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. (A sketch that uses cross-validation to score a model is included after the examples below.)

Preprocessing: The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn: they might behave badly if the individual features do not more or less look like standard normally distributed data, i.e. Gaussian with zero mean and unit variance. (A short standardization sketch is included after the examples below.)

Sample Code for Classification (note that this example actually fits a linear regression model to the diabetes dataset; a true classification example using an SVM appears at the end of this section):

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis]
diabetes_X_temp = diabetes_X[:, :, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X_temp[:-20]
diabetes_X_test = diabetes_X_temp[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)

# The mean squared error on the test set
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

Output: (scatter plot of the test points with the fitted regression line)

Sample Code for Clustering:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets.samples_generator import make_blobs

###############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)

###############################################################################
# Compute clustering with MeanShift

# The bandwidth can be detected automatically
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)

###############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Output: (scatter plot of the estimated clusters with their centers marked)

Sample Code for Dimensionality Reduction:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.lda import LDA

iris = datasets.load_iris()

X = iris.data
y = iris.target
target_names = iris.target_names

pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

lda = LDA(n_components=2)
X_r2 = lda.fit(X, y).transform(X)

# Percentage of variance explained by each component
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))

plt.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('PCA of IRIS dataset')

plt.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('LDA of IRIS dataset')

plt.show()

Output: (two figures: the PCA projection and the LDA projection of the iris dataset)

Sample Code for Cross Validation:

K-Fold:

import numpy as np
from sklearn.cross_validation import KFold

# Split 4 samples into 2 consecutive folds
kf = KFold(4, n_folds=2)
for train, test in kf:
    print("%s %s" % (train, test))

Stratified K-Fold:

from sklearn.cross_validation import StratifiedKFold

labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
# Each fold preserves the class proportions of the labels
skf = StratifiedKFold(labels, n_folds=3)
for train, test in skf:
    print("%s %s" % (train, test))

Sample Code Using SVM that Predicts Digits in an Image:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm

digits = datasets.load_digits()
print(digits.data)
print(digits.target)
print(digits.images[0])

clf = svm.SVC(gamma=0.001, C=100)
print(len(digits.data))

# Train on all but the last image, then predict the held-out image
x, y = digits.data[:-1], digits.target[:-1]
clf.fit(x, y)

print("Prediction:", clf.predict(digits.data[-1:]))
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation="nearest")
plt.show()

Output: ('Prediction:', array([8]))
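Sample Code for Preprocessing: preprocessing is the only task listed above without an example in this section, so the following is a minimal sketch of standardization with sklearn.preprocessing.StandardScaler. The toy feature matrix and the test row are illustrative assumptions, not data from the original text.

import numpy as np
from sklearn import preprocessing

# Toy feature matrix (illustrative values only)
X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Fit the scaler on the training data and rescale it to zero mean and unit variance
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)

print(X_scaled.mean(axis=0))  # per-feature means, approximately 0
print(X_scaled.std(axis=0))   # per-feature standard deviations, approximately 1

# The same fitted scaler can later be applied to new or test data
X_test = np.array([[-1., 1., 0.]])
print(scaler.transform(X_test))

Fitting the scaler on the training data only and reusing it on test data keeps both sets on the same scale without leaking test-set statistics into training.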
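Sample Code for Scoring a Model with Cross-Validation: the K-Fold examples above only print the index partitions. The sketch below shows one way to use cross-validation end to end, estimating the accuracy of an SVM on the iris data with cross_val_score; the choice of estimator, dataset, and cv=5 are assumptions made for illustration. (In scikit-learn 0.18 and later, cross_val_score is imported from sklearn.model_selection rather than sklearn.cross_validation.)

from sklearn import datasets, svm
from sklearn.cross_validation import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# 5-fold cross-validation: train on four folds, score on the held-out fold, repeat
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print("Accuracy per fold:", scores)
print("Mean accuracy: %.2f" % scores.mean())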
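Sample Code for Feature Selection: the dimensionality reduction example above demonstrates feature extraction (PCA and LDA) but not feature selection, which the text also mentions. The following is a minimal sketch using SelectKBest with the chi-squared score on the iris data; the scoring function and k=2 are illustrative assumptions.

from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2

iris = datasets.load_iris()
X, y = iris.data, iris.target
print(X.shape)   # (150, 4): four original features

# Keep the two features most strongly associated with the class labels
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)   # (150, 2)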