Unit-V Machine Learning Architectures

A machine learning architecture must be able to preprocess the input data, decompose/augment it, classify/cluster it, and, eventually, show the results using graphical tools. The main topics are:

• Data collection, preprocessing, and augmentation
• Normalization, regularization, and dimensionality reduction
• Vectorized computation and GPU support
• Distributed architectures
• Pipelines and feature unions

• A machine learning engineer often has to design a full architecture that a layman would consider a black box, where the raw data enters and the outcomes are produced automatically.
• All the steps necessary to achieve the final goal must be correctly organized and seamlessly joined together in a processing chain similar to a computational graph.
• Although every problem has its own peculiarities, there are some common steps that are normally included in almost any ML pipeline.
• In the following diagram, there's a schematic representation of this process:

[Diagram: Machine Learning Architecture]

Modeling/grid search/cross-validation

• Modeling implies the choice of the classification/clustering algorithm that best suits every specific task.
• The success of a machine learning technique often also depends on the right choice for each parameter involved in the model.
• In general, it's very difficult to find a precise analytical method to determine the optimal values to assign, so the best approach is always based on a grid search.
• scikit-learn provides a very flexible mechanism (GridSearchCV) to investigate the performance of a model with different parameter combinations; together with cross-validation, this is the most reasonable approach, even for expert engineers.

Grid search

• Grid-searching is the process of scanning a predefined set of parameter values to find the combination that works best for a given model. Which parameters need tuning depends on the type of model utilized.
• Grid-searching can be applied across machine learning to calculate the best parameters for any given model. It is important to note that grid-searching can be extremely computationally expensive and may take your machine quite a long time to run.
• Grid search builds a model for each possible parameter combination: it iterates through every combination, fits a model on it, and stores the results so that the best combination can be selected (a minimal GridSearchCV sketch appears after the GPU support section below).

Visualization

• Sometimes, it's useful or necessary to visualize the results of intermediate and final steps.
• Plots and diagrams can be created with matplotlib, which is part of the SciPy ecosystem and provides a flexible and powerful graphics infrastructure (a minimal plotting sketch follows below).
• Many new projects are being developed, offering new and more stylish plotting functions. One of them is Bokeh (http://bokeh.pydata.org), which uses some JavaScript code to create interactive graphs that can also be embedded into web pages.

GPU support

• Frameworks such as scikit-learn are mainly based on NumPy, which contains highly optimized native code that runs on multi-core CPUs. Moreover, NumPy exploits Single Instruction Multiple Data (SIMD) instructions and exposes vectorized primitives. In many cases, this allows getting rid of explicit loops, considerably reducing the execution time (see the vectorization sketch below).
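As a minimal GridSearchCV sketch for the grid search described above (the digits dataset and the grid values are arbitrary choices for illustration, not prescriptions from the text):

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

digits = load_digits()

# Hypothetical grid: one model is fitted (and cross-validated) per combination
param_grid = {
    'kernel': ['rbf', 'poly'],
    'C': [0.1, 1.0, 10.0],
    'gamma': [0.001, 0.01, 0.1]
}

gs = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
gs.fit(digits.data, digits.target)

print(gs.best_params_)   # the best combination found
print(gs.best_score_)    # its mean cross-validated accuracy

Note how even this small grid (2 x 3 x 3 = 18 combinations with 5-fold cross-validation) already requires 90 fits, which shows why grid-searching becomes expensive so quickly.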
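As a minimal matplotlib sketch for the visualization step (using the digits dataset as an assumed example, since it is the one employed later in this unit):

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# Show the first ten digit images together with their target labels
fig, axes = plt.subplots(1, 10, figsize=(10, 2))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.imshow(image, cmap='gray')
    ax.set_title(label)
    ax.axis('off')
plt.show()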
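The following sketch illustrates the vectorization point made in the GPU support section by comparing an explicit Python loop with the equivalent vectorized NumPy expression (the row-norm computation is only an illustrative example):

import numpy as np
import timeit

X = np.random.normal(size=(1000, 64))

def row_norms_loop(X):
    # Explicit loops: one scalar operation at a time, interpreted by Python
    out = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        s = 0.0
        for j in range(X.shape[1]):
            s += X[i, j] ** 2
        out[i] = np.sqrt(s)
    return out

def row_norms_vectorized(X):
    # A single vectorized expression: NumPy dispatches to optimized native code
    return np.sqrt(np.sum(X ** 2, axis=1))

print(timeit.timeit(lambda: row_norms_loop(X), number=10))
print(timeit.timeit(lambda: row_norms_vectorized(X), number=10))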
A brief introduction to distributed architectures

• Many real-life problems can be solved using single machines with enough computational power; however, in some cases, the amount of data is so large that it is impossible to perform in-memory operations.
• For example, a user-product matrix can become extremely large, and the only way to solve the problem is to employ a distributed architecture. A generic schema is shown in the following diagram:

[Diagram: generic distributed architecture]

• In this architecture, there are generally two main components: a master and some computational nodes. Once the model has been designed and deployed, the master starts analyzing it to determine the optimal physical distribution of the jobs.
• The first step is loading the training and validation datasets, which we assume to be too large to fit into the memory of a single node. Hence, the master instructs each node to load only a part and work with it until a new command is received.
• At this point, each node is ready to start its computation, which is always local (that is, node A doesn't know anything about its peers). Once finished, the results are sent back to the master, which will run further operations. If the process is iterative (as it often is in machine learning), the master is also responsible for checking the target value (accuracy, loss, or any other indicator), sharing the new parameters with all the nodes, and restarting the computations. A toy sketch of this pattern follows.
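As a toy, single-machine sketch of this master/worker pattern (Python's multiprocessing stands in for a real cluster framework such as Apache Spark or Dask; all names and the mean computation are illustrative assumptions):

import numpy as np
from multiprocessing import Pool

def local_mean(partition):
    # Local computation: each "node" only sees its own partition
    return partition.sum(axis=0), len(partition)

if __name__ == '__main__':
    X = np.random.normal(size=(100000, 10))

    # The "master" assigns one partition per worker
    partitions = np.array_split(X, 4)

    with Pool(processes=4) as pool:
        results = pool.map(local_mean, partitions)

    # Master-side reduction: combine the partial sums into a global mean
    total = sum(s for s, _ in results)
    count = sum(n for _, n in results)
    print(total / count)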
Scikit-learn tools for machine learning architectures

• scikit-learn offers two very important classes that can help the machine learning engineer to create complex processing structures, including all the steps needed to generate the desired outcomes from the raw datasets:
• Pipelines
• Feature unions

Pipelines

• scikit-learn provides a flexible mechanism for creating pipelines made up of subsequent processing steps.
• The Pipeline class accepts a single steps parameter, which is a list of tuples in the form (name of the component, instance), and creates a complex object with the standard fit/transform interface.
• For example, if we need to apply a PCA and a standard scaling, and then we want to classify using an SVM, we could create a pipeline in the following way:

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pca = PCA(n_components=10)
scaler = StandardScaler()
svc = SVC(kernel='poly', gamma=3)

steps = [
    ('pca', pca),
    ('scaler', scaler),
    ('classifier', svc)
]

pipeline = Pipeline(steps)

• At this point, the pipeline can be fitted like a single classifier (using the standard fit() and fit_transform() methods), even though internally the input samples are first passed to the PCA instance, the reduced dataset is normalized by the StandardScaler instance, and, finally, the resulting samples are passed to the classifier.
• A pipeline is also very useful together with GridSearchCV, to evaluate different combinations of parameters, not limited to a single step but considering the whole process (a sketch of this combination closes the unit).

Feature unions

• Another interesting class provided by scikit-learn is FeatureUnion, which allows concatenating different feature transformations into a single output matrix.
• The main difference from a pipeline (which can also include a feature union) is that a pipeline applies its steps sequentially, each one working on the output of the previous step, while a feature union applies its transformers in parallel to the same input and creates a unified dataset where the different preprocessing outcomes are joined together.
• For example, considering the previous results, we could try to optimize our dataset by performing a PCA with 10 components joined with the selection of the best 5 features chosen according to the ANOVA metric. In this way, the dimensionality is reduced to 15 instead of 20:

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion

digits = load_digits()

steps_fu = [
    ('pca', PCA(n_components=10)),
    ('kbest', SelectKBest(f_classif, k=5))
]

fu = FeatureUnion(steps_fu)

svc = SVC(kernel='rbf', C=5.0, gamma=0.05)

pipeline_steps = [
    ('fu', fu),
    ('scaler', scaler),
    ('classifier', svc)
]

pipeline = Pipeline(pipeline_steps)

• Performing a cross-validation, we get the following:

from sklearn.model_selection import cross_val_score

print(cross_val_score(pipeline, digits.data, digits.target, cv=10).mean())
0.965464333604

• The score is slightly lower than before (by less than 0.002), but the number of features has been considerably reduced, and therefore so has the computational time.
• Joining the outputs of different data preprocessors is a form of data augmentation, and it must always be taken into account when the original number of features is too high or redundant/noisy and a single decomposition method doesn't succeed in capturing all the dynamics.
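Finally, the sketch promised in the Pipelines section: combining the whole pipeline above with GridSearchCV, continuing from the previous snippet. Inner parameters are addressed with scikit-learn's step__parameter naming convention; the grid values themselves are arbitrary illustrative choices:

from sklearn.model_selection import GridSearchCV

# Hypothetical grid over the whole process: feature union components,
# selected features, and classifier parameters are tuned together
param_grid = {
    'fu__pca__n_components': [10, 15],
    'fu__kbest__k': [3, 5],
    'classifier__C': [1.0, 5.0],
    'classifier__gamma': [0.01, 0.05]
}

gs = GridSearchCV(pipeline, param_grid=param_grid, cv=10)
gs.fit(digits.data, digits.target)

print(gs.best_params_)   # best combination across all the steps
print(gs.best_score_)    # its mean cross-validated accuracy

Because the grid spans the whole process, the search evaluates not only the classifier's parameters but also how many components and selected features the feature union should produce, which is exactly the "whole process" optimization mentioned above.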