Unit-V
Machine Learning Architectures
A machine learning architecture should be able to preprocess the input data, decompose/augment it, classify/cluster it, and, eventually, show the results using graphical tools. This unit covers:
• Data collection, preprocessing, and
augmentation
• Normalization, regularization, and
dimensionality reduction
• Vectorized computation and GPU support
• Distributed architectures
• Pipelines and feature unions
• A machine learning engineer often has to design a full architecture that a layman would consider a black box, where the raw data enters and the outcomes are automatically produced.
• All the steps necessary to achieve the final goal must
be correctly organized and seamlessly joined together
in a processing chain similar to a computational graph.
• However, there are some common steps that are
normally included in almost any ML pipeline.
• In the following diagram, there's a schematic representation of this process:

[Diagram: Machine Learning Architecture]
Modeling/grid search/cross-validation
• Modeling implies the choice of the classification/clustering algorithm that best suits every specific task.
• The success of a machine learning technique also depends on the right choice of every parameter involved in the model.
• As with data augmentation, it's very difficult to find a precise method to determine the optimal values to assign, and the best approach is almost always based on a grid search.
• scikit-learn provides a very flexible mechanism to investigate the performance of a model with different parameter combinations; together with cross-validation, this is a more reasonable approach, even for expert engineers (see the sketch below).
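• As an illustration (a minimal sketch, not part of the original slides), cross-validation can be used to compare candidate models before committing to one; the models and the digits dataset below are only examples:

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

digits = load_digits()

# Score each candidate with 10-fold cross-validation and compare the means
for name, model in [('svc', SVC(kernel='rbf')),
                    ('knn', KNeighborsClassifier(n_neighbors=5))]:
    scores = cross_val_score(model, digits.data, digits.target, cv=10)
    print(name, scores.mean())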
Grid search
• Grid-searching is the process of scanning a grid of candidate parameter values to find the optimal configuration for a given model. Depending on the type of model utilized, certain parameters are necessary.
• Grid-searching can be applied across machine learning to
calculate the best parameters to use for any given model. It
is important to note that Grid-searching can be extremely
computationally expensive and may take your machine
quite a long time to run.
• Grid search will build a model for each possible parameter combination. It iterates through every combination and stores a model for each one, as in the sketch below.
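• A minimal sketch (not part of the original slides) of a grid search with scikit-learn's GridSearchCV; the estimator, dataset, and parameter values are illustrative only:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

digits = load_digits()

# Candidate values for each hyperparameter (illustrative)
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1.0, 10.0],
    'gamma': ['scale', 0.01, 0.001]
}

# One model is fitted per parameter combination and per cross-validation fold
gs = GridSearchCV(SVC(), param_grid, cv=10, n_jobs=-1)
gs.fit(digits.data, digits.target)

print(gs.best_params_)
print(gs.best_score_)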
Visualization
• Sometimes, it's useful/necessary to visualize the
results of intermediate and final steps.
• Plots and diagrams can be created with matplotlib, which is part of the SciPy ecosystem and provides a flexible and powerful graphics infrastructure (see the sketch below).
• Many new projects are being developed, offering
new and more stylish plotting functions. One of
them is Bokeh (http://bokeh.pydata.org), which
works using some JavaScript code to create
interactive graphs that can be embedded into web
pages too.
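• A minimal sketch (not part of the original slides) of visualizing intermediate results with matplotlib; the score values are made up for illustration:

import matplotlib.pyplot as plt
import numpy as np

# Illustrative cross-validation scores to be visualized
scores = np.array([0.95, 0.96, 0.94, 0.97, 0.96])

plt.bar(np.arange(len(scores)), scores)
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.title('Cross-validation scores')
plt.show()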
Vectorized computation and GPU support
• Frameworks such as scikit-learn are mainly
based on NumPy, which contains highly
optimized native code that runs on multi-core
CPUs. Moreover, NumPy works with Single
Instruction Multiple Data (SIMD) commands
and exposes vectorized primitives. In many cases, this feature makes it possible to get rid of explicit loops while also reducing execution time, as in the sketch below.
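• A minimal sketch (not part of the original slides) contrasting an explicit Python loop with the equivalent vectorized NumPy operation:

import numpy as np

X = np.random.uniform(size=(10000, 50))
w = np.random.uniform(size=50)

# Explicit loop: one dot product per sample
scores_loop = np.array([np.dot(x, w) for x in X])

# Vectorized equivalent: a single matrix-vector product executed by optimized native code
scores_vec = X.dot(w)

print(np.allclose(scores_loop, scores_vec))  # True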
A brief introduction to distributed
architectures
• Many real-life problems can be solved using a single machine with enough computational power; however, in some cases, the amount of data is so large that it is impossible to perform in-memory operations.
• For example, a user-product matrix can become extremely large, and the only way to solve the problem is to employ distributed architectures. A generic schema is shown in the following diagram:
• In this architecture, there are generally two main components: a master and some computational nodes. Once the model has been designed and deployed, the master starts analyzing it to determine the optimal physical distribution of the jobs.
• The first step is loading the training and validation
datasets, which we assume to be too large to fit into the
memory of a single node. Hence, the master will inform
each node to load only a part and work with it until a new
command is received.
• At this point, each node is ready to start its computation, which is always local (that is, node A doesn't know anything about its peers). Once finished, the results are sent back to the master, which will run further operations. If the process is iterative (as often happens in machine learning), the master is also responsible for checking the target value (accuracy, loss, or any other indicator), sharing the new parameters with all the nodes, and restarting the computations. A single-machine sketch of this pattern follows.
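• A minimal single-machine sketch (not part of the original slides) of the master/worker pattern using multiprocessing: the "master" process splits the data, each "worker" computes a local partial result, and the master aggregates them. Real distributed frameworks follow the same idea at a much larger scale.

import numpy as np
from multiprocessing import Pool

def local_sum(chunk):
    # Each worker computes only on its own partition (no knowledge of its peers)
    return chunk.sum(axis=0), len(chunk)

if __name__ == '__main__':
    X = np.random.uniform(size=(1000000, 10))
    chunks = np.array_split(X, 4)  # the master distributes the data

    with Pool(4) as pool:
        partials = pool.map(local_sum, chunks)

    # The master aggregates the partial results into the global mean
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    print(total / count)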
Scikit-learn tools for machine learning
architectures
• Two very important scikit-learn classes can help the machine learning engineer create complex processing structures, including all the steps needed to generate the desired outcomes from the raw datasets:
• Pipelines
• Feature unions
Pipelines
• Scikit-learn provides a flexible mechanism for
creating pipelines made up of subsequent
processing steps.
• The Pipeline class accepts a single steps parameter, which is a list of tuples in the form (name of the component, instance), and creates a complex object with the standard fit/transform interface.
• For example, if we need to apply a PCA, a standard scaling, and then we want to classify using an SVM, we could create a pipeline in the following way:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pca = PCA(n_components=10)
scaler = StandardScaler()
svc = SVC(kernel='poly', gamma=3)

steps = [
    ('pca', pca),
    ('scaler', scaler),
    ('classifier', svc)
]

pipeline = Pipeline(steps)
• At this point, the pipeline can be fitted like a single classifier (using the standard fit() and fit_transform() methods), even though the input samples are first passed to the PCA instance, the reduced dataset is normalized by the StandardScaler instance, and, finally, the resultant samples are passed to the classifier.
• A pipeline is also very useful together with GridSearchCV to evaluate different combinations of parameters, not limited to a single step but considering the whole process, as in the sketch below.
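• A minimal sketch (not part of the original slides) of grid-searching over the whole pipeline defined above; parameters are addressed with the 'step name__parameter' convention, and the values are illustrative only:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV

digits = load_digits()

# Tune the PCA and the SVC steps of the pipeline at the same time
param_grid = {
    'pca__n_components': [5, 10, 20],
    'classifier__C': [0.1, 1.0, 10.0]
}

gs = GridSearchCV(pipeline, param_grid, cv=10, n_jobs=-1)
gs.fit(digits.data, digits.target)

print(gs.best_params_)
print(gs.best_score_)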
Feature unions
Another interesting class provided by scikit-learn
is the FeatureUnion class, which allows
concatenating different feature transformations
into a single output matrix. The main difference
with a pipeline (which can also include a feature
union) is that the pipeline selects from
alternative scenarios, while a feature union
creates a unified dataset where different
preprocessing outcomes are joined together.
• For example, considering the previous results, we could try to optimize our dataset by performing a PCA with 10 components joined with the selection of the best 5 features chosen according to the ANOVA metric. In this way, the dimensionality is reduced to 15 instead of 20:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion

steps_fu = [
    ('pca', PCA(n_components=10)),
    ('kbest', SelectKBest(f_classif, k=5))
]

fu = FeatureUnion(steps_fu)
svc = SVC(kernel='rbf', C=5.0, gamma=0.05)

pipeline_steps = [
    ('fu', fu),
    ('scaler', scaler),
    ('classifier', svc)
]

pipeline = Pipeline(pipeline_steps)
• Performing a cross-validation, we get the following:

from sklearn.datasets import load_digits  # assuming the scikit-learn digits dataset used in the previous examples
from sklearn.model_selection import cross_val_score

digits = load_digits()
print(cross_val_score(pipeline, digits.data, digits.target, cv=10).mean())

0.965464333604
• The score is slightly lower than before (by less than 0.002), but the number of features has been considerably reduced, and therefore so has the computational time.
• Joining the outputs of different data preprocessors is a form of data augmentation, and it must always be taken into account when the original number of features is very high, or the features are redundant/noisy, and a single decomposition method doesn't succeed in capturing all the dynamics.