
What deep learning really means

GPUs in the cloud put the predictive power of deep
neural networks within reach of every developer
FEB 6, 2017
Perhaps the most positive technical theme of 2016 was the long-delayed triumph of
artificial intelligence, machine learning, and in particular deep learning. In this article
we'll discuss what that means and how you might make use of deep learning yourself.
Perhaps you noticed in the fall of 2016 that Google Translate suddenly went from
producing, on average, word salad with a vague connection to the original language to
emitting polished, coherent sentences more often than not -- at least for supported
language pairs, such as English-French, English-Chinese, and English-Japanese. That
dramatic improvement was the result of a nine-month concerted effort by the Google
Brain and Google Translate teams to revamp Translate from using its old phrase-based
statistical machine translation algorithms to working with a neural network trained with
deep learning and word embeddings employing Google's TensorFlow framework.
[ Find out which machine learning and deep learning frameworks are for you with
our Test Center comparison of TensorFlow, Spark MLlib, Scikit-learn, MXNet,
Microsoft Cognitive Toolkit, and Caffe. | Jump into Microsoft’s drag-and-drop
machine learning studio: Get started with Azure Machine Learning. | The
InfoWorld review roundup: AWS, Microsoft, Databricks, Google, HPE, and IBM
machine learning in the cloud. ]
Was that magic? No, not at all: It wasn't even easy. The researchers working on the
conversion had access to a huge corpus of translations from which to train their networks,
but they soon discovered that they needed thousands of GPUs for training and would
have to create a new kind of chip, a Tensor Processing Unit (TPU), to run Translate on
their trained neural networks at scale. They also had to refine their networks hundreds of
times as they tried to train a system that would be nearly as good as human translators.
Do you need to be Google scale to take advantage of deep learning? Thanks to cloud
offerings, the answer is an emphatic no. Not only can you run cloud VM and container
instances with many CPU cores and large amounts of RAM, you can get access to GPUs,
as well as prebuilt images that include deep learning software.
Conventional programming
To grasp how deep learning works, you’ll need to understand a bit about machine
learning and neural networks, which in effect are themselves defined by how they differ
from conventional programming.
Conventional programming involves writing specific instructions for the computer to
execute. For example, take the classic "Hello, World" program in the C programming
language:
/* Hello World program */
#include <stdio.h>

int main(void)
{
    printf("Hello, World");
    return 0;
}
This program, when compiled and linked, does one thing: It prints the string "Hello,
World" on the standard output port. It does only what the programmer told it to do, and it
does the same thing every time it runs.
You may wonder how game programs sometimes give different outputs from the same
inputs, such as swinging your character's ax at a dragon. That requires the use of a
random number generator and a program that performs different actions based on the
number returned by the generator:
#include <stdio.h>
#include <stdlib.h>

typedef int BOOL;
#define SOME_THRESHOLD (RAND_MAX / 2)  /* roughly a 50 percent chance to hit */

BOOL Swing_ax_at_dragon(void)
{
    /* rand() returns a pseudorandom number; the outcome depends on it */
    BOOL retval = rand() > SOME_THRESHOLD;
    if (retval)
        printf("Your ax hits. Dragon dies.\n");
    else
        printf("Your ax misses. Dragon spits flames.\n");
    return retval;
}
In other words, if we want a conventional program to vary statistically instead of
behaving consistently, we have to program the variation. Machine learning turns that idea
on its head.
Machine learning
In machine learning (ML), the essential task is to create a predictor of future outputs from
some set of inputs. This is accomplished by training the predictor statistically from
historical data.
If the value predicted is a real number, then you are solving a regression problem, such as
"What will the price of MSFT stock be on Tuesday at noon?" The complete history of
MSFT stock transactions is available for training, as are all the related stocks, news, and
economic data that might correlate to the stock price.
If you are predicting a yes or no response, then you are solving a binary or two-class
classification problem, such as "Will the price of MSFT stock go up between now and
Tuesday at noon?" The corpus of data is the same as the regression problem, but the
algorithms for optimizing the predictor will be different.
If you are predicting more than two classes, then you are solving a multiclass
classification problem, such as "What's the best action for MSFT stock? Buy, sell, or
hold?" Again, the corpus of data is the same, but the algorithms might be different.
In general, when you do ML you first prepare the historical data (see my tutorial on
Azure ML for an example), then split it randomly into two groups: one for training and
one for testing. When you process the data for training, you use the known target value;
when you process the data for testing, you predict the target value from the other data (no
peeking!) and compute the error rates by comparing the prediction to the known target
value.
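To make that split concrete, here is a minimal sketch in Python using only numpy; the feature matrix, targets, and 25 percent holdout are illustrative placeholders, not tied to any particular library or data set.

import numpy as np

def split_historical_data(features, targets, test_fraction=0.25, seed=42):
    # Shuffle the row indices, then carve off a held-out test portion.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(len(features) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (features[train_idx], targets[train_idx],
            features[test_idx], targets[test_idx])

# Made-up historical data: 1,000 rows, 10 numeric features, and a binary
# target such as "did the stock go up by Tuesday at noon?"
X = np.random.rand(1000, 10)
y = (np.random.rand(1000) > 0.5).astype(int)
X_train, y_train, X_test, y_test = split_historical_data(X, y)
# Train only on (X_train, y_train); the error rate is the fraction of
# X_test rows whose predicted target disagrees with the known value.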
Microsoft's Machine Learning Algorithm Cheat Sheet is a good resource
for picking algorithms, especially if you're using Azure ML or another general-purpose
ML library or service. For the case of stock market data, Decision Forest (known for
accuracy and fast training) might be a good first algorithm for regression, Logistic
Regression (fast training, linear model) might be a good first algorithm for two-class
classification, and Decision Jungle (accuracy, small memory footprint) might be a good
first algorithm for multiclass classification.
By the way, the only way to find the best algorithm is to try them all. Some ML packages
and services, such as Spark.ML, can parallelize that for you and help pick the best result.
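As a serial sketch of that try-them-all idea (a service such as Spark.ML would run the candidates in parallel instead), the snippet below fits a few scikit-learn models on synthetic stand-in data and keeps whichever scores best on the held-out split; the candidate list and data are purely illustrative.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for historical features and a binary up/down label.
X = np.random.rand(1000, 10)
y = (X[:, 0] + 0.1 * np.random.randn(1000) > 0.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100),
    "neural network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000),
}
# Fit each candidate on the training split, score it on the test split.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in candidates.items()}
print(scores, "-> best:", max(scores, key=scores.get))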
Note that neural networks are an option for any of the three kinds of prediction problems.
Also note that neural networks are known both for accuracy and long training times. So
what are neural networks, other than one of the more time-consuming but accurate
approaches to machine learning?
Neural networks
The ideas for neural networks go back to the 1940s. The essential concept is that a
network of artificial neurons built out of interconnected threshold switches can learn to
recognize patterns in the same way that an animal brain and nervous system (including
the retina) does.
The learning occurs basically by strengthening the connection between two neurons when
both are active at the same time during training; in modern neural network software this
is most commonly a matter of increasing the weight values for the connections between
neurons using a rule called back propagation of error, backprop, or BP.
How are the neurons modeled? Each has a propagation function that transforms the
outputs of the connected neurons, often with a weighted sum. The output of the
propagation function passes to an activation function, which fires when its input exceeds
a threshold value.
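In code, one such neuron is only a few lines; the sketch below (plain numpy, with made-up weights) uses a simple step threshold as the activation.

import numpy as np

def neuron(inputs, weights, bias, threshold=0.0):
    z = np.dot(weights, inputs) + bias       # propagation function: weighted sum
    return 1.0 if z > threshold else 0.0     # activation: fire if above threshold

# Two inputs, illustrative weights and bias.
print(neuron(np.array([0.5, 0.2]), np.array([0.8, -0.4]), bias=0.1))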
In the 1940s and '50s artificial neurons used a step activation function and were
called perceptrons. Modern neural networks are still sometimes described in terms of perceptrons, but they have smooth or piecewise-linear activation functions, such as the logistic or sigmoid function, the hyperbolic
tangent, and the Rectified Linear Unit (ReLU). ReLU is usually the best choice for fast
convergence, although it has an issue of neurons "dying" during training if the learning
rate is set too high.
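For reference, those activation functions are one-liners in numpy (ReLU is piecewise linear rather than strictly smooth, but it is what most deep networks use today):

import numpy as np

def sigmoid(x):          # logistic function, squashes input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):             # hyperbolic tangent, squashes input into (-1, 1)
    return np.tanh(x)

def relu(x):             # Rectified Linear Unit: max(0, x)
    return np.maximum(0.0, x)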
The output of the activation function can pass to an output function for additional
shaping. Often, however, the output function is the identity function, meaning that the
output of the activation function is passed to the downstream connected neurons.
Now that we know about the neurons, we need to learn about the common neural network
topologies. In a feed-forward network, the neurons are organized into distinct layers: one
input layer, N hidden processing layers, and one output layer, and the outputs from each
layer go only to the next layer.
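A feed-forward pass is essentially a chain of weighted sums and activations, one per layer; in the sketch below the layer sizes are arbitrary and the random weights stand in for a trained network.

import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 3]     # input layer, two hidden layers, output layer
weights = [0.1 * rng.standard_normal((m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def feed_forward(x):
    # The outputs of each layer go only to the next layer.
    for W, b in zip(weights, biases):
        x = np.maximum(0.0, x @ W + b)   # weighted sum, then ReLU activation
    return x

print(feed_forward(rng.standard_normal(4)))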
In a feed-forward network with shortcut connections, some connections can jump over
one or more intermediate layers. In recurrent neural networks, neurons can influence
themselves, either directly, or indirectly through the next layer.
Supervised learning of a neural network is done exactly like any other machine learning:
You present the network with groups of training data, compare the network output with
the desired output, generate an error vector, and apply corrections to the network based
on the error vector. The groups of training data that are run together before applying corrections are called batches (or minibatches); one complete pass through the full training set is called an epoch.
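Here is a deliberately tiny illustration of that loop: a single sigmoid output neuron trained on synthetic data by presenting minibatches, computing the error vector, and applying corrections to the weights. Everything here (data, sizes, learning rate) is a placeholder.

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))            # synthetic training inputs
y = (X[:, 0] - X[:, 1] > 0).astype(float)     # known target values

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(5), 0.0, 0.1
for epoch in range(20):                       # one epoch = one full pass
    order = rng.permutation(len(X))
    for start in range(0, len(X), 50):        # minibatches of 50 examples
        idx = order[start:start + 50]
        error = sigmoid(X[idx] @ w + b) - y[idx]   # error vector for the batch
        w -= lr * X[idx].T @ error / len(idx)      # corrections from the error
        b -= lr * error.mean()

print("training accuracy:", ((sigmoid(X @ w + b) > 0.5) == y).mean())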
For those interested in the details, back propagation uses the gradient of the error (or cost)
function with respect to the weights and biases of the model to discover the correct
direction to minimize the error. Two things control the application of corrections: the
optimization algorithm and the learning rate variable, which usually needs to be small to
guarantee convergence and avoid causing dead ReLU neurons.
Optimizers for neural networks typically use some form of gradient descent algorithm to
drive the back propagation, often with a mechanism to help avoid becoming stuck in
local minima, such as optimizing randomly selected minibatches (Stochastic Gradient
Descent) and applying momentum corrections to the gradient. Some optimization
algorithms also adapt the learning rates of the model parameters by looking at the
gradient history (AdaGrad, RMSProp, and Adam).
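Written as update rules, the differences between those optimizers are small; the sketch below shows plain gradient descent, momentum, and a simplified Adam-style step (bias correction omitted), with grad standing for the gradient that back propagation supplies.

import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain gradient descent: step against the gradient.
    return w - lr * grad

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # A running velocity smooths the steps and helps roll past shallow local minima.
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adam_step(w, grad, m, v, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adapt each parameter's step size to its own gradient history.
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of squared gradients
    return w - lr * m / (np.sqrt(v) + eps), m, v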
As with all machine learning, you need to check the predictions of the neural network
against a separate test data set. Without doing that, you risk creating neural networks that
only memorize their inputs instead of learning to be generalized predictors.
Deep learning
Now that you know something about machine learning and neural networks, it's only a
small step to understanding the nature of deep learning algorithms.
The dominant deep learning algorithms are deep neural networks (DNNs), which are
neural networks constructed from many layers (hence the term "deep") of alternating
linear and nonlinear processing units, and are trained using large-scale algorithms and
massive amounts of training data. A deep neural network might have 10 to 20 hidden
layers, whereas a typical neural network may have only a few.
The more layers in the network, the more characteristics it can recognize. Unfortunately,
the more layers in the network, the longer it will take to calculate, and the harder it will
be to train.
Another kind of deep learning algorithm is the Random Decision Forest (RDF). Again, it is deep in the sense of having many levels, but instead of neurons an RDF is built from decision trees and outputs a statistical average (mode or mean) of the predictions of
the individual trees. The randomized aspects of RDFs are the use of bootstrap
aggregation (bagging) for individual trees and taking random subsets of the features.
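With scikit-learn, both randomized ingredients appear as explicit parameters; the built-in digits data set below is just a convenient example, not part of the stock-market discussion above.

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # number of decision trees in the ensemble
    bootstrap=True,       # bagging: each tree trains on a bootstrap sample of rows
    max_features="sqrt",  # each split considers a random subset of the features
    random_state=0,
)
forest.fit(X_train, y_train)
print("held-out accuracy:", forest.score(X_test, y_test))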
Understanding why deep learning algorithms work is nontrivial. I won't say
that nobody knows why they work, since there have been papers on the subject, but I will
say there doesn't seem to be widespread consensus about why they work or how best to
construct them.
The Google Brain people creating the deep neural network for the new Google Translate
didn't know ahead of time what algorithms would work. They had to iterate and run many
weeklong experiments to make their network better, but sometimes hit dead ends and had
to backtrack. (According to the New York Times article cited earlier, "One day a model,
for no apparent reason, started taking all the numbers it came across in a sentence and
discarding them." Oops.)
There are many ways to approach deep learning, but none are perfect, at least not yet.
There are only better and worse strategies for each application.
Deep learning strategies, tactics, and applications
For an example of an application of deep learning, let's take image recognition. Since
living organisms process images with their visual cortex, many researchers have taken
the architecture of the visual cortex as a model for neural networks designed to perform
image recognition. The biological research goes back to the 1950s.
The breakthrough in the neural network field for vision was Yann LeCun's 1998 LeNet-5,
a seven-level convolutional neural network (CNN) for recognition of handwritten digits
digitized in 32-by-32-pixel images. To analyze higher-resolution images, the network
would need more neurons and more layers.
Since then, packages for creating CNNs and other deep neural networks have
proliferated. These include Caffe, Microsoft Cognitive
Toolkit, MXNet, Neon, TensorFlow, Theano, and Torch.
Convolutional neural networks typically use convolutional, pooling, ReLU, fully
connected, and loss layers to simulate a visual cortex. The convolutional layer basically
takes the integrals of many small overlapping regions. The pooling layer performs a form
of nonlinear down-sampling. ReLU layers, which we mentioned earlier, apply the
nonsaturating activation function f(x) = max(0,x). In a fully connected layer, the neurons
have full connections to all activations in the previous layer. A loss layer computes how
the network training penalizes the deviation between the predicted and true labels, using a
Softmax or cross-entropy loss for classification or a Euclidean loss for regression.
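Stacked together, those layers look something like the following sketch, written against TensorFlow's bundled Keras API (an assumption on my part; any of the packages listed above could express the same stack). The 32-by-32 grayscale input and 10 output classes echo the LeNet-style digit task, but the exact sizes are illustrative.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 1)),               # 32-by-32 grayscale images
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution over small regions
    layers.MaxPooling2D((2, 2)),                   # pooling: nonlinear down-sampling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),        # one output per class
])
# The loss "layer": cross-entropy penalizes the deviation between
# predicted and true labels during training.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()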
Besides image recognition, CNNs have been applied to natural language processing, drug
discovery, and playing Go.
Natural language processing (NLP) is another major application area for deep learning. In
addition to the machine translation problem addressed by Google Translate, major NLP
tasks include automatic summarization, co-reference resolution, discourse analysis,
morphological segmentation, named entity recognition, natural language generation,
natural language understanding, part-of-speech tagging, sentiment analysis, and speech
recognition.
In addition to CNNs, NLP tasks are often addressed with recurrent neural networks
(RNNs), which include the Long Short-Term Memory (LSTM) model. As I mentioned
earlier, in recurrent neural networks, neurons can influence themselves, either directly, or
indirectly through the next layer. In other words, RNNs can have loops, which gives them
the ability to persist some information history when processing sequences -- and language
is nothing without sequences. LSTMs are a particularly attractive form of RNN that have
a more powerful update equation and a more complicated repeating module structure.
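Here is a minimal LSTM text classifier, again sketched against TensorFlow's bundled Keras API under the same caveats; the vocabulary size, sequence length, and binary sentiment target are all illustrative.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(200,)),                        # sequences of 200 word indices
    layers.Embedding(input_dim=20000, output_dim=64),  # learned word embeddings
    layers.LSTM(64),                                   # recurrent layer carries history across the sequence
    layers.Dense(1, activation="sigmoid"),             # e.g., positive vs. negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()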
Running deep learning
Needless to say, deep CNNs and LSTMs often require serious computing power for
training. Remember how the Google Brain team needed a couple thousand GPUs to train
the new A.I. version of Google Translate? That's no joke. A training session that takes
three hours on one GPU is likely to take 30 hours on a CPU. Also, the kind of GPU
matters: For most deep learning packages, you need one or more CUDA-compatible
Nvidia GPUs with enough internal memory to run your models.
That may mean you'll want to run your training in the cloud: AWS, Azure, and Bluemix
all offer instances with GPUs as of this writing, as will Google early in 2017.
While the biggest cloud GPU instances can cost $14 per hour to run, there are less
expensive alternatives. An AWS instance with a single GPU can cost less than $1 per
hour to run, and the Azure Batch Shipyard and its deep learning recipes using the NC
series of GPU-enabled instances run your training in a compute pool, with the small NC6
instances going for 90 cents an hour.
Yes, you can and should install your deep learning package of choice on your own
computer for learning purposes, whether or not it has a suitable GPU. But when it comes
time to train models at scale, you probably won't want to limit yourself to the hardware
you happen to have on site.
For deeper learning
You can learn a lot about deep learning simply by installing one of the deep learning
packages, trying out its samples, and reading its tutorials. For more depth, consider one or
more of the following resources:
Neural Networks and Deep Learning, by Michael Nielsen
A Brief Introduction to Neural Networks, by David Kriesel
Deep Learning, by Yoshua Bengio, Ian Goodfellow, and Aaron Courville
A Course in Machine Learning, by Hal Daumé III
The TensorFlow Playground, by Daniel Smilkov and Shan Carter
Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition