How to classify MNIST digits with
different neural network
architectures
Getting started with neural networks and Keras
Tyler Elliot Bettilyon · Published in Teb’s Lab · 16 min read · Aug 8, 2018
Photo by Greg Rakozy on Unsplash
This is the third article in the series, and is associated with our Intro to Deep
Learning Github repository, where you can find practical examples of many
deep learning applications and tactics. You can find the first article in the
series here, the second article here, and the fourth article here.
Please note: All of the code samples below can be found and run in this
Jupyter Notebook kindly hosted by Google Colaboratory. I encourage you to
copy the code, make changes, and experiment with the networks yourself as
you read this article.
Neural networks
Although neural networks have gained enormous popularity over the last few
years, for many data scientists and statisticians the whole family of models
has (at least) one major flaw: the results are hard to interpret. One of the
reasons that people treat neural networks as a black box is that the structure
of any given neural network is hard to think about.
Neural networks frequently have anywhere from hundreds of thousands to
millions of weights that are individually tuned during training to minimize
error. With so many variables interacting in complex ways, it is difficult to
describe exactly why one particular neural network outperforms some other
neural network. This complexity also makes it hard to design top-tier neural
network architectures.
Some machine learning terminology appears here, in case you haven’t seen it
before:
The name x refers to input data, while the name y refers to the labels. ŷ
(pronounced y-hat) refers to the predictions made by a model.
Training data is the data our model learns from.
Test data is kept secret from the model until after it has been trained.
Test data is used to evaluate our model.
A loss function quantifies how far a model’s predictions are from the
correct labels.
An optimization algorithm controls exactly how the weights of the
computational graph are adjusted during training.
For a refresher about splitting up test and training data, or if this is new
information, consider reading this article.
MNIST handwritten digits dataset
In this article, we’re going to work through a series of simple neural network
architectures and compare their performance on the MNIST handwritten
digits dataset. The goal for all the networks we examine is the same: take an
input image (28x28 pixels) of a handwritten single digit (0–9) and classify the
image as the appropriate digit.
State-of-the-art neural network approaches have achieved near-perfect
performance, classifying 99.8% of digits correctly from a held-out test set of
digits. This impressive performance has real-world benefits as well. The US
Postal Service processes 493.4 million pieces of mail per day, and 1% of that
workload is 4.9 million pieces of mail. Accurate automation can prevent
postal workers from individually handling and examining millions of parcels
each day. Of course, automatically reading complete addresses isn’t as
simple as processing individual digits, but let’s learn to crawl before we try to
jog.
It’s always a good idea to familiarize yourself with a dataset before diving into
any machine learning task. Here are some examples of the images in the
dataset:
A random selection of MNIST digits. In the Jupyter Notebook you can view more random selections from the
dataset.
The MNIST dataset is a classic problem for getting started with neural
networks. I’ve heard a few people joke that it’s the deep learning version of
“hello world”— a lot of simple networks do a surprisingly good job with the
dataset, even though some of the digits are pretty tricky:
This image is from the wonderful book Neural Networks and Deep Learning, available online for free.
Preparing the data
The first and most important step in any machine learning task is to prepare
the data. For many scientists and industry practitioners, the process of
gathering, cleaning, labeling, and storing the data into a usable digital format
represents the lion’s share of the work. Additionally, any errors introduced
during this step will cause the learning algorithm to learn incorrect patterns.
As they say: garbage in, garbage out.
Thanks to the Keras library and the hard work of the National Institute of
Standards and Technology (the NIST of MNIST) the hardest parts have been
done for us. The data’s been collected and is already well-formatted for
processing. Therefore, it is with deep gratitude for NIST and the Keras
maintainers that our Python code for getting the data is simple:
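The exact cell is in the linked notebook; a minimal sketch of it might look like
the following (written against tensorflow.keras, though the original notebook
predates it and uses the standalone keras package, whose API is the same here):

# Load the MNIST digits; Keras downloads and caches them on first use.
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape)  # (60000, 28, 28): 60,000 training images
print(x_test.shape)   # (10000, 28, 28): 10,000 test images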
Relevant XKCD — Python really is wonderful.
Once we have the dataset, we have to format it appropriately for our neural
network. This article is focused only on fully connected neural networks,
which means our input data must be a vector. Instead of several 28x28
images, we’ll have several vectors that are all length 784 (28*28=784). This
flattening process is not ideal — we obfuscate information about which pixels
are next to each other.
Our networks will overcome this loss of information, but it is worth mentioning
convolutional neural networks (CNNs). These are specifically designed for
image processing/computer vision, and maintain these spatial relationships.
In a future article, we will revisit MNIST with CNNs and compare our results.
Keras, again, provides a simple utility to help us flatten the 28x28 pixels into a
vector:
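A sketch of that step, reshaping each image into a vector of length 784. The
scaling of pixel values into the 0–1 range is a conventional extra step that I am
assuming here, not something the text above requires:

# Flatten each 28x28 image into a vector of length 784. The division
# by 255 rescales pixel values to 0-1 (an assumed, common convention).
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255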
We have to do one last thing to this dataset before we’re ready to experiment
with some neural networks. The labels for this dataset are numerical values
from 0 to 9 — but it’s important that our algorithm treats these as items in a
set, rather than ordinal values. In our dataset the value “0” isn’t smaller than
the value “9”, they are just two different values from our set of possible
classifications.
If our algorithm predicts “8” when it should predict “0” it is wrong to say that
the model was “off by 8” — it simply predicted the wrong category. Similarly,
predicting “7” when we should have predicted “8” is not better than predicting
“0” when we should have predicted “8” — both are just wrong.
To address this issue, when we’re making predictions about categorical data
(as opposed to values from a continuous range), the best practice is to use a
“one-hot encoded” vector. This means that we create a vector as long as the
number of categories we have, and force the model to set exactly one of the
positions in the vector to 1 and the rest to 0 (the single 1 is the “hot” value
within the vector).
Thankfully, Keras makes this remarkably easy to do as well:
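Using the to_categorical utility from keras.utils:

from tensorflow.keras.utils import to_categorical

# Turn integer labels 0-9 into one-hot vectors of length ten,
# e.g. 3 -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].
y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)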
Finally, it is worth mentioning that there are a lot of other things we could do at
this point to normalize/improve the input images themselves. Preprocessing
is common (because it’s a good idea) but we’re going to ignore it for now. Our
focus is on examining neural network architectures.
Neural network architectures
For fully connected neural networks, there are three essential questions that
define the network’s architecture:
1. How many layers are there?
2. How many nodes are there in each of those layers?
3. What transfer/activation function is used at each of those layers?
This article explores the first two of these questions, while the third will be
explored in a later article. The behavior of the transfer/activation function is
closely related to gradient descent and backpropagation, so discussing the
available options will make more sense after the next article in this series.
All of the network architectures in this article use the sigmoid transfer
function for all of the hidden layers.
There are other factors that can contribute to the performance of a neural
network. These include which loss function is used, which optimization
algorithm is used, how many training epochs to run before stopping, and the
batch size within each epoch. Changes to batch size and epochs are
discussed here. But, to help us compare “apples to apples”, I have kept the
loss function and optimization algorithm fixed:
I’ve selected a common loss function called categorical cross entropy.
I’ve selected one of the simplest optimization algorithms: Stochastic
Gradient Descent (SGD).
Whew, now that all of that is out of the way, let’s build our very first network:
Building the network
All the networks in this article will have the same input layer and output layer.
We defined the input layer earlier as a vector with 784 entries — this is the
data from the flattened 28x28 image. The output layer was also implicitly
defined earlier when we created a one-hot encoded vector from the labels —
the ten labels correspond to the ten nodes in this layer.
Our output layer also uses a special activation function called softmax. This
normalizes the values from the ten output nodes such that:
all the values are between 0 and 1, and
the sum of all ten values is 1.
This allows us to treat those ten output values as probabilities, and the
largest one is selected as the prediction for the one-hot vector. In machine
learning, the softmax function is almost always used when our model’s output
is a one-hot encoded vector.
Finally, this model has a single hidden layer with 32 nodes using the sigmoid
activation function. The resulting architecture has 25,450 tunable
parameters. From the input layer to the hidden layer there are 784*32 =
25,088 weights. The hidden layer has 32 nodes so there are 32 biases. This
brings us to 25,088 + 32 = 25,120 parameters.
From the hidden layer to the output layer there are 32*10 = 320 weights.
Each of the ten nodes adds a single bias bringing us to 25,120 + 320 + 10 =
25,450 total parameters.
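A sketch of this first model in Keras, with the loss function and optimizer
fixed as described above (the layer ordering follows the description in this
article, not necessarily the notebook verbatim):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# One hidden layer of 32 sigmoid nodes, then a 10-way softmax output.
model = Sequential([
    Dense(32, activation='sigmoid', input_shape=(784,)),
    Dense(10, activation='softmax'),
])

# The loss and optimizer are held fixed across every experiment.
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])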
Keras has a handy method to help you calculate the number of parameters in
a model. Calling the .summary() method, we get:
Layer (type) Output Shape Param #
=================================================================
dense_203 (Dense) (None, 32) 25120
_________________________________________________________________
dense_204 (Dense) (None, 10) 330
=================================================================
Total params: 25,450
Trainable params: 25,450
Non-trainable params: 0
We can use Keras to train and evaluate this model as well:
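Roughly like this (the 10% validation split is my assumption; the notebook’s
split may differ):

# Five epochs with a batch size of 128; hold out part of the training
# data to track validation accuracy during training.
history = model.fit(x_train, y_train,
                    batch_size=128, epochs=5,
                    validation_split=0.1)  # assumption: 10% validation split

loss, accuracy = model.evaluate(x_test, y_test)
print('Test accuracy:', accuracy)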
Training and validation accuracy over time. Final test accuracy: 0.87.
Performance varies a little bit from run to run (give it a try in the Jupyter
notebook), but accuracy is consistently between 87–90%. This is an
incredible result. We have obfuscated spatial relationships within the data by
flattening the images. We have done zero feature extraction to help the model
understand the data. Yet, in under one minute of training on consumer grade
hardware, we’re already doing nearly nine times better than guessing
randomly.
Network depth and layer width
While there are some rules of thumb, the only way to determine the best
architecture for any particular task is empirically. Sometimes the “sensible
defaults” will work well, and other times they won’t work at all. The only way to
find out for sure if your neural network works on your data is to test it, and
measure your performance.
Neural network architecture is the subject of quite a lot of open research.
Finding a new architecture that outperforms existing architectures on a
particular task is often an achievement worthy of publication. It’s common for
practitioners to select an architecture based on a recent publication, and
either copy it wholesale for a new task or make minor tweaks to gain
incremental improvement.
Still, there is a lot to learn from reinventing some simple wheels from scratch.
Let’s examine a few alternatives to this small network and examine the
impact of those changes.
Network depth
The depth of a multi-layer perceptron (also known as a fully connected neural
network) is determined by its number of hidden layers. The network above
has one hidden layer. This network is so shallow that it’s technically
inaccurate to call it “deep learning”.
Let’s experiment with different numbers of hidden layers to see how the depth of a
network impacts its performance. I have written a couple short functions to
help reduce boilerplate throughout this tutorial:
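The real versions live in the notebook and also plot the accuracy curves; a
pared-down sketch that matches the behavior described below (minus the
graphing) might look like this:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def create_dense(layer_sizes):
    # Fully connected net: hidden layers of the given widths, all sigmoid,
    # a 784-long input vector, and a 10-way softmax output.
    model = Sequential()
    model.add(Dense(layer_sizes[0], activation='sigmoid',
                    input_shape=(784,)))
    for width in layer_sizes[1:]:
        model.add(Dense(width, activation='sigmoid'))
    model.add(Dense(10, activation='softmax'))
    return model

def evaluate(model, batch_size=128, epochs=5):
    # Fixed loss and optimizer keep the comparisons apples-to-apples.
    # (Plotting of the accuracy curves is omitted in this sketch.)
    model.summary()
    model.compile(loss='categorical_crossentropy', optimizer='sgd',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
              validation_split=0.1)  # validation fraction is an assumption
    loss, accuracy = model.evaluate(x_test, y_test)
    print('Test accuracy:', accuracy)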
The evaluate function prints a summary of the model, trains the model,
graphs the training and validation accuracy, and prints a summary of its
performance on the test data. By default it does all this using the fixed
hyperparameters we’ve discussed, specifically:
stochastic gradient descent (SGD)
five training epochs
training batch size of 128
the categorical cross entropy loss function.
The create_dense function lets us pass in an array of sizes for the hidden
layers. It creates a multi-layer perceptron that always has appropriate input
and output layers for our MNIST task. Specifically, the models will have:
an input vector of length 784
an output vector of length ten that uses a one-hot encoding and the
softmax activation function
a number of layers with the widths specified by the input array all using
the sigmoid activation function.
This code uses these functions to create and evaluate several neural nets of
increasing depth, each with 32 nodes per hidden layer:
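Something like:

# Depths one through four, all with 32 nodes per hidden layer.
for layers in range(1, 5):
    model = create_dense([32] * layers)
    evaluate(model)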
In Python: [32] * 2 => [32, 32] and [32] * 3 => [32, 32, 32], and so on...
Running this code produces some interesting charts, via the evaluate
function defined above:
One hidden layer, final test accuracy: 0.888
Two hidden layers, final test accuracy: 0.767
Three hidden layers, final test accuracy: 0.438
Four hidden layers, final test accuracy: 0.114
Overfitting
Adding more layers appears to have decreased the accuracy of the model.
That might not be intuitive — aren’t we giving the model greater flexibility
and therefore increasing its ability to make predictions? Unfortunately the
trade-off isn’t that simple.
One thing we should look for is overfitting. Neural networks are flexible
enough that they can adjust their parameters to fit the training data so
precisely that they no longer generalize to data from outside the training set
(for example, the test data). This is kind of like memorizing the answers to a
specific math test without learning how to actually do the math.
Overfitting is a problem with many machine learning tasks. Neural networks
are especially prone to overfitting because of the very large number of
tunable parameters. One sign that you might be overfitting is that the training
accuracy is significantly better than the test accuracy. But only one of our
results — the network with four hidden layers — has that feature. That
model’s accuracy is barely better than guessing at random, even during
training. Something more subtle is going on here.
In some ways a neural network is like a game of telephone — each layer only
gets information from the layer right before it. The more layers we add, the
more the original message is changed, which is sometimes a strength and
sometimes a weakness.
If the series of layers allows useful information to accumulate, then stacking
layers can cause higher levels of meaning to build up. One layer finds edges
in an image, another finds edges that make circles, another finds edges that
make lines, another finds combinations of circles and lines, and so on.
On the other hand, if the layers are destructively removing context and useful
information then, like in the game of telephone, the signal deteriorates as it
passes through the layers until all the valuable information is lost.
Imagine you had a hidden layer with only one node — this would force the
network to reduce all the interesting interactions so far into a single value,
then propagate that single value through the subsequent layers of the
network. Information is lost, and such a network would perform terribly.
Another useful way to think about this is in terms of image resolution —
originally we had a “resolution” of 784 pixels but we forced the neural network
to downsample very quickly to a “resolution” of 32 values. These values are no
longer pixels, but combinations of the pixels in the previous layer.
Compressing the resolution once was (evidently) not so bad. But, like
repeatedly saving a JPEG, repeated chains of “low resolution” data transfer
from one layer to the next can result in lower quality output.
Finally, because of the way backpropagation and optimization algorithms
work with neural networks, deeper networks require more training time. It
may be that our model’s 32 node-per-layer architecture just needs to train
longer.
If we let the three-layer network from above train for 40 epochs instead of
five, we get these results:
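A sketch of that run, using the helpers defined above:

# Three 32-node hidden layers, trained eight times longer.
evaluate(create_dense([32] * 3), epochs=40)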
3 hidden layers, 40 training epochs instead of 5. Final test accuracy: .886
The only way to really know which of these factors is at play in your own
models is to design tests and experiment. Always keep in mind that many of
these factors can impact your model at the same time and to different
degrees.
Layer width
Another knob we can turn is the number of nodes in each hidden layer. This is
called the width of the layer. As with adding more layers, making each layer
wider increases the total number of tunable parameters. Making wider layers
tends to scale the number of parameters faster than adding more layers.
Every time we add a single node to layer i, we have to give that new node an
edge to every node in the adjacent layers, i-1 and i+1.
Using the same evaluate and create_dense functions as above, let’s compare
a few neural networks with a single hidden layer using different layer widths.
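A sketch of that comparison:

# One hidden layer, widths from 32 up to 2048 nodes.
for width in [32, 64, 128, 256, 512, 2048]:
    model = create_dense([width])
    evaluate(model)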
Once again, running this code produces some interesting charts:
One hidden layer, 32 nodes. Final test accuracy: .886
One hidden layer, 64 nodes. Final test accuracy: .904
One hidden layer, 128 nodes. Final test accuracy: .916
One hidden layer, 256 nodes. Final test accuracy: .926
One hidden layer, 512 nodes. Final test accuracy: .934
One hidden layer, 2048 nodes. Final test accuracy: .950. This model has a hint of potential overfitting — notice
where the lines cross at the very end of our training period.
This time the performance changes are more intuitive — more nodes in the
hidden layer consistently mapped to better performance on the test data. Our
accuracy improved from ~87% with 32 nodes to ~95% with 2048 nodes. Not
only that, but the accuracy during our final round of training very nearly
predicted the accuracy on the test data — a sign that we are probably not
overfitting.
The cost for this improvement was additional training time. As the number of
tunable parameters ballooned from a meager 25,000 with 32 nodes to over
1.6 million with 2,048 nodes, so did training time. This caused our training
epochs to go from taking about one second each to about 10 seconds each
(on my Macbook Pro — your mileage may vary).
Still, these models are fast to train relative to many state-of-the-art industrial
models. The version of AlphaGo that defeated Lee Sedol trained for 4–6
weeks. OpenAI wrote a blogpost that also helps contextualize the
extraordinary computational resources that go into training state-of-the-art
models. From the article:
“ … the amount of compute used in the largest AI
training runs has been increasing exponentially with
a 3.5 month-doubling time … ”
When we have good data and a good model, there is a strong correlation
between training time and model performance. This is why many state-of-the-art
models train on the order of weeks or months once the authors have
confidence in the model’s ability. It seems that patience is still a virtue.
Combining width and depth
With the intuition that more nodes tends to yield better performance, let’s
revisit the question of adding layers. Recall that stacking layers can either
build up meaningful information or destroy information through
downsampling.
Let’s see what happens when we stack bigger layers. Repeated layers of 32
seemed to degrade the overall performance of our networks — will that still
be true as we stack larger layers?
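A sketch of the sweep (the widths and the epochs-per-depth schedule here are
inferred from the results discussed below, so treat both as assumptions):

# Sweep depth and width together; deeper networks get more epochs.
for width in [32, 128, 512]:
    for layers in range(1, 6):
        model = create_dense([width] * layers)
        evaluate(model, epochs=5 * layers)  # assumed epoch schedule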
I have increased the number of epochs as the depth of the network increases
for the reasons discussed above. The combination of deeper networks, with
more nodes per hidden layer, and the increased training epochs results in
code that takes longer to run. Fortunately, you can check out the Jupyter
Notebook where the results have already been computed.
You can see all the graphs in the Jupyter Notebook, but I want to highlight a
few points of interest.
With this particular training regimen, the single layer 512-nodes-per-layer
network ended up with the highest test accuracy, at 94.7%. In general, the
trend we saw before — where deeper networks perform worse — persists.
However, the differences are pretty small, on the order of 1–2 percentage
points. Also, the graphs suggest the discrepancies might be overcome with
more training for the deeper networks.
For all numbers of nodes-per-layer, the graphs for one, two, and three hidden
layers look pretty standard. More training improves the network, and the rate
of improvement slows down as accuracy rises.
One 32 node layer
Two 128 node layers
Three 512 node layers
But when we get to four and five layers, things start looking bad for the 32-node models:
Four 32 node layers.
Five 32 node layers.
The other two five-layer networks have interesting results as well:
Five 128 node layers.
Five 512 node layers.
Both of these seem to overcome some initial poor performance and look as
though they might be able to continue improving with more training. This may
be a limitation of not having enough data to train the network. As our models
become more complex, and as information about error has to propagate
through more layers, our models can fail to learn — they don’t have enough
information to learn from.
Decreasing the batch size by processing fewer data-points before giving the
model a correction can help. So can increasing the number of epochs, at the
cost of increased training time. Unfortunately, it can be difficult to tell if you
have a junk architecture or just need more data and more training time
without testing your own patience.
For example, this is what happened when I reduced the batch size to 16 (from
128) and trained the 32-node-per-layer network with five hidden layers for 50
epochs (which took about 30 minutes on my hardware):
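As a sketch:

# Five 32-node hidden layers, small batches, ten times the epochs.
evaluate(create_dense([32] * 5), batch_size=16, epochs=50)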
Five 32-node hidden layers, batch size 16, 50 epochs. Final test accuracy: .827
So, it doesn’t look like 32 nodes per layer is downsampling or destroying
information to an extent that this network cannot learn anything. That said,
many of the other networks we’ve built perform better with significantly less
training. Our precious training time would probably be better spent on other
network architectures.
Further steps
While I hope this article was helpful and enlightening, it’s not exhaustive. Even
experts are sometimes confused and unsure about which architectures will
work and which will not. The best way to bolster your intuition is to practice
building and experimenting with neural network architectures for yourself.
In the next article, we’ll explore gradient descent, a cornerstone algorithm for
training neural networks.
If you want some homework — and I know you do — Keras has a number of
fantastic datasets built in just like the MNIST dataset. To ensure you’ve
learned something from this article, consider recreating experiments like the
ones I’ve done above using the Fashion MNIST dataset, which is a bit more
challenging than the regular MNIST dataset.
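Fashion MNIST ships with Keras and loads just like the digits dataset (a
sketch):

# Fashion MNIST: same 28x28 grayscale format and ten classes,
# so the rest of the pipeline above works unchanged.
from tensorflow.keras.datasets import fashion_mnist

(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()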
This article was produced by Teb’s Lab. To learn more about the latest in
technology sign up for the Weekly Lab Report, become a patron on Patreon,
or just follow me here on Medium.