The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Paper Summary
The authors of the paper are Jonathan Frankle and Michael Carbin. Professor Carbin is
an associate professor at MIT, where he leads the MIT Programming Systems Group. He has
won numerous best-paper awards at renowned conferences and has been awarded
multiple research fellowships since 2006. Before this paper was published, he had been
working in areas such as sparse deep learning, sparse neural networks, and performance
models. These topics align with the research domain covered in this paper, which makes
him well qualified to write it. Jonathan Frankle was a PhD student at MIT under
Prof. Carbin, where he studied the behavior of practical neural networks and the properties
of sparse networks that allow them to be trained efficiently. His expertise in these research
areas makes him aptly qualified for this paper.
The audience of the paper is researchers and students who are interested in the field
of deep learning and work with deep neural networks. The paper is also important
to the corporate sector, as many companies that use deep neural networks in their
products can benefit from this research.
The paper was published in 2019, when research interest in deep learning was at an
all-time high. Work in related domains that relied on applications of deep learning had
grown notably over the preceding three years, and even the industrial sector and corporate
companies had started building products based on deep neural networks. This led to
increased effort to achieve better accuracy with lower memory usage in neural networks.
Previous work had sought pruning methods that yield a smaller network with accuracy
similar to the original network, but this was possible only after the full network had been
trained. This paper was a major milestone in the effort to optimize deep neural network
models because it overcame a significant blocker faced by previous researchers: it
presented a method to train pruned neural networks from scratch that can reach accuracy
equal to or better than the original network in fewer iterations. This research could
potentially be a breakthrough for companies, students, and researchers who work with
deep neural networks.
This paper primarily overcomes a challenge faced by researchers who tried to train
pruned neural networks from scratch to obtain accuracy equivalent to the original
network. Its main contribution is to reduce the computational requirements of neural
networks and to find a more general pruned network by extracting a sparse subnetwork
from the dense network. The paper also uses iterative pruning rather than the
traditionally used one-shot pruning; iterative pruning gave better accuracy than previous
methods, as the arithmetic sketch below illustrates. Another important insight from the
paper is the importance of weight initialization in the pruned networks, as opposed to the
random reinitialization used in previous work.
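To make the contrast concrete, here is a small, purely illustrative Python sketch of the arithmetic; the 20% target sparsity and the eight rounds are hypothetical choices, not numbers from the paper.

```python
# Illustrative arithmetic only (the 20% target and 8 rounds are hypothetical):
# one-shot pruning removes the full fraction of weights in a single step after
# training, while iterative pruning spreads the removal over several
# train-prune-reset rounds, removing a smaller fraction of the survivors each time.
target_remaining = 0.20                       # keep 20% of the original weights
rounds = 8                                    # number of iterative pruning rounds

one_shot_prune_rate = 1.0 - target_remaining
per_round_keep = target_remaining ** (1.0 / rounds)
per_round_prune_rate = 1.0 - per_round_keep

print(f"one-shot : prune {one_shot_prune_rate:.1%} of weights once")
print(f"iterative: prune {per_round_prune_rate:.1%} of surviving weights "
      f"in each of {rounds} rounds -> {per_round_keep ** rounds:.1%} remaining")
```

Both schemes end at the same sparsity; the difference the paper emphasizes is that the iterative schedule re-trains and re-evaluates weight magnitudes between rounds.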
The paper proposes the lottery ticket hypothesis, which states that a randomly initialized
dense neural network contains sparse subnetworks (winning tickets) with the same weight
initialization as the dense network which, when trained independently, can perform as
well as or better than the original network in the same or a smaller number of iterations;
a formal restatement is given below. The paper demonstrates the benefits of iterative
pruning compared with the one-shot pruning used in previous research. It also claims to
take a step toward more generalized deep neural network models, as opposed to overfitted
architectures, and claims that test accuracy increases with pruning until a threshold at
which the network is pruned to about 13.5% of its original size.
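In the paper's notation, the hypothesis can be restated formally along the following lines, where m is the binary pruning mask, θ0 the original initialization, and a, j the dense network's test accuracy and training iterations:

```latex
% Formal restatement of the hypothesis in the paper's notation.
% A dense network f(x; \theta), initialized as \theta_0 \sim \mathcal{D}_\theta,
% reaches test accuracy a at iteration j; a subnetwork is f(x; m \odot \theta_0)
% for a pruning mask m, reaching accuracy a' at iteration j'.
\exists\, m \in \{0,1\}^{|\theta|} \ \text{such that} \quad
  j' \le j \ (\text{commensurate training time}), \quad
  a' \ge a \ (\text{commensurate accuracy}), \quad
  \lVert m \rVert_0 \ll |\theta| \ (\text{fewer parameters}).
```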
The optimal pruned networks, referred to as winning tickets, are obtained from a fully
connected architecture for MNIST and convolutional architectures for the CIFAR10 dataset,
across optimization strategies such as Adam, momentum, and SGD, and with techniques
such as batch normalization, weight decay, residual connections, and dropout. The
authors prune the connections with the lowest weight magnitudes, and connections to the
output layer are pruned at half the rate of the rest of the network. The pruning strategy
for finding a winning ticket is sensitive to the learning rate, so warmup is used to find
pruned networks at high learning rates. Pm denotes the sparsity of the pruning mask,
that is, the percentage of weights that remain after pruning. From the plots of test
accuracy against training iterations for different values of Pm, it can be concluded that,
for the first few iterations, more heavily pruned networks learn faster and reach higher
test accuracy. It is also observed that the optimal pruned network learns more quickly as
Pm drops from 100% to 21%, after which further reductions in Pm slow learning down.
Test accuracy increases up to a threshold and then decreases, forming an Occam's Hill.
A plot comparing training accuracy at the end of 50,000 iterations with the percentage of
weights remaining after pruning suggests that, for the winning tickets, the gap between
training and test accuracy is smallest. This implies that the optimal pruned networks
generalize well. A minimal code sketch of this iterative magnitude-pruning procedure
follows.
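The following is a minimal, hedged sketch of that procedure in PyTorch. It assumes a tiny fully connected network standing in for Lenet, random data standing in for MNIST, a 20% per-round pruning rate (10% for the output layer, i.e., half the rate), and five rounds; none of these choices reproduce the paper's exact setup, and the training loop is only a placeholder.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for a Lenet-style 784 -> 300 -> 10 fully connected network.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
theta0 = copy.deepcopy(model.state_dict())       # theta_0: the original initialization
masks = {name: torch.ones_like(p)
         for name, p in model.named_parameters() if p.dim() > 1}

def train(model, masks, steps=100):
    """Placeholder training on random data standing in for MNIST."""
    opt = torch.optim.Adam(model.parameters(), lr=1.2e-3)
    for _ in range(steps):
        x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                    # keep pruned connections at zero
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])

for round_idx in range(5):                       # iterative train-prune-reset rounds
    train(model, masks)
    for name, p in model.named_parameters():     # prune lowest-magnitude survivors
        if name in masks:
            # Output layer ("2.weight" in this Sequential) pruned at half the rate.
            rate = 0.10 if name.startswith("2.") else 0.20
            surviving = p[masks[name].bool()].abs()
            threshold = torch.quantile(surviving, rate)
            masks[name] = masks[name] * (p.abs() > threshold).float()
    model.load_state_dict(theta0)                # reset surviving weights to theta_0
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])
    p_m = (sum(m.sum().item() for m in masks.values())
           / sum(m.numel() for m in masks.values()))
    print(f"round {round_idx + 1}: Pm = {p_m:.1%} of weights remaining")
```

The key design choice, per the summary above, is the reset step: after each pruning round the surviving weights go back to their original initialization rather than keeping their trained values or being redrawn at random.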
The paper successfully showed that highly overparameterized networks can be effectively
pruned, reinitialized, and retrained up to a certain level of sparsity. It also showed that
iterative pruning is a better way to prune weights from a dense network than one-shot
pruning, and it supported the claim that the winning ticket generalizes better than the
original network. The paper thus provides support for the lottery ticket hypothesis.
However, the results are obtained on small datasets, and only vision-centric classification
tasks are considered. The iterative pruning method is computationally very expensive,
which makes it impractical for everyday networks. In addition, learning rate warmup is
necessary to train deeper networks such as ResNet-18 and VGG-19. These limitations open
up scope for potential enhancements and point to directions future research can take.
The paper assesses its effectiveness by the percentage of the network that can be pruned
while still maintaining the accuracy of the original network. This way of measuring success
led to insights about the threshold up to which deep networks can be pruned while
achieving accuracy equivalent to the larger architecture.
The paper shows the importance of weight initialization in pruned neural networks.
Starting the pruned network from the same initial weights as the original dense network
makes it possible to train the pruned network in isolation to an accuracy equal to or
better than that of the original network; a sketch contrasting this reset with random
reinitialization follows below. This leads to improvements in training performance, since
the pruned networks are computationally cheaper to train, and one can design better
networks based on the learnings from this research. The paper also improves one's
understanding of the theory behind neural networks: it explains the advantages of using
iterative pruning instead of one-shot pruning to reduce the size of a deep network, and it
walks through the process and steps needed to obtain more generalized and
computationally efficient neural networks for deep learning problems. The paper also
demonstrates the robustness of deep learning models to the removal of extra parameters.
The more general and sparser models obtained this way will be better suited to solving
problems in fields such as NLP and computer vision, where the test data can be
remarkably different from the training data.
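As a hedged illustration of this point, the sketch below keeps a winning-ticket mask fixed and contrasts resetting the surviving weights to the original initialization with drawing a fresh random initialization, which is the control the paper uses to isolate the role of initialization. The names model, masks, and theta0 are assumed to come from an iterative-pruning run like the earlier sketch.

```python
# Hedged sketch: same mask, two different initializations of the survivors.
# `model`, `masks`, and `theta0` are assumed from the earlier pruning sketch.
import copy
import torch

def apply_ticket(network, masks, init_state):
    """Load an initialization and zero out the pruned connections."""
    network.load_state_dict(init_state)
    with torch.no_grad():
        for name, param in network.named_parameters():
            if name in masks:
                param.mul_(masks[name])

# Winning ticket: same mask, same initial weights as the dense network.
winning_ticket = copy.deepcopy(model)
apply_ticket(winning_ticket, masks, theta0)

# Control: same mask, but freshly drawn random weights.
reinit_control = copy.deepcopy(model)
for module in reinit_control.modules():
    if hasattr(module, "reset_parameters"):
        module.reset_parameters()
apply_ticket(reinit_control, masks, reinit_control.state_dict())

# Training both subnetworks in isolation: the paper reports that the winning
# ticket matches or exceeds the dense network, while the randomly reinitialized
# subnetwork learns more slowly and reaches lower accuracy.
```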