The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Paper Summary

The authors of the paper are Jonathan Frankle and Michael Carbin. Professor Carbin is an associate professor at MIT, where he leads the MIT Programming Systems Group. He has won numerous best-paper awards at renowned conferences and has been awarded multiple research fellowships since 2006. Before this paper was published, he had been working in areas such as sparse deep learning, sparse neural networks, and performance models. These topics align with the research domain covered in this paper, which makes him well qualified to write it. Jonathan Frankle was a PhD student at MIT under Prof. Carbin, where he studied the behavior of practical neural networks and the properties of sparse networks for training them efficiently. His expertise in these areas makes him aptly qualified as well.

The intended audience of the paper is researchers and students who are interested in deep learning and work with deep neural networks. The paper is also relevant to the corporate sector, since many companies that use deep neural networks in their products can benefit from this research.

In 2019, the year the paper was published, research interest in deep learning was at an all-time high. Work in related domains that relied on applications of deep learning had grown notably over the preceding three years, and industry had started building products around deep neural networks. This led to increased efforts to achieve better accuracy with lower memory usage. Previous work had developed pruning methods that produce a smaller network with accuracy similar to the original, but only after the full network had been trained. This paper was a major milestone in optimizing deep neural network models because it overcame a significant blocker faced by previous researchers: it presented a method to train pruned neural networks from scratch that can reach accuracy equal to or better than the original network in fewer iterations. This research could be a breakthrough for companies, students, and researchers working with deep neural networks.

The paper's main contribution is to reduce the computational requirements of neural networks and to find more general pruned networks by extracting a sparse representation from a dense network, overcoming the challenge faced by researchers who tried to train pruned networks from scratch to accuracies equivalent to the original network. The paper also uses iterative pruning rather than the traditionally used one-shot pruning, and iterative pruning gave better accuracies than the previous methods. An important insight from the paper is the importance of weight initialization in the pruned networks, as opposed to the random reinitialization used previously. The paper proposes the lottery ticket hypothesis, which states that a randomly initialized dense neural network contains sparse subnetworks (winning tickets) that, when trained independently from the same weight initialization as the dense network, can perform as well as or better than the original network in the same or fewer iterations.
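For concreteness, the winning-ticket search described by the hypothesis can be sketched roughly as below. This is a minimal illustration and not the authors' code: the `train_fn` routine, the per-round pruning rate, and the number of rounds are assumed placeholders, and details such as pruning output-layer connections at a reduced rate are omitted.

```python
import copy
import torch

def find_winning_ticket(model, train_fn, rounds=5, prune_rate=0.2):
    """Sketch of iterative magnitude pruning with rewinding to the
    original initialization (theta_0), per the hypothesis above."""
    init_state = copy.deepcopy(model.state_dict())     # save theta_0
    masks = {name: torch.ones_like(p)                  # mask weight tensors only, not biases
             for name, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        train_fn(model, masks)                         # train for j iterations, keeping masked weights at zero
        for name, p in model.named_parameters():
            if name not in masks:
                continue
            alive = p.data[masks[name].bool()].abs()   # magnitudes of surviving weights
            k = int(prune_rate * alive.numel())        # prune the lowest-magnitude fraction
            if k == 0:
                continue
            threshold = alive.sort().values[k - 1]
            masks[name][p.data.abs() <= threshold] = 0.0
        model.load_state_dict(init_state)              # rewind to the original initialization ...
        for name, p in model.named_parameters():       # ... and re-apply the pruning mask
            if name in masks:
                p.data.mul_(masks[name])
    return model, masks
```

The key step, and the paper's central insight, is the rewind: the surviving weights are reset to their original initial values rather than being randomly reinitialized before retraining.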
The paper showcases the benefits of iterative pruning of neural networks compared to the one-shot pruning used in previous research. It also claims to make progress toward more generalized deep neural network models, as opposed to overfitted architectures, and that test accuracy increases with pruning until the threshold at which the network is pruned to 13.5% of its original size. The optimal pruned networks, referred to as winning tickets, are obtained for a fully connected architecture on MNIST and convolutional architectures on CIFAR10, across optimization strategies such as Adam, momentum, and SGD, and with techniques such as batch normalization, weight decay, residual connections, and dropout. The authors prune the connections with the lowest-magnitude weights, and connections to the output layer are pruned at half the rate of the rest of the network. The strategy for finding a winning ticket is sensitive to the learning rate, so a learning-rate warmup is used to find pruned networks at high learning rates.

Pm denotes the percentage of weights that remain after pruning. From the plot of test accuracy against training iterations for different values of Pm, it can be concluded that, during the first iterations of training, the more heavily pruned networks learn faster and reach higher test accuracies. The winning tickets learn faster as Pm drops from 100% to 21%, after which further pruning slows learning. Test accuracy increases up to a threshold and then decreases, forming an Occam's hill. A plot of training accuracy after 50,000 iterations against the percentage of weights remaining after pruning shows that the gap between training and test accuracy is smallest for the winning tickets, which implies that the optimal pruned networks generalize better.

The paper successfully shows that highly overparameterized networks can be pruned, reset to their original initialization, and retrained, up to a certain level of sparsity. It also shows that iterative pruning removes weights from a dense network more effectively than one-shot pruning, and it supports the claim that the winning ticket generalizes better than the original network. The paper therefore provides support for the lottery ticket hypothesis. However, the results are obtained on small datasets, and only vision-centric classification tasks are considered. The iterative pruning method is computationally very expensive, and thus not very practical for day-to-day use, and a learning-rate warmup is necessary to train deeper networks such as Resnet-18 and VGG-19. These limitations open up potential enhancements and point to directions that future research can take.

The paper assesses its effectiveness by the percentage of the network it can prune while still matching the accuracy of the original network. This measure of success led to insights about the threshold to which deep networks can be pruned while achieving accuracy equivalent to the larger architecture. The paper shows the importance of weight initialization in the pruned networks: starting the pruned network from the same initial weights the original network was initialized with makes it possible to train the pruned network in isolation to accuracy equivalent to or better than the original network.
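As noted above, a learning-rate warmup was needed to find winning tickets in the deeper networks (VGG-19, Resnet-18) at high learning rates. A minimal sketch of a linear warmup schedule is shown below; the placeholder model, the warmup length, and the peak learning rate are illustrative assumptions, not the paper's exact settings.

```python
import torch

# Sketch of a linear learning-rate warmup over the first warmup_iters steps.
model = torch.nn.Linear(10, 10)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

warmup_iters = 10_000                                # assumed warmup length
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_iters),
)

# The training loop would call scheduler.step() once per iteration:
# for step in range(total_iters):
#     train_one_iteration(model, optimizer)
#     scheduler.step()
```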
Reusing the original initialization in this way will lead to improvements in training performance, since the pruned networks are computationally easier to train, and one can design better networks based on the findings of the paper. The paper also improves one's understanding of the theory behind neural networks. It explains the advantages of iterative pruning over one-shot pruning for reducing the size of a deep network, and it walks through the process and steps for obtaining more generalized and computationally efficient neural networks for deep learning problems. The paper also demonstrates the robustness of deep learning models to the removal of extra parameters. The more general and sparser models that can be obtained by following this paper should be better suited to problems in fields like NLP and computer vision, where the test data can differ remarkably from the training data.
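To make the contrast between one-shot and iterative pruning mentioned above concrete: one-shot pruning removes the full fraction of weights in a single pass after training, while iterative pruning removes a smaller fraction per round and retrains repeatedly, which is gentler on the network but multiplies the training cost. A rough calculation with illustrative numbers:

```python
# Fraction of weights remaining after n rounds of iterative pruning at rate p
# per round, versus a single one-shot prune to the same final sparsity.
prune_rate_per_round = 0.2           # remove 20% of the remaining weights each round
rounds = 10

remaining = (1 - prune_rate_per_round) ** rounds
print(f"remaining after {rounds} rounds: {remaining:.1%}")     # ~10.7%
print(f"equivalent one-shot prune rate: {1 - remaining:.1%}")  # ~89.3%
# Reaching ~10% remaining weights iteratively requires ~10x the training runs
# of a one-shot prune, which is the computational drawback noted above.
```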