Reevaluating The All Convolutional Net

Jonas Wechsler, Lars Vagnes, Nicholas Jang

DD2424 Deep Learning

Abstract. We analyze and reproduce a network composed of all convolutional layers proposed by Springenberg et al. and create a similar model that trains much faster and achieves comparable results. First, we assess the proposed training method and remove techniques that increase training time without providing benefit. Second, we gauge the efficacy of different initialization and training techniques for accelerating training. Third, we use the decreased training time to conduct a thorough hyper-parameter search. We decrease the training time by 90% and achieve a final validation accuracy of 91.14%.

1 Introduction

The most prevalent CNN architecture in the literature consists of alternating convolutional and pooling layers. In the last several years much effort has been spent on further improving this architecture by using more complex activation functions, unsupervised pre-training, and improved methods of regularization. However, Springenberg et al. demonstrate that even with these proposed improvements, the standard CNN architecture does not exhibit superior performance on image classification tasks compared to the simpler All-CNN architecture [1]. In fact, All-CNN, the architecture proposed by Springenberg et al., performs on par or better. This is significant because the All-CNN model implements dimensionality reduction between layers not by pooling, but by convolutional layers with stride 2. The All-CNN models achieve some of the lowest error rates on the CIFAR-10 dataset among all published models. Springenberg et al. claim that the All-CNN-C model achieves a 9.08% error rate without data augmentation and a 7.25% error rate with data augmentation after 350 epochs of training.

First, we attempt to replicate their results based on the architecture and hyper-parameters described in the literature. Second, we decrease the training time required to reach the <10% error rate cited in the paper. The All-CNN-C model, when trained as described in the paper, takes 12 hours or longer on a modern GPU. The paper emphasizes that the authors did not do a thorough search for parameters such as the learning rate, and that a more extensive search could yield better results. We decrease the time required to train the model in three ways. First, we remove training techniques that increase training time but offer negligible benefit, in particular ZCA whitening and a dropout layer applied to the input. Second, we integrate other techniques not described in the paper that accelerate training, and compare different initialization techniques. Third, we decrease the training time to 100 epochs and use the time savings to conduct a more thorough parameter search.
2 Background

2.1 Striving for Simplicity: The All Convolutional Net by Springenberg et al.

Springenberg et al. propose three different all-convolutional "base nets", described in Table 1, as well as three additional variations of each network. In this paper we attempt to verify and augment the work on the All-CNN-C model (described in Table 2).

Springenberg et al.'s training scheme for the All-CNN-C model is as follows. They train the model using stochastic gradient descent with a momentum of 0.9. The learning rate decreases by a factor of 0.1 after a set number of epochs has elapsed: they train the first 200 epochs at the initial learning rate and then decrease the rate every 50 epochs, for a total of 350 epochs. They do not specify an initial learning rate for this model, but the learning rate is chosen from the set [0.25, 0.1, 0.05, 0.01]. A weight decay of 0.001 was also applied. Data augmentation was done by randomly flipping images horizontally and applying a random translation of up to 5 px in all directions. Images were also whitened and contrast normalized. Neither the batch size nor the number of iterations is specified. The network had dropout applied to the input and after each downsampling (stride-2) layer, as described in Table 2.

For contrast normalization, we find the 2nd and 98th percentile of intensities for each of the three color channels, and then stretch the intensities so that those percentiles become the minimum and maximum intensities of the image. For an image with values in the range [0, 1], a 2nd percentile ℓ, and a 98th percentile h, the following is applied to each pixel p of each channel:

    p_new = (p − ℓ) / (h − ℓ)

2.2 All You Need Is a Good Init by Dmytro Mishkin et al.

In the process of training deep CNNs, problems can occur with exploding or (more rarely) vanishing gradients. Others have developed a wide variety of end-to-end training practices to avoid these phenomena, but they lack standardization. Mishkin et al. propose a novel approach to network initialization, Layer-Sequential Unit-Variance initialization (LSUV), and demonstrate that state-of-the-art performance can be achieved on very deep and thin CNNs through this simple initialization technique [2]. The technique is an extension of the orthonormal initialization proposed by Saxe et al. [3]. The LSUV algorithm consists of two main steps: first, initialize all weights with Gaussian noise with unit variance and use SVD or QR decomposition to transform the weights into an orthonormal basis. Second, estimate the variance of each individual layer's output and force it to 1 by scaling the weights accordingly.
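As a minimal sketch of this procedure (assuming a Keras-style model; the function name `lsuv_init`, the tolerance, and the iteration cap are our own illustrative choices and are not specified in [2]), the variance-scaling step can be written as follows, with the kernels assumed to have been created by an orthogonal initializer:

```python
import numpy as np
from tensorflow import keras

def lsuv_init(model, batch, tol=0.02, max_iters=10):
    """Sketch of Layer-Sequential Unit-Variance initialization.

    `model` is a Keras model whose Conv2D/Dense kernels were created with an
    orthogonal initializer; `batch` is a representative mini-batch of inputs.
    Layers are visited in order and their kernels rescaled until the variance
    of the layer output is approximately 1.
    """
    for layer in model.layers:
        if not isinstance(layer, (keras.layers.Conv2D, keras.layers.Dense)):
            continue
        # Sub-model exposing this layer's output on the given batch.
        probe = keras.Model(model.input, layer.output)
        for _ in range(max_iters):
            var = np.var(probe.predict(batch, verbose=0))
            if abs(var - 1.0) < tol:
                break
            weights = layer.get_weights()
            weights[0] /= np.sqrt(var)  # rescale the kernel only, leave bias alone
            layer.set_weights(weights)
    return model
```

In practice the procedure is run once on a representative mini-batch, before compiling and training the network.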
2.3 Xavier Initialization, He Initialization, and Batch Normalization

In our approach, we compare LSUV initialization with other initializations, namely Xavier initialization combined with batch normalization, and He normal and He uniform initialization. Xavier initialization, from Glorot and Bengio [4], combats the vanishing and exploding gradient problem by initializing weights from a uniform distribution scaled so that the variance of activations remains the same up and down the network. He et al. propose an initialization based on Xavier initialization but adapted for a non-linear activation function, namely ReLU [5]. We explore both the He uniform and He normal initializations, drawn from a uniform and a normal distribution respectively. Ioffe and Szegedy [6] introduce batch normalization, which can accelerate deep network training and mitigate bad initializations. By scaling each mini-batch with regard to its mean and variance, batch normalization acts like a regularizer and allows more reliable training.

Table 1. The three base networks proposed by Springenberg et al.

  A                       B                       C
  --------------------------------------------------------------------
                      Input 32 × 32 RGB image
  5 × 5 conv. 96 ReLU     5 × 5 conv. 96 ReLU     3 × 3 conv. 96 ReLU
                          1 × 1 conv. 96 ReLU     3 × 3 conv. 96 ReLU
                      3 × 3 max-pooling stride 2
  5 × 5 conv. 192 ReLU    5 × 5 conv. 192 ReLU    3 × 3 conv. 192 ReLU
                          1 × 1 conv. 192 ReLU    3 × 3 conv. 192 ReLU
                      3 × 3 max-pooling stride 2
                      3 × 3 conv. 192 ReLU
                      1 × 1 conv. 192 ReLU
                      1 × 1 conv. 10 ReLU
             Global averaging over 6 × 6 spatial dimensions
                      10 or 100-way softmax

3 Approach

In order to verify the results of the All-CNN-C model, we trained for 350 epochs with learning rates of 0.01, 0.05, and 0.1. No batch size was specified in the paper, so we used a batch size of 64. The paper states that a ZCA whitening filter was used. We first applied the filter after the random translations during training, which dramatically increased training time. We found that applying the filter to all data points before training made no difference in accuracy, and that furthermore omitting the filter altogether made little difference in accuracy, so the filter was omitted from our experiments.

Table 2. The All-CNN-C model proposed by Springenberg et al. The 20% dropout after the input layer is omitted in our model.

  All-CNN-C
  ----------------------------------------------
  Input 32 × 32 RGB image
  20% dropout*
  3 × 3 conv. 96 ReLU
  3 × 3 conv. 96 ReLU
  3 × 3 conv. 96 ReLU with 2 × 2 stride
  50% dropout
  3 × 3 conv. 192 ReLU
  3 × 3 conv. 192 ReLU
  3 × 3 conv. 192 ReLU with 2 × 2 stride
  50% dropout
  3 × 3 conv. 192 ReLU
  1 × 1 conv. 192 ReLU
  1 × 1 conv. 10 ReLU
  Global averaging over 6 × 6 spatial dimensions
  10 or 100-way softmax

We conducted a parameter search for the optimal learning rate and decay rate for the model. First, we searched for the learning rate from 0.005 to 0.1, using 20 epochs per run; for these tests we kept the decay rate fixed at 1e-5. We then searched for the decay rate from 1e0 to 1e-9, using 100 epochs per run. We used a batch size of 32 for all our tests. As with the All-CNN-C model, we did not include ZCA whitening because it did not provide a significant boost in accuracy. We found that omitting the first dropout layer increased accuracy in our model. This can be attributed to the large size of the network relative to the training data, coupled with the data augmentation applied to its input, which together prevent the network from overfitting even without the additional dropout. For both the All-CNN-C model and our parameter search we use Xavier initialization without batch normalization.
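As a point of reference, a minimal Keras-style sketch of our All-CNN-C variant (Table 2 with the input dropout omitted) is given below. The builder name `build_all_cnn_c`, the `padding='same'` choice, and the optional `batch_norm` flag are illustrative additions rather than a record of our exact code, so the spatial size before global averaging may differ slightly from the 6 × 6 of the original paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_all_cnn_c(kernel_initializer='glorot_uniform', batch_norm=False,
                    n_classes=10):
    """Our All-CNN-C variant: Table 2 without the 20% input dropout.

    With batch_norm=True a BatchNormalization layer is inserted before
    every ReLU, as in the batch-normalized runs described below.
    """
    def conv(x, filters, kernel_size, strides=1):
        x = layers.Conv2D(filters, kernel_size, strides=strides,
                          padding='same',
                          kernel_initializer=kernel_initializer)(x)
        if batch_norm:
            x = layers.BatchNormalization()(x)
        return layers.Activation('relu')(x)

    inputs = keras.Input(shape=(32, 32, 3))
    x = conv(inputs, 96, 3)
    x = conv(x, 96, 3)
    x = conv(x, 96, 3, strides=2)        # stride-2 conv replaces pooling
    x = layers.Dropout(0.5)(x)
    x = conv(x, 192, 3)
    x = conv(x, 192, 3)
    x = conv(x, 192, 3, strides=2)
    x = layers.Dropout(0.5)(x)
    x = conv(x, 192, 3)
    x = conv(x, 192, 1)
    x = conv(x, n_classes, 1)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Activation('softmax')(x)
    return keras.Model(inputs, outputs)
```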
We then tested the initialization proposed in the LSUV paper with the default All-CNN-C model and compared the results with other initializations. We first tested the LSUV initialization, then the He normal and He uniform initializations, and finally Xavier initialization with batch normalization. To keep our results consistent, we tested each over 100 epochs with the All-CNN-C model, using a learning rate of 0.01 and a decay of 1e-6. For the batch normalization model, we added a batch normalization layer before the ReLU activation function of each convolutional layer. After finding the best-performing parameters and initialization method, we ran the model with both.

4 Experiments, Observations, Conclusions

We found it difficult to reproduce the results of the All-CNN-C model as given by Springenberg et al., because some key hyper-parameters were not specified in the paper. In particular, the model would have performed much better with a smaller decay rate. The decay is applied per iteration, but the batch size was not specified, so the number of iterations we use over training may differ widely from the original model. The size of the network makes it impractical to search for the decay rate or batch size, and the model may also perform better with a larger batch size. When the model was run with large learning rates (0.1, 0.25), it made no improvements in cost or accuracy in the first 100 epochs, so those results were omitted. Model accuracy changed negligibly after 160 epochs and stabilized entirely after 200 epochs. For the All-CNN-C model, a larger learning rate (0.05) performs better, as shown in Figure 1. However, for our model, we found that a smaller learning rate (0.015) performs better, which could also indicate that the decay rate is too large for the All-CNN-C model. Note that our architecture does differ in that it leaves out the first dropout layer on the input. The best-performing version of All-CNN-C reached 84.7% accuracy with a learning rate of 0.05, far below the cited accuracy of 92.75%.

We compared learning rates over 20 epochs while keeping the decay fixed and small; because the model runs for only a small number of epochs, a change in decay rate should not change the results. The parameter search yielded a concave curve that peaks at a learning rate of 0.01 or 0.015, with accuracy quickly falling off at smaller learning rates. The dropout in our model prevents it from overfitting, which means the model with the highest training accuracy should perform best and have the least noise, so we used a learning rate of 0.015, which had the highest training accuracy.

We then compared decay rates on the 100 epoch model with learning rates of 0.012, 0.015, and 0.017. We used these three learning rates for two reasons: to see whether we needed to conduct a finer search for the learning rate, and to see whether a changing decay rate changed the effect of the learning rate. We found that the learning rate of 0.015 still performed best, and that a changing decay rate did not change the optimal learning rate. We found that decay rates of 1e-4 and 1e-5 perform best, so we used a decay rate of 3e-5. We found that the decay rate used by Springenberg et al. is much too large for our model, even though our model trains for fewer epochs, which again suggests that the decay rate was too large for the All-CNN-C model.
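To make the two-stage search concrete, the following is a rough sketch assuming the hypothetical `build_all_cnn_c` builder from Section 3 and standard Keras utilities; the grid values shown are a small illustrative subset of the ranges above, data augmentation is omitted for brevity, and `InverseTimeDecay` stands in for the per-iteration learning-rate decay we refer to as the decay rate.

```python
from tensorflow import keras

# CIFAR-10, scaled to [0, 1].
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

def run(lr, decay, epochs):
    model = build_all_cnn_c()  # hypothetical builder, see Section 3
    # Per-iteration decay lr_t = lr / (1 + decay * t), i.e. the "decay rate".
    schedule = keras.optimizers.schedules.InverseTimeDecay(lr, 1, decay)
    model.compile(optimizer=keras.optimizers.SGD(schedule, momentum=0.9),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    hist = model.fit(x_train, y_train, batch_size=32, epochs=epochs,
                     validation_data=(x_test, y_test), verbose=0)
    return max(hist.history['val_accuracy'])

# Stage 1: learning-rate sweep over 20-epoch runs, decay fixed at 1e-5.
lr_scores = {lr: run(lr, 1e-5, 20) for lr in [0.005, 0.01, 0.015, 0.05, 0.1]}
# Stage 2: decay sweep over 100-epoch runs at the chosen learning rate.
decay_scores = {d: run(0.015, d, 100) for d in [1e-7, 1e-5, 1e-4, 1e-2, 1e0]}
```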
Fig. 1. The after-epoch accuracy of the All-CNN-C model over 350 epochs, trained with the changing learning rate specified in Section 2 (train and test curves for lr = 0.05 and lr = 0.01).

Fig. 2. Accuracy per learning rate over 20-epoch runs. Training and testing accuracy are taken as the individual maximum over the run, so testing accuracy may surpass training accuracy due to noise.

Fig. 3. Accuracy per decay rate over 100-epoch runs (decay on a logarithmic scale from 1e-9 to 1e-1).

Table 3. CIFAR-10 classification accuracy.

  Method                             Initialization type     Accuracy
  --------------------------------------------------------------------
  100 epoch model w/ param search    Xavier w/ batch norm    91.14%
  100 epoch model w/o param search   Xavier w/ batch norm    91.13%
  100 epoch model w/o param search   He normal               88.23%
  100 epoch model w/o param search   LSUV                    87.68%
  100 epoch model w/o param search   He uniform              87.42%
  100 epoch model w/ param search    Xavier w/o batch norm   90%
  All-CNN-C (ours)                   Xavier w/o batch norm   84.7%
  All-CNN-C (paper)                  unspecified             92.75%

We tested the LSUV, He normal, and He uniform initializations along with Xavier initialization with batch normalization. In Figures 4 through 8 we show the accuracy and loss of our All-CNN-C models over 100 epochs. The final test accuracies were 87.68%, 88.23%, 87.42%, and 91.13% for the LSUV, He normal, He uniform, and batch normalization models respectively. The final validation losses were 0.466, 0.480, 0.474, and 0.338 for the LSUV, He normal, He uniform, and batch normalization models respectively. Overall, the batch normalization model had significantly higher test and training accuracies and lower validation and training losses over 100 epochs. In addition, the batch normalization model consistently outperformed the other models throughout the 100 epochs. We find that the LSUV initialization did not enhance the performance of our 100 epoch network; in fact, the He normal, He uniform, and LSUV initialized models all performed fairly similarly.

Fig. 4. The test accuracy of the 100 epoch model with LSUV initialization, He normal initialization, He uniform initialization, and Xavier initialization with batch normalization.
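For reference, the four configurations compared in Fig. 4 can be expressed against the hypothetical `build_all_cnn_c` and `lsuv_init` sketches given earlier; the initializer strings are standard Keras identifiers, and the LSUV batch of 256 training images is an illustrative choice.

```python
# Four initialization variants of the 100 epoch model (lr 0.01, decay 1e-6).
variants = {
    'He normal':           dict(kernel_initializer='he_normal', batch_norm=False),
    'He uniform':          dict(kernel_initializer='he_uniform', batch_norm=False),
    'Xavier + batch norm': dict(kernel_initializer='glorot_uniform', batch_norm=True),
    'LSUV':                dict(kernel_initializer='orthogonal', batch_norm=False),
}

models = {name: build_all_cnn_c(**cfg) for name, cfg in variants.items()}
# LSUV additionally rescales the orthogonal kernels on a data batch.
models['LSUV'] = lsuv_init(models['LSUV'], x_train[:256])
```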
Running our 100 epoch model with Xavier initialization, batch normalization, and the hyper-parameters from our search yields a validation accuracy of 91.14% and a training accuracy of 96.41%. Training the 350 epoch model took an average of 143 seconds per epoch, while the 100 epoch model took an average of 50 seconds. Both were trained on a modern GPU with the same hardware. The original model took about 14 hours to train, while ours took less than an hour and a half, a 90% decrease in training time.

Fig. 5. The training accuracy of the 100 epoch model with LSUV initialization, He normal initialization, He uniform initialization, and Xavier initialization with batch normalization.

Fig. 6. The test and training accuracy of the 100 epoch model with learning rate 0.015, decay rate 3e-5, and Xavier initialization with batch norm.

We draw a few key conclusions from our study:

– Some methods that have been shown to increase performance for other architectures may not increase performance for a given architecture, while still significantly increasing training time.
– With sufficiently large data sets and data augmentation, input dropout can be omitted.
– It is possible to achieve comparable results with a shorter training time and a more thorough analysis of the training method and hyper-parameters.
– Sufficient model specification is critical to reproducing a model.

References

1. Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, Martin Riedmiller: Striving for simplicity: The all convolutional net (2015)
2. Dmytro Mishkin, Jiri Matas: All you need is a good init (2016)
3. Andrew M. Saxe, James L. McClelland, Surya Ganguli: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks (2013)
4. Xavier Glorot, Yoshua Bengio: Understanding the difficulty of training deep feedforward neural networks (2010)
5. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification (2015)
6. Sergey Ioffe, Christian Szegedy: Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015)