practical_advice

Practical Advice For Building Neural Nets
Deep Learning and Neural Nets
Spring 2015

Day’s Agenda
1. Celebrity guests
2. Discuss issues and observations from homework
3. Catrin Mills on climate change in the Arctic
4. Practical issues in building nets

Homework Discussion
- Lei’s problem with the perceptron not converging
  - demo
- Manjunath’s questions about priors
  - Why does the balance of data make a difference in what is learned? (I believe it will for Assignment 2)
  - If I’d told you the priors should be 50/50, what would you have to do differently?
  - What’s the relationship between statistics of the training set and the test set?
  - Suppose we have a classifier that outputs a probability (not just 0 or 1). If you know the priors in the training set and you know the priors in the test set to be different, how do you repair the classifier to accommodate the mismatch between training and testing priors?

Weight Initialization
- Break symmetries
  - use small random values
- Weight magnitudes should depend on fan-in
- What Mike has always done
  - Draw all weights feeding into neuron j (including bias) via $w_{ji} \sim \mathrm{Normal}(0, 1)$
  - Normalize weights such that $\sum_i |w_{ji}| = 2$, i.e., $w_{ji} \leftarrow 2\, w_{ji} / \sum_i |w_{ji}|$
  - Works well for logistic units; to be determined for ReLU units
- A perhaps better idea (due to Kelvin Wagner)
  - Draw all weights feeding into neuron j (including bias) via $w_{ji} \sim \mathrm{Normal}(0, 1)$
  - If input activities lie in [-1, +1], then variance of input to unit j grows with fan-in to j, $f_j$
  - Normalize such that variance of input is equal to $C^2$, i.e., $w_{ji} \leftarrow \frac{C}{\sqrt{f_j}}\, w_{ji}$
  - If input activities lie in [-1, +1], most net inputs will be in [-2C, +2C]

Does Weight Initialization Scheme Matter?
- Trained 3 layer net on 2500 digit patterns
  - 10 output classes, tanh units
  - 20-1600 hidden units
  - automatic adaptation of learning rates
  - 10 minibatches per epoch
  - 500 epochs
  - 12 replications of each architecture
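The two normalization schemes from the Weight Initialization slides can be sketched roughly as follows. This is a minimal illustration, not the code used in the lecture: the function names `init_l1` and `init_fan_in` are hypothetical, `fan_in` is assumed to count the bias weight, and the fan-in scheme assumes the intended scaling is C / sqrt(f_j), which matches the N(0, 4/FanIn) and N(0, 1/FanIn) schemes in the comparison.

```python
import math
import random


def init_l1(fan_in, total=2.0, rng=random):
    """Draw fan_in weights from Normal(0, 1), then rescale so sum(|w|) == total.

    fan_in is assumed to include the bias weight.
    """
    w = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    l1 = sum(abs(x) for x in w)
    return [total * x / l1 for x in w]


def init_fan_in(fan_in, c=1.0, rng=random):
    """Draw fan_in weights from Normal(0, 1), then scale each by c / sqrt(fan_in).

    Each weight then has variance c**2 / fan_in (assumed reading of the slide).
    """
    scale = c / math.sqrt(fan_in)
    return [scale * rng.gauss(0.0, 1.0) for _ in range(fan_in)]


# Example: weights feeding one neuron with 100 incoming connections.
rng = random.Random(0)  # seeded only to make the example repeatable
w_l1 = init_l1(100, rng=rng)
w_fan = init_fan_in(100, c=2.0, rng=rng)
```

With `c=2.0` the second function corresponds to the N(0, 4/FanIn) condition and with `c=1.0` to N(0, 1/FanIn).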
- Weight initialization schemes
  - Gaussian random weights, $N(0, 0.001^2)$
  - Gaussian random weights, $N(0, 0.01^2)$
  - L1 constraint on Gaussian random weights, $\sum_i |w_{ji}| = 2$ [conditioning for worst case]
  - Gaussian weights, $N(0, 4/\mathrm{FanIn})$ [conditioning for average case]
  - Gaussian weights, $N(0, 1/\mathrm{FanIn})$ [conditioning for average case]

Small Random Weights
- Strangeness
  - training set can’t be learned
    - caveat: plotting accuracy, not MSE
  - if there’s overfitting, it doesn’t happen until 200 hidden units

Mike’s L1 Normalization Scheme
- About the same as small random weights

Normalization Based On Fan In
- Perfect performance on training set
- Test set performance dependent on scaling

Conditioning The Input Vectors
- Let $\bar{x}_i$ be the mean activity of input unit i over the training set and $s_i$ the std dev of its activity over the training set
- For each input pattern $\alpha$ (in both training and test sets), normalize by $x_i^{\alpha} \leftarrow (x_i^{\alpha} - \bar{x}_i) / s_i$
- Note that the test set is normalized with the training set mean and std deviation, not its own

Conditioning The Hidden Units
- If you’re using logistic units, then replace logistic output with function scaled from -1 to +1:
  $y = \frac{2}{1 + e^{-net}} - 1$, with $\frac{\partial y}{\partial net} = \frac{1}{2}(1 + y)(1 - y)$
  - With net=0, y=0
  - Will tend to cause biases to be closer to zero and more on the same scale as other weights in network
  - Will also satisfy assumption I make to condition initial weights and weight updates for the units in the next layer
- tanh function (the rescaled logistic above equals $\tanh(net/2)$)

Setting Learning Rates I
- Initial guess for learning rate
  - If error doesn’t drop consistently, lower initial learning rate and try again
  - If error falls reliably but slowly, increase learning rate.
- Toward end of training
  - Error will often jitter, at which point you can lower the learning rate down to 0 gradually to clean up weights
- Remember, plateaus in error often look like minima
  - be patient
  - have some idea a priori how well you expect your network to be doing, and print statistics during training that tell you how well it’s doing
  - plot epochwise error as a function of epoch, even if you’re doing minibatches
  - $\mathrm{NormalizedError} = \frac{\sum_{\alpha} (t^{\alpha} - y^{\alpha})^2}{\sum_{\alpha} (t^{\alpha} - \bar{t})^2}$

Setting Learning Rates II
- Momentum
  - $\Delta w_{t+1} = \theta\, \Delta w_t - (1 - \theta)\, \varepsilon\, \frac{\partial E}{\partial w_t}$
- Adaptive and neuron-specific learning rates
  - Observe error on epoch t-1 and epoch t
  - If decreasing, then increase global learning rate, $\varepsilon_{global}$, by an additive constant
  - If increasing, decrease global learning rate by a multiplicative constant
  - If fan-in of neuron j is $f_j$, then $\varepsilon_j = \varepsilon_{global} / f_j$

Setting Learning Rates III
- Mike’s hack

  Initialization
    epsilon = .01
    inc = epsilon / 10
    if (batch_mode_training)
      scale = .5
    else
      scale = .9

  Update
    if (current_epoch_error < previous_epoch_error)
      epsilon = epsilon + inc
      saved_weights = weights
    else
      epsilon = epsilon * scale
      inc = epsilon / 10
      if (batch_mode_training)
        weights = saved_weights

Setting Learning Rates IV
- rmsprop
  - Hinton lecture
- Exploit optimization methods using curvature
  - Requires computation of Hessian

When To Stop Training
1. Train n epochs; lower learning rate; train m epochs
  - bad idea: can’t assume one-size-fits-all approach
2. Error-change criterion
  - stop when error isn’t dropping
  - My recommendation: criterion based on % drop over a window of, say, 10 epochs
    - 1 epoch is too noisy
    - absolute error criterion is too problem dependent
  - Karl’s idea: train for a fixed number of epochs after criterion is reached (possibly with lower learning rate)
3. Weight-change criterion
  - Compare weights at epochs t-10 and t and test: $\max_i |w_i^{t} - w_i^{t-10}| < \theta$
  - Don’t base on length of overall weight change vector
  - Possibly express as a percentage of the weight
  - Be cautious: small weight changes at critical points can result in rapid drop in error
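The error-change and weight-change stopping criteria above could be checked along these lines. This is only a sketch under stated assumptions: the function names, the 1% drop threshold, and the representation of weights as flat lists are all hypothetical choices for illustration, not values given in the slides.

```python
def percent_drop(errors, window=10):
    """Percent drop in error between epoch t - window and epoch t.

    `errors` holds one epochwise error per epoch, oldest first.
    Returns None until more than `window` epochs have been recorded.
    """
    if len(errors) <= window:
        return None
    old, new = errors[-window - 1], errors[-1]
    if old == 0:
        return 0.0
    return 100.0 * (old - new) / old


def should_stop_on_error(errors, window=10, min_percent_drop=1.0):
    """True when the windowed percent drop falls below the (assumed) threshold."""
    drop = percent_drop(errors, window)
    return drop is not None and drop < min_percent_drop


def max_weight_change(w_now, w_then, relative=False):
    """Largest single-weight change, max_i |w_now[i] - w_then[i]|.

    With relative=True each change is expressed as a fraction of |w_then[i]|.
    """
    changes = []
    for a, b in zip(w_now, w_then):
        d = abs(a - b)
        if relative:
            d = d / max(abs(b), 1e-12)  # guard against division by zero
        changes.append(d)
    return max(changes)
```

For the weight-change test, `w_then` would be a copy of the weights saved 10 epochs earlier, and training would stop once `max_weight_change(...)` falls below a chosen threshold. Using the maximum over individual weights, as the slide suggests, avoids basing the decision on the length of the overall weight change vector.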
