Handwriting, signatures, and convolutions
Ben Graham
University of Warwick, Department of Statistics and Centre for Complexity Science
November 2014

Machine Learning
- Machine learning: building systems that can learn from data
- Classification problems, e.g. handwriting recognition
- In general it is very difficult
- Over the last n years there have been huge improvements:
  - Learning algorithms
    - SVMs
    - Back propagation for non-convex optimization of ANNs
    - Convolutional NNs
  - Computing power / GPUs
  - Challenging datasets
    - MNIST
    - Google Street View House Numbers
    - CIFAR-10, CIFAR-100 and ImageNet
    - CASIA-OLHWDB1.0

Machine learning (Paul Handel, 1931)
[Image slide.]

MNIST (Offline)
- 60,000 training images, 10,000 test images. Greyscale, 28x28x1.
- 784 features: ANN with dropout, 99% (Hinton)
- 28x28 grid: CNN with dropconnect, 99.8% (LeCun)
- 28x28 grid: need ∼3000 training samples for 99% accuracy

Online handwriting recognition
- (For simplicity) Consider an isolated character
- Each character is made up of a number of strokes
- Each stroke is a list of (x, y) coordinates
- Hard if:
  - small #training samples per character class
  - lots of character classes
  - lots of variation between writers.
- Simple strategies:
  - Trace the strokes to produce an offline bitmap
    - takes advantage of offline classifiers
    - morally wrong
  - Produce a low-resolution array of pen-direction histograms
    - need to iron out the characters to approach invariance

n = 10: Pendigits
[Image slide.]

n = 183: Assamese (UCI dataset)
[Two image slides.]

n ≈ 7000: Chinese (CASIA-OLHWDB datasets)
[Image slide.]

Picard Iteration
Solve $Y(0) = 0$, $dY(t) = f(Y(t))\,dX(t)$ by iterating:
  $Y_0(t_0) = 0$
  $Y_1(t_1) = \int_0^{t_1} f(0)\,dX(t_0)$
  $Y_2(t_2) = \int_0^{t_2} f\!\left(\int_0^{t_1} f(0)\,dX(t_0)\right) dX(t_1)$
  $Y_3(t_3) = \int_0^{t_3} f\!\left(\int_0^{t_2} f\!\left(\int_0^{t_1} f(0)\,dX(t_0)\right) dX(t_1)\right) dX(t_2)$
  $\dots$

Tensors
Tensor product:
  $(x_i)_{i=1}^{a} \otimes (y_j)_{j=1}^{b} = (x_i y_j)_{i=1,j=1}^{a,b}$, with $\mathbb{R}^a \otimes \mathbb{R}^b \equiv \mathbb{R}^{ab}$.
Example: probability distributions of independent random variables,
  $X \in \{1,\dots,a\}$, $\mathcal{L}(X) \in \mathbb{R}^a$;  $Y \in \{1,\dots,b\}$, $\mathcal{L}(Y) \in \mathbb{R}^b$;
  $\mathcal{L}(X,Y) = \mathcal{L}(X) \otimes \mathcal{L}(Y) \in \mathbb{R}^{ab}$.

Iterated integrals
Driving path $X : [0,1] \to \mathbb{R}^d$. The $n$-th iterated integral is
  $X^n_{0,1} = \int_{0<u_1<\dots<u_n<1} dX(u_1) \otimes \dots \otimes dX(u_n) \in \mathbb{R}^{d^n}$.

Path signature
  $S(X)_{s,t} = (1, X^1_{s,t}, X^2_{s,t}, \dots)$
Picard iteration: $f \mapsto$ linear operators $(f^{(n)})$ depending on $f$, with
  $Y(1) = \sum_{n=0}^{\infty} f^{(n)} X^n_{0,1}$.
Computationally: Path → Signature is easy; Signature → Path is hard.

Sound and rough paths
[Figures: sound waveform and its spectrogram (frequency vs. time).]
- Lyons and Sidorova: sound compression via the signature. Limiting factor: signature → path.
- Papavasiliou: sound recognition from the signature; time lag.

Calculating signatures
- If $X_{s,t}$ is a straight line from 0 to $x$ then
  $S(X)_{s,t} = \left(1,\ x,\ \frac{x \otimes x}{2!},\ \frac{x \otimes x \otimes x}{3!},\ \dots\right)$.
- Chen's identity (see the sketch below): if $s, t, u \in \mathbb{R}$ then
  $X^n_{s,u} = \sum_{k=0}^{n} X^k_{s,t} \otimes X^{n-k}_{t,u}, \qquad n = 0, 1, 2, \dots$
- Higher-order iterated integrals are Hölder smoother.

Log signatures
- Tensor log: $\log(1+x) = -\sum_{n \ge 1} (-x)^n / n$
- Free Lie algebra: bracket $[\cdot,\cdot] : \mathfrak{g} \times \mathfrak{g} \to \mathfrak{g}$
  - $[ax+by, z] = a[x,z] + b[y,z]$, $[z, ax+by] = a[z,x] + b[z,y]$
  - $[x,x] = 0$
  - $[x,[y,z]] + [z,[x,y]] + [y,[z,x]] = 0$
- Dimensionality reduction, Hall basis:
  $1,\ 2,\ [1,2],\ [1,[1,2]],\ [2,[1,2]],\ [1,[1,[1,2]]],\ [2,[1,[1,2]]],\ [2,[2,[1,2]]]$
- Baker–Campbell–Hausdorff formula for efficient computation?
  $\log(e^X e^Y) = \sum_{n>0} \frac{(-1)^{n+1}}{n} \sum_{\substack{r_i+s_i>0,\ 1\le i\le n}} \frac{\left(\sum_i (r_i+s_i)\right)^{-1}}{r_1! s_1! \cdots r_n! s_n!} \, [X^{r_1} Y^{s_1} \cdots X^{r_n} Y^{s_n}]$
  $= X + Y + \frac{1}{2}[X,Y] + \frac{1}{12}\left([X,[X,Y]] + [Y,[Y,X]]\right) - \frac{1}{24}[Y,[X,[X,Y]]] - \frac{1}{720}(\dots) + \dots$
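The two facts on the "Calculating signatures" slide are enough to compute signatures numerically. Below is a minimal numpy sketch (my own, not the talk's code): it builds the truncated signature of a piecewise-linear stroke from the closed-form signature of each straight segment, combining segments with Chen's identity. The helper names (sig_segment, chen, signature) are illustrative only.

# Sketch (mine, not the talk's code): truncated signature of a piecewise-linear
# path in R^d, built segment by segment with Chen's identity.
import numpy as np

def sig_segment(x, m):
    """Signature of a straight segment with increment x:
    (1, x, x⊗x/2!, ..., x^{⊗m}/m!), stored as a list of tensors."""
    sig = [np.array(1.0)]
    for n in range(1, m + 1):
        sig.append(np.multiply.outer(sig[-1], x) / n)
    return sig

def chen(a, b):
    """Chen's identity: X^n_{s,u} = sum_k X^k_{s,t} ⊗ X^{n-k}_{t,u}."""
    m = len(a) - 1
    return [sum(np.multiply.outer(a[k], b[n - k]) for k in range(n + 1))
            for n in range(m + 1)]

def signature(points, m):
    """Truncated (level m) signature of the piecewise-linear path through `points`."""
    points = np.asarray(points, dtype=float)
    sig = sig_segment(points[1] - points[0], m)
    for i in range(1, len(points) - 1):
        sig = chen(sig, sig_segment(points[i + 1] - points[i], m))
    return sig

# Example: a 2d pen stroke; the level-3 signature has 1 + 2 + 4 + 8 components.
stroke = [(0.0, 0.0), (0.3, 0.1), (0.5, 0.6), (1.0, 1.0)]
S = signature(stroke, m=3)
print([s.shape for s in S])   # [(), (2,), (2, 2), (2, 2, 2)]

For a 2d stroke truncated at level m this produces the 2^{m+1} − 1 coordinates (including the constant 1) that are used as features later in the talk.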
Uniqueness
$S(X)_{s,t}$ characterizes $X_{s,t}$, more or less.

Theorem (Ben Hambly, Terry Lyons 2010). Let $\alpha, \beta : [s,t] \to \mathbb{R}^d$ be two paths of bounded variation. Then $S(\alpha)_{s,t} = S(\beta)_{s,t}$ if and only if $\alpha * \beta^{-1}$ is tree-like.

Given the signature, there is a unique path of bounded variation with minimal length.

Inverting signatures
Inverting signatures is hard.

Theorem (Terry Lyons, Weijun Xu, 2014). Using symmetrization, you can recover any $C^1$ path from its signature.
- Only uses the 2n n-th order iterated integrals.

Intuition
- Consider an increasing 2d path $(x(t), y(t))_{t \in [0,1]}$ with $\dot{x}(t), \dot{y}(t) > 0$.
- Consider a Poisson process producing letters x at rate $\dot{x}(t)$ and y at rate $\dot{y}(t)$: $W = xxyxyx \dots xy$.
- $P(W = w \mid |w| = n) \propto X_{s,t}(w)$

Rotational invariants of the signature
Theorem (Joscha Diehl 2013). Rotational invariants for paths in $\mathbb{R}^2$:
  $\tfrac{1}{2} C(xx) + \tfrac{1}{2} C(yy)$   (displacement squared)
  $\tfrac{1}{2} C(xy) - \tfrac{1}{2} C(yx)$   (enclosed area)
Plus 3 of order 4. Plus 7 of order 6.
Can be used for rotation-invariant character recognition!

Characters as paths
Pretend the pen never left the writing surface. Normalize to get $X : [0,1] \to [0,1]^2$.
The signature truncated at level $m$, $S(X)^m_{0,1}$, has dimension $2 + 2^2 + \dots + 2^m = 2^{m+1} - 2$.
- 1D: consider a sliding window of truncated signatures $\{S(X)^m_{(i-k)/n,\, i/n} : i = k, k+1, \dots, n\}$.
- 2D: calculate the sliding windows for each stroke and draw them in a square grid.

Signature of characters
- Random forest
  - A classifier composed of many random decision trees
  - Trees constructed iid and form a democracy
  - Each tree sees a random subset of the data.
  - Tree branches iteratively use individual features.
  - From a random subset of features, the most informative feature is used to split the dataset roughly 50:50.
1000 trees. Error %:

  m          1     2     3     4     5     6     7     8     9
  #Features  2     6     14    30    62    126   254   510   1022
  Pendigits  47.5  18.7  7.5   6.0   4.4   3.7   3.4   3.1   3.2
  Assamese   93.4  85.2  70.2  64.1  56.9  53.1  49.5  47.6  46.7

An ink dimension
- Signatures are invariant w.r.t. tree-like excursions.
- Useful to know if the pen is on the paper?
- Solution: add a third dimension = ink used.
1000 trees. Error %:

  m          1     2     3     4     5     6     7
  #Features  3     12    39    120   363   1092  3279
  Pendigits  39.7  11.1  6.1   3.7   2.5   1.7   1.5
  Assamese   89.8  69.9  56.8  48.1  42.3  37.2  34.2

ANN + translations:

  m          1     2     3     4     5
  #Features  3     12    39    120   363
  Assamese   87.1  64.7  39.9  28.8  21.9

Artificial neural networks
- Directed weighted graph.
- For each node: output $= \sigma(b + \sum_i w(i)\,\mathrm{input}(i))$.
- For classification, the final layer is weighted to give a probability distribution.
[Figure: the sigmoid σ plotted over [−4, 4].]

  input ∈ R^a
  hidden1 = σ(input · W1 + B1) ∈ R^b
  hidden2 = σ(hidden1 · W2 + B2) ∈ R^c
  hidden3 = σ(hidden2 · W3 + B3) ∈ R^d
  output = softmax(hidden3 · W4 + B4) ∈ R^e

#Parameters: $(a+1) \times b + (b+1) \times c + (c+1) \times d + (d+1) \times e$

Boolean functions
If we normalize all the features, it looks a bit like we are dealing with Boolean functions, i.e. $f : \{0,1\}^n \to \{0,1\}$.

Building block Boolean functions

  x | NOT x        x y | x AND y | x OR y
  0 |   1          0 0 |    0    |   0
  1 |   0          0 1 |    0    |   1
                   1 0 |    0    |   1
                   1 1 |    1    |   1

  NOT a   = σ(−20a + 10)
  a AND b = σ(20a + 20b − 30)
  a OR b  = σ(20a + 20b − 10)

Shallow networks bad, deep networks good
- MNIST: 1-layer NN, 12.0% (LeCun)
- The XOR function cannot be represented by a 1-layer NN (see the sketch below).
- Almost all Boolean functions have exponential circuit complexity (Shannon).
- Some functions can be represented far more efficiently by DEEP Boolean formulae.
- Parity, $(x_1, \dots, x_n) := x_1 + \dots + x_n \bmod 2$:
  - depth $O(\log n)$ → size $O(n)$
  - fixed depth → size exponential in $n$ (Håstad)
- In the brain there are ∼20 layers of neurons between seeing and recognizing.
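To make the depth point concrete, here is a small numpy sketch (mine, not from the talk) that composes the σ-gates from the "Building block Boolean functions" slide into XOR. The NAND gate is my own addition in the same spirit; the point is that each gate is a single σ unit, but XOR needs a hidden layer.

# Sketch (mine, not the talk's): the sigmoid "gates" from the slide above,
# plus a NAND gate in the same spirit, composed into a two-layer XOR.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def NOT(a):      return sigma(-20 * a + 10)
def AND(a, b):   return sigma(20 * a + 20 * b - 30)
def OR(a, b):    return sigma(20 * a + 20 * b - 10)
def NAND(a, b):  return sigma(-20 * a - 20 * b + 30)   # NOT composed with AND

def XOR(a, b):
    # Hidden layer: OR(a,b) and NAND(a,b).  Output layer: AND of the two.
    # No choice of weights lets a single sigma unit compute XOR directly.
    return AND(OR(a, b), NAND(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", round(float(XOR(a, b))))   # prints 0, 1, 1, 0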
What do the layers do?
ANN training roughly does the following:
- Find features that will be useful, e.g. the curve at the top of a 2 or 3.
- Calculate how the features correspond to the different classes:
  - H1: features weakly correlated with being a 2 or a 3.
  - H2: look at how many hidden-layer-1 features seem to indicate 2-ness.
  - H3: look at how many hidden-layer-2 features seem to indicate 2-ness.
  - output: weigh the evidence.
- Learning is actually top down:
  - Start with random weights.
  - Forward propagate input → output.
  - Errors at the top are back-propagated down (chain rule).

What do the layers do?
[Figure: sparse features (Ranzato, Boureau, LeCun).]

Convolutional Neural Networks
- LeCun, Bottou, Bengio, Haffner 1998
- Spatial pooling using "max-pool"
- Shared weights within each layer
- Easy to train
- Spatial invariances encoded

Representing pen strokes for CNNs
Motivation:
- The 8 × 8 × 8 grid for Chinese character recognition
- Convolutional networks
- Rough path theory
Algorithm:
- Normalize the 2d strokes $X_i : [0, l(i)] \to [0,1]^2$.
- Initialize a $(2^{m+1} - 1) \times k \times k$ array to all zeros.
- Calculate $S(X_i)^m_{t-\varepsilon,\, t+\varepsilon}$, truncated at level $m$.
- Put this information into the array.
- Put the array into a CNN.
Sparsity: most of the columns of the array are zero, so the cost of evaluating the CNN is greatly reduced.

CASIA-OLHWDB1.1
- 3755 Chinese characters
- 240 training samples
- 60 test samples
Results:
- 8x8x8 method and MQDF classifier: 7.61%
- CNN: 5.61% (Ciresan et al.)
- Sparse signature CNN: 3.59% (G.)

Convolutional architectures
- Input 28x28x1
- 20 5x5 convolutional filters: 24x24x20
- 2x2 pooling: 12x12x20
- 50 5x5 convolutional filters: 8x8x50
- 2x2 pooling: 4x4x50 ≡ 800
- Fully connected layer: 500
- Output: #classes
input-20C5-MP2-50C5-MP2-500N-output

Ciresan, Meier and Schmidhuber
- Input 48x48x1
- 100 3x3 convolutional filters: 46x46x100
- 2x2 pooling: 23x23x100
- 200 2x2 convolutional filters: 22x22x200
- 2x2 pooling: 11x11x200
- 300 2x2 convolutional filters: 10x10x300
- 2x2 pooling: 5x5x300
- 400 2x2 convolutional filters: 4x4x400
- 2x2 pooling: 2x2x400 ≡ 1600
- Fully connected layer: 500
- Output: #classes
input-100C3-MP2-200C2-MP2-300C2-MP2-400C2-MP2-500N-output

DeepCNets(l,k)
- input-
- (k)C3-MP2-
- (2k)C2-MP2-
- ...
- (lk)C2-MP2-
- (l+1)k N-
- output
(Layer shapes are sketched in the code below.)

Sparse DeepCNets
- The convolutional and pooling operations can be memoized.
- The computation bottleneck becomes the top of the network.
- Normally it is the other way round!
- GPU: 3000 MNIST digits/second; 200 Chinese characters/second.

CIFAR-10
- 50,000 training images, 10,000 test images. Colour, 32x32x3.
- Kaggle competition
- Top entry is 92.61% accuracy
- Can you do better? (4 months to go)
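As a quick consistency check on the DeepCNet(l, k) pattern, here is a small Python sketch (mine, not the talk's Theano code) that just tracks spatial size and channel count layer by layer, assuming "valid" convolutions and exact 2x2 pooling.

# Sketch (mine): layer-shape bookkeeping for a DeepCNet(l, k):
# input - kC3 - MP2 - 2kC2 - MP2 - ... - lkC2 - MP2 - (l+1)k N - output.
def deepcnet_shapes(l, k, input_size):
    size, layers = input_size, []
    for i in range(1, l + 1):
        filt = 3 if i == 1 else 2              # one C3 layer, then C2 layers
        size -= filt - 1                       # 'valid' convolution
        layers.append(f"{i * k}C{filt}: {size}x{size}x{i * k}")
        assert size % 2 == 0, "choose input_size (e.g. 3 * 2**l) so MP2 is exact"
        size //= 2                             # 2x2 max-pooling
        layers.append(f"MP2:   {size}x{size}x{i * k}")
    layers.append(f"fully connected: {(l + 1) * k}N")
    return layers

# With l=4, k=100 and a 48x48 input (48 = 3 * 2^4) this reproduces the
# Ciresan, Meier and Schmidhuber layer sizes listed above; DeepCNet(4, 20)
# follows the same pattern with 20, 40, 60, 80 filters and a 100-unit top layer.
for line in deepcnet_shapes(4, 100, 48):
    print(line)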
Frogs and Horses
[Three image slides: CIFAR-10 examples of frogs and horses.]

CIFAR-100
100 categories in 20 superclasses:

  aquatic mammals: beaver, dolphin, otter, seal, whale
  fish: aquarium fish, flatfish, ray, shark, trout
  flowers: orchids, poppies, roses, sunflowers, tulips
  food containers: bottles, bowls, cans, cups, plates
  fruit and vegetables: apples, mushrooms, oranges, pears, sweet peppers
  household electrical devices: clock, computer keyboard, lamp, telephone, television
  household furniture: bed, chair, couch, table, wardrobe
  insects: bee, beetle, butterfly, caterpillar, cockroach
  large carnivores: bear, leopard, lion, tiger, wolf
  large man-made outdoor things: bridge, castle, house, road, skyscraper
  large natural outdoor scenes: cloud, forest, mountain, plain, sea
  large omnivores and herbivores: camel, cattle, chimpanzee, elephant, kangaroo
  medium-sized mammals: fox, porcupine, possum, raccoon, skunk
  non-insect invertebrates: crab, lobster, snail, spider, worm
  people: baby, boy, girl, man, woman
  reptiles: crocodile, dinosaur, lizard, snake, turtle
  small mammals: hamster, mouse, rabbit, shrew, squirrel
  trees: maple, oak, palm, pine, willow
  vehicles 1: bicycle, bus, motorcycle, pickup truck, train
  vehicles 2: lawn-mower, rocket, streetcar, tank, tractor

- 50,000 training images (i.e. 500/class), 10,000 test images, 32x32x3
- 25M parameters, 70.2% accuracy

CIFAR-100
[Three image slides of CIFAR-100 examples.]

ImageNet 2010
[Image slide.]

Example Neural Networks
- Theano CNN - DeepCNet(4,20)
- ann.py
- OverFeat: CUDA (NVIDIA) C extension
- CUDA: matrix multiplication

Tricks
- Minibatches
  - When calculating gradients, don't use the whole dataset.
  - Use a subset of size ∼100.
  - Much quicker.
  - The noise in the gradients stops you getting stuck.
- Rectified linear units
  - Positive part function
  - $P(d\,\mathrm{ReLU}(x)/dx = 1) \approx 1/2$

Tricks
- Dropout
  - Deleting many of the hidden nodes during training forces the network to be more robust.
  - Delete half of the hidden units during training (and maybe some of the input).
  - Halve W1, W2, ... during testing to balance things out.
  - Back propagation is adjusted accordingly.
  - Robust natural process that deletes 50% of the available data??
- Nesterov's Accelerated Gradient
  - A momentum method, similar to
      $v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t)$
      $\theta_{t+1} = \theta_t + v_{t+1}$
    but looking ahead (see the sketch below):
      $v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t + \mu v_t)$
      $\theta_{t+1} = \theta_t + v_{t+1}$
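A minimal numpy sketch (mine, not from the talk) of the two updates above on a toy quadratic; the only difference is where the gradient is evaluated.

# Sketch (mine): classical momentum vs. Nesterov's accelerated gradient,
# on a toy quadratic f(theta) = 0.5 * theta^T A theta.
import numpy as np

A = np.diag([1.0, 50.0])                    # badly conditioned toy problem
def grad(theta):
    return A @ theta                        # gradient of f

def run(nesterov, mu=0.9, eps=0.01, steps=200):
    theta = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        lookahead = theta + mu * v if nesterov else theta
        v = mu * v - eps * grad(lookahead)  # v_{t+1} = mu v_t - eps grad f(...)
        theta = theta + v                   # theta_{t+1} = theta_t + v_{t+1}
    return np.linalg.norm(theta)

print("momentum :", run(nesterov=False))
print("nesterov :", run(nesterov=True))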