How to win big by thinking straight about relatively trivial problems Tony Bell University of California at Berkeley Density Estimation Make the model like the reality by minimising the Kullback-Leibler Divergence: by gradient descent in a parameter of the model THIS RESULT IS COMPLETELY GENERAL. : The passive case ( = 0) For a general model distribution written in the ‘Energy-based’ form: energy partition function (or zeroth moment...) the gradient evaluates in the simple ‘Boltzmann-like’ form: learn on data while awake unlearn on data while asleep The single-layer case Shaping Density Many problems solved by modeling in the transformed space Linear Transform Learning Rule (Natural Gradient) for non-loopy hypergraph The Score Function is the important quantity Conditional Density Modeling To model use the rules: This little known fact has hardly ever been exploited. It can be used instead of regression everywhere. Independent Components, Subspaces and Vectors ICA ISA IVA DCA (ie: score function hard to get at due to Z) IVA used for audio-separation in real room: Score functions derived from sparse factorial and radial densities: Results on real-room source separation: Why does IVA work on this problem? Because the score function, and thus the learning, is only sensitive to the amplitude of the complex vectors, representing correlations of amplitudes of frequency components associated with a single speaker. Arbitrary dependencies can exist between the phases of this vector. Thus all phase (ie: higher-order statistical structure) is confined within the vector and removed between them. It’s a simple trick, just relaxing the independence assumptions in a way that fits speech. But we can do much more: • build conditional models across frequency components • make models for data that is even more structured: Video is [time x space x colour] Many experiments are [time x sensor x task-condition x trial] channel 1-16, time 0-8 channel 17-32, time 0-8 channel 1-16, time 0-8 channel 1-16, time 0-1 channel 17-32, time 0-1 channel 1-16, time 0-1 channel 17-32, time 0-1 channel 33-48, time 0-1 The big picture. Behind this effort is an attempt to explore something called “The Levels Hypothesis”, which is the idea that in biology, in the brain, in nature, there is a kind of density estimation taking place across scales. To explore this idea, we have a twofold strategy: 1. EMPIRICAL/DATA ANALYSIS: Build algorithms that can probe the EEG across scales, ie: across frequencies 2. THEORETICAL: Formalise mathematically the learning process in such systems. A Multi-Level View of Learning LEVEL UNIT DYNAMICS LEARNING ecology society predation, symbiosis natural selection society organism behaviour sensory-motor learning organism cell protein cell spikes synaptic plasticity protein direct, voltage, Ca, 2nd messenger molecular change molecular forces gene expression, protein recycling amino acid ( = STDP) Increasing Timescale LEARNING at a LEVEL is CHANGE IN INTERACTIONS between its UNITS, implemented by INTERACTIONS at the LEVEL beneath, and by extension resulting in CHANGE IN LEARNING at the LEVEL above. Interactions=fast Learning=slow Separation of timescales allows INTERACTIONS at one LEVEL to be LEARNING at the LEVEL above. 1 Infomax between Layers. (eg: V1 density-estimates Retina) y 2 Infomax between Levels. (eg: synapses density-estimate spikes) t V1 all neural spikes synapses, dendites synaptic weights x retina • square (in ICA formalism) • feedforward • information flows within a level • predicts independent activity • only models outside input This SHIFT in looking at the problem alters the question so that if it is answered, we have an unsupervised theory of ‘whole brain learning’. y all synaptic readout • overcomplete • includes all feedback • information flows between levels • arbitrary dependencies • models input and intrinsic activity pdf of all synaptic ‘readouts’ pdf of all spike times If we can make this pdf uniform then we have a model constructed from all synaptic and dendritic causality Formalisation of the problem: IF p is the ‘data’ distribution, q is the ‘model’ distribution w is a synaptic weight, and I(y,t) is the spike synapse mutual information THEN if we were doing classical Infomax, we would use the gradient: (1) BUT if one’s actions can change the data, THEN an extra term appears: (2) It is easier to live in a world where one can change the world to fit the model, as well as changing one’s model to fit the world therefore (2) must be easier than (1). This is what we are now researching.