Searching for simple models

9th Annual Pinkel Endowed Lecture
Institute for Research in Cognitive Science, University of Pennsylvania
Friday 20 April 2007

William Bialek
Joseph Henry Laboratories of Physics, and Lewis-Sigler Institute for Integrative Genomics, Princeton University
http://www.princeton.edu/~wbialek/wbialek.html

When we think about (and act upon) the things that we see (and hear, and ...), we put them into categories: stools, office chairs, dining chairs, benches, lounge chairs. [All images from Design Within Reach (http://dwr.com).]

In the dark of night, vision is based on signals (only) from the rod photoreceptor cells. Consider the responses to dim flashes of light. [Figure: many panels of rod photoreceptor current (pA) vs. time relative to the flash (s), showing the trial-to-trial variability of responses to identical dim flashes.] Salamander rods (not that it really matters). Rod image
and current data from FM Rieke; salamander image from MJ Berry II. [The rod in the image is ~20 microns long.]

The brain has the problem of categorizing these responses! Remember Hecht, Shlaer & Pirenne: Energy, quanta and vision, J Gen Physiol 25, 819 (1942). Categories of rod cell response should correspond to zero, one, two, ... photons.

[Figure: Hecht, Shlaer & Pirenne's frequency-of-seeing data. Probability of seeing vs. (inferred) mean number of photons at the retina; the x-axis is proportional to the light intensity of the stimulus flash. The solid line (here K = 6) is a model in which the observer "sees" when more than K photons are counted at the retina; the distribution of counts is determined by the physics of the light source.] In this regime, our visual perception is controlled by the random arrival of individual photons at the retina.

Try to categorize based simply on the current at the peak time. [Figure: probability density (1/pA) of the current at t_peak (pA), with overlapping peaks for 0, 1, and 2 photons; the overlaps are marked "?".] This gives the right idea, but the simplest approach leaves substantial ambiguities. Is there a better strategy?

The raw data for categorization is the current I(t): many time points = many dimensions, and better categories = better boundaries in this multi-dimensional space. Try planar boundaries, that is, decisions based on the output of a linear filter; the best filter is determined by the signal and noise properties of the rods themselves. [Figure: probability density of the filtered current, with nearly non-overlapping peaks for 0, 1, and 2 photons.] The optimal filter resolves almost all of the ambiguity.

If there is a unique optimal filtering strategy for processing the rod cell signals, the retina should use this strategy ... this is a parameter-free prediction! [Figure: predicted vs. measured bipolar cell response, together with the measured rod (voltage) response, plotted as normalized voltage vs. time after the light flash (seconds).]

Optimal filtering in the salamander retina.
F Rieke, WG Owen & W Bialek, in Advances in Neural Information Processing 3, R Lippman, J Moody & D Touretzky, eds, pp 377-383 (Morgan Kaufmann, San Mateo CA, 1991).

Categorizing rod responses might be analogous to categorizing images of chairs, but sometimes "simpler" animals actually have to solve the same problems that we do; not as different as Mr Larson thinks.

Place a small wire in the back of the fly's head to "listen in" on the electrical signals from nerve cells that respond to movement. [Spikes: Exploring the Neural Code, F Rieke, D Warland, RR de Ruyter van Steveninck & WB (MIT Press, 1997).]

The fly has to solve (at least) two problems: estimate motion from the "movie" on the retina, and represent or encode the result in the sequence of spikes. Optimization principles (as with optimal filtering above): estimates should be as accurate as possible, and coding in spikes should be matched to the input signals. The focus here will be on extracting a feature rather than building its representation, i.e., estimation theory rather than information theory (maybe a mistake for this talk).

Does the fly make accurate estimates of motion? We can get at this by decoding the spikes from the motion-sensitive neurons with a linear filter F,

    v_est(t) = Σ_i F(t − t_i),

and then looking at the power spectrum of errors N(ω) in the reconstructed signal, compared with the minimum level of errors N_min(ω) set by diffraction blur and photoreceptor noise (this includes averaging over ~3000 receptors!). [Reading a neural code, WB, F Rieke, RR de Ruyter van Steveninck & D Warland, Science 252, 1854 (1991).] The fly approaches optimal estimation on short time scales (high frequencies), which are the ones that actually matter for behavior.

What computation must the fly do in order to achieve optimal motion estimates? Motion estimation takes photoreceptor signals as inputs (not velocity!); after several layers of processing ...
output is an estimate of velocity. We can actually solve the optimal estimation problem in some limiting cases (we also need hypotheses about the statistical structure of the world). [Statistical mechanics and visual signal processing, M Potters & WB, J Phys I France 4, 1755 (1994).]

At high signal-to-noise ratios, velocity is just the ratio of temporal and spatial derivatives:

    v_est(t) ≈ [Σ_i (dV_i/dt)(V_i − V_{i+1})] / [Σ_i (V_i − V_{i+1})²]  →  (∂V/∂t)/(∂V/∂x).

At low signal-to-noise ratios, the only reliable velocity signal is spatiotemporal correlation:

    v_est(t) ≈ Σ_{ij} ∫dτ ∫dτ′ V_i(t − τ) K_ij(τ, τ′) V_j(t − τ′) + ···

Optimal estimation is always a tradeoff between systematic and random errors (think about averaging over time in the lab!). The optimal estimator is not perfect ... and can even "see" motion when nothing moves (flies see it move too!). [Visual stimuli from RR de Ruyter van Steveninck.] One can go further with random stimuli to dissect the computation. [Features and dimensions: Motion estimation in fly vision, WB & RR de Ruyter van Steveninck, http://arXiv.org/q-bio/0505003 (2005).]

Almost everything interesting that the brain does involves LOTS of neurons. How do we think about these networks as a whole? Imagine slicing time into little windows, like the frames of a movie. In each frame, each cell either spikes, or it doesn't. New experimental methods make it possible to "listen in" on many neurons at once (MJ Berry II).
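Returning for a moment to the fly's high-SNR limit above, here is a minimal sketch (mine, not from the talk) of estimating velocity as the ratio of temporal to spatial derivatives; the drifting pattern, receptor spacing, and all parameters are invented for illustration.

```python
import numpy as np

def gradient_velocity_estimate(V_prev, V_now, dt, dx):
    """High-SNR limit: v ~ -(dV/dt)/(dV/dx), pooled across the
    receptor array as a ratio of summed products for stability."""
    dVdt = (V_now - V_prev) / dt        # temporal derivative (forward difference)
    dVdx = np.gradient(V_now, dx)       # spatial derivative across receptors
    return -np.sum(dVdt * dVdx) / np.sum(dVdx ** 2)

# an invented smooth pattern drifting at v = 2.0 (arbitrary units)
v_true, dt, dx = 2.0, 0.01, 0.1
x = np.arange(0, 10, dx)                # "photoreceptor" positions
pattern = lambda t: np.sin(2 * np.pi * (x - v_true * t) / 5.0)

v_hat = gradient_velocity_estimate(pattern(0.0), pattern(dt), dt, dx)
print(v_hat)                            # close to 2.0
```

With noise added to the receptor signals, the pointwise ratio of derivatives becomes unstable, which is exactly why the low-SNR optimum switches to a correlator.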
States of the network (these cells are from a salamander retina; not crucial, but cute):

                  a moment ago    now          a moment later
    neuron #1     spike           no spike     no spike
    neuron #2     no spike        spike        no spike
    neuron #3     spike           no spike     spike
    neuron #4     no spike        no spike     spike
    neuron #5     no spike        spike        no spike
    neuron #6     no spike        no spike     spike
    neuron #7     no spike        spike        no spike
    neuron #8     spike           no spike     no spike
    neuron #9     no spike        no spike     no spike
    neuron #10    no spike        spike        spike

    state:        1010000100      0100101001   0011010001

These are the "words" with which the retina tells the brain what we see! How big is the vocabulary? 10 cells = 1024 possible words; 100 cells = 2^100 = 1,267,650,600,228,229,401,496,703,205,376 possible words. An important insight from theory: if it really is as complicated as it possibly could be, you'll never understand it. Progress = simplification.

Simplifying hypothesis #1: every cell does its thing independently of all the others. This works surprisingly well if you look just at pairs of cells (correlations are only ~10%), but fails dramatically if you look at 40 cells (or in detail at 10 cells). [Figure: probability that K cells spike together, independent model vs. experiment; the independent model falls through 1/100, 1/10,000, down to one in a million, while the data stay far above it. C Boutin & J Jameson, Princeton Weekly Bulletin, May 1, 2006.]

Simplifying hypothesis #2: if 10 cells spike together, there must be something special about those 10 cells ... (actually, not a simplification).

Simplifying hypothesis #3: cells cooperate, but only "talk to each other" 2 by 2 ... collective actions emerge from all pairwise discussions. Take seriously the weak correlations among many pairs (compare w/ cortex!), and build the least structured model consistent with these correlations (least structured = maximum entropy). [Weak pairwise correlations imply strongly correlated network states in a neural population, E Schneidman, MJ Berry II, R Segev & WB, Nature 440, 1007 (2006).]
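The "least structured model" logic can be made concrete on a toy network. This sketch (mine, not from the talk; the fields and couplings are invented) enumerates all states of a small binary network under the pairwise maximum-entropy form and checks that matching correlations pulls the entropy below that of an independent model with the same firing rates.

```python
import itertools
import numpy as np

def pairwise_maxent(h, J):
    """Brute-force pairwise maximum-entropy (Ising) distribution
    P(s) ~ exp(h.s + 0.5 s.J.s) over all 2^N states s in {-1,+1}^N."""
    states = np.array(list(itertools.product([-1, 1], repeat=len(h))))
    logw = states @ h + 0.5 * np.einsum('si,ij,sj->s', states, J, states)
    w = np.exp(logw - logw.max())       # subtract max for numerical safety
    return states, w / w.sum()

def entropy_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# invented parameters: N = 5 cells biased toward silence,
# with weak positive couplings between every pair
N = 5
h = -0.5 * np.ones(N)
J = 0.2 * (np.ones((N, N)) - np.eye(N))

states, p = pairwise_maxent(h, J)
S_pair = entropy_bits(p)

# independent model matched to the same mean activities <sigma_i>
q = (1 + p @ states) / 2                # P(sigma_i = +1) for each cell
S_ind = sum(entropy_bits(np.array([qi, 1 - qi])) for qi in q)

print(S_pair, S_ind)    # correlations pull the entropy below the independent value
```

The brute-force enumeration is exactly the "check the whole distribution" step, feasible only for small N; for tens of neurons one needs Monte Carlo or mean-field methods.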
The model we are looking for (minimal structure to match the pairwise correlations) is exactly the Ising model of statistical mechanics:

    σ_i = +1 if neuron i fires a spike, σ_i = −1 if neuron i is silent;
    state of the entire network: σ_1, σ_2, ···, σ_N ≡ {σ_i};
    distribution of network states (words):

    P(σ_1, σ_2, ···, σ_N) = (1/Z) exp[ Σ_i h_i σ_i + (1/2) Σ_{i≠j} J_ij σ_i σ_j ].

For N = 10 we can check the whole distribution! [Figure: an entropy scale, running from the entropy of independent neurons (as suggested by the weak correlations) down to the actual entropy; the maximum-entropy model given the pair correlations — the pairwise (Ising) model — sits close to the actual value, capturing ~90% of the structure.]

But this is also the Hopfield model of memory! Are there "stored patterns"? Moving to larger networks opens a much richer structure: with N = 40 neurons we have multiple ground states (= stored patterns), and the network returns to the same basin of attraction when we play the movie again, even if the microstate is different. Observed groups of 20 cells are typical of ensembles generated by drawing means and correlations at random from the observed distribution ... which suggests extrapolation to larger N. Is the system poised near a critical point? (Test via an adaptation experiment!) The model built from real data can be thought of as being at temperature T = 1; study the specific heat vs T (and integrate to get the entropy). [Ising models for networks of real neurons, G Tkacik, E Schneidman, MJ Berry II & WB, http://arXiv.org/q-bio.NC/0611072 (2006).]

How seriously should you take these maximum entropy models? Can we use them to describe more complex phenomena? Try words as networks of letters: take one year of Reuters' articles (~90 million words), choose the 5000 most common words, and select the subset of four-letter words (626 distinct words). The full probability distribution has 26^4 ~ half a million elements; the max-ent model consistent with pairwise correlations has ~6 × 26^2 ~ 4000 parameters, 100× fewer (!). Recall that spelling rules have a very "combinatorial" feel ...
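The parameter counting for four-letter words above is easy to reproduce; this is a back-of-envelope check (mine, not Reuters data), counting one 26×26 coupling table per pair of letter positions and ignoring the smaller single-letter terms.

```python
from math import comb

A, L = 26, 4                  # alphabet size, word length

full = A ** L                 # elements of the full distribution over 4-letter strings
pairwise = comb(L, 2) * A**2  # one 26x26 coupling table for each of the C(4,2) = 6
                              # pairs of letter positions

print(full, pairwise, full // pairwise)   # 456976 4056 112
```

So "half a million elements" vs "~4000 parameters, 100× fewer" checks out.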
If all letters were used equally, the entropy would be 4 × log2(26) = 18.80 bits; taking account of letter frequencies, the entropy of independent letters is 14.59 bits; the entropy of the actual distribution is 7.42 bits; so the multi-information is 7.17 bits. The max-ent model captures 6.23 bits, or ~87% of the structure. Inevitably, the model assigns nonzero probability to words not seen in the data ... (remember the data are limited to the 5000 most common words): rite, hove, rase, lost, hive, mave, wark, whet, lise, tame, leat, fave, tike, pall, meek, nate, mast, hale, sime, gave, tome, ... [Toward a statistical mechanics of four letter words, GJ Stephens & WB, in progress (2006).]

What problem is the brain solving? Classification (e.g., of rod responses); estimating a feature (e.g., motion) ... In many (simple) cases there is evidence for near-optimal performance (many examples I didn't discuss). If we take this seriously, we have a theory of what the brain should compute; the key qualitative prediction is context dependence. But: why these features? (the laundry list problem) Is there a unifying theme for the problems that the brain solves well? Are all the problems really the problem of prediction?

How do neurons cooperate in networks? A common observation is that pairs of neurons are only weakly correlated or anti-correlated, but there are LOTS of pairs. From (simple) statistical mechanics models: if all pairs interact, "weak" means << 1/N, not << 1. Minimally structured models consistent with "weak" correlations predict dramatic collective states: memories, critical points ... (exotica implied by modest phenomena!) How do we connect network dynamics and computational function? Maybe: predictive information is maximal at critical points ...
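As a closing footnote, the entropy bookkeeping in the four-letter-word analysis above is simple arithmetic to verify; the measured entropies (14.59, 7.42, and 6.23 bits) are taken from the slides, and only the uniform-letter baseline is computed.

```python
from math import log2

S_uniform = 4 * log2(26)          # four positions, all 26 letters equally likely
S_indep   = 14.59                 # independent letters, true frequencies (slide value)
S_actual  = 7.42                  # entropy of the actual word distribution (slide value)
S_model   = 6.23                  # structure captured by the pairwise model (slide value)

multi_info = S_indep - S_actual   # total correlation structure, in bits
fraction   = S_model / multi_info # share captured by pairwise correlations

print(round(S_uniform, 2), round(multi_info, 2), round(fraction, 2))
# 18.8 7.17 0.87
```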