Neural Information Coding: Biological Foundations, Analysis Methods and Models
Stefano Panzeri & Alberto Mazzoni
Italian Institute of Technology, Genova, Italy
stefano.panzeri@iit.it – alberto.mazzoni@iit.it

Contributors: Nikos Logothetis, Christoph Kayser, Yusuke Murayama, Cesare Magri, Robin Ince, Nicolas Brunel

Neural Coding 1: biological foundations

Neuronal coding
• How do neurons encode and transmit sensory information?
• What is the language ("code") used by neurons to transmit information to one another?
• Why is it important for BMI to understand how to decode a neural signal?
Nicolelis and Lebedev (2009) Nature Reviews Neuroscience

What is a code?
A code is a transformation of a message into another alphabet. For example, you can use your fingers to represent numbers:
• Using only the thumb, you can encode two values (0 or 1).
• Two fingers can encode 3 different numbers if only the total number of extended fingers is counted.
• Two fingers with a more complex code can encode 4 different numbers: here the position of the extended finger also carries a signal, which increases the capacity to encode information.
There are only 10 kinds of people in the world: those who know binary code and those who don't.

Morse code
Telegrams were transmitted as electrical signals over wires using Morse code. Morse code is a correspondence between patterns of electrical signals and the characters of the English alphabet. There are two types of "symbols": long (_) and short (.) electrical pulses. Not only the length of each signal is important, but also the timing between signals (for example, longer gaps separate characters and words).

The neuronal code is a sequence of spikes
• Somatic electrode: subthreshold membrane potential plus action potentials (spikes)
• Extracellular electrode
• Axonal electrode: subthreshold membrane potentials are attenuated, and only spikes propagate over long distances
The synaptic vesicles in the axon terminal release neurotransmitters only when an action potential arrives from the presynaptic neuron. Thus, neurons communicate information only through action potentials, not through subthreshold membrane fluctuations, and the neuronal code consists of a time series of stereotyped action potentials.

...but from a BMI perspective, the code can also be LFP, spikes, ECoG, EEG, or fMRI.

Spike trains encode stimuli

Single neuron variability
[Figure: spike trains of the same neuron on Trials 1-4, showing trial-to-trial variability]

Decoding brain activity
Arieli et al (1996) Science
The response in an individual trial can be modeled as the sum of two components: the reproducible response and the ongoing network fluctuations. The effect of a stimulus may be likened to the additional ripples caused by throwing a stone into an already wavy sea.

Response to multiple presentations of the same movie clip
[Figure: single-unit activity and local field potential over 10 s, across repeated presentations of the same clip]
The code must be composed of features that are DIFFERENT from scene to scene and SIMILAR across trials.

The dictionary for the neural code is noisy: the same stimulus s(t) can elicit different responses r(t) on different trials. Thus, the neuronal dictionary is probabilistic!
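The probabilistic stimulus-response dictionary can be made concrete with a short simulation. This is a minimal sketch of my own, not part of the lecture material: it assumes, purely for illustration, that the spike count evoked by each stimulus is Poisson-distributed around a made-up mean rate, and it shows that repeated presentations of the same stimulus yield different responses.

```python
# Minimal sketch (illustration only): single-trial spike counts drawn from a
# stimulus-dependent distribution P(r|s), here Poisson with made-up mean rates.
import numpy as np

rng = np.random.default_rng(0)
mean_count = {"stim_A": 10.0, "stim_B": 4.0}    # reproducible component (mean spikes per trial)

for stim, rate in mean_count.items():
    counts = rng.poisson(rate, size=5)          # trial-to-trial ("ongoing") variability
    print(stim, counts)
# The same stimulus gives different counts on different trials: the mapping
# from stimulus to response is a probability distribution, not a lookup table.
```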
Probabilistic dictionary
[Figure: probability distributions of barometer readings (inches Hg) when it rains and when it does not rain; analogously, distributions of neural responses (spikes/s) to different stimuli]

Neural Coding 2: analysis methods

Single-trial analysis toolboxes (Information Theory and/or Decoding)
• Decoding: predicting the most likely external correlate from neural activity.
• Information: quantifies the total knowledge (in units of bits) about external correlates that can be gained from neural activity (1 bit = doubling the knowledge).
Quian Quiroga & Panzeri (2009) Nature Reviews Neuroscience

What is Information Theory?
Information theory is a branch of mathematics that deals with measures of information and their application to the study of communication, statistics, and complexity. It originally arose out of communication theory and is sometimes used to mean the mathematical theory that underlies communication systems and communication in the presence of noise. It is based on the pioneering work of Claude Shannon.
• Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. Journal 27:379-423, 623-656.
• Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory. John Wiley.
• Latham, P. E. and Roudi, Y. (2008). Mutual Information. Scholarpedia article (freely accessible).
• Quian Quiroga, R. and Panzeri, S. (2009). Extracting information from neuronal populations: information theory and decoding approaches. Nature Reviews Neuroscience.

Entropy of a random variable S
Suppose we want to gain information about the stimulus. If there is only one possible stimulus, then no information is gained by knowing that stimulus: the amount of information obtained from knowing something is related to its INITIAL UNCERTAINTY. Therefore the first step in defining information is to quantify uncertainty. The entropy H of a random variable S is a simple function of its probability distribution P(s):

H(S) = -\sum_s P(s) \log_2 P(s)

H(S) measures the amount of uncertainty inherent in a random variable, i.e. the amount of information necessary to describe it completely. For the 24 equiprobable faces of Guess Who:

H(\text{Guess Who}) = -\sum_{s=1}^{24} \frac{1}{24} \log_2 \frac{1}{24} = \log_2 24 \approx 4.585 \text{ bits}

Warning about Guess Who
I will use Guess Who as the simplest (and silliest) application of the information equations. But there is no noise and no biological variability in the game. Decoding neural activity is more like playing a version of the game in which the faces are probabilistic (X has a hat only a fraction P(X) of the times you look), and you are playing against someone who lies and who talks in a language you don't know very well, over a noisy Skype conference.

Entropy and description length
H(S) has a very concrete interpretation. Suppose s is chosen randomly from the distribution P(s), and someone who knows the distribution is asked to guess which value was chosen. If the guesser uses the optimal question-asking strategy -- which is to divide the probability in half on each guess by asking questions like "is s greater than ...?" -- then the average number of yes/no questions it takes to guess lies between H(S) and H(S)+1. This gives quantitative meaning to "uncertainty": it is the number of yes/no questions it takes to guess a random variable, given knowledge of the underlying distribution and using the optimal question-asking strategy.
H(Guess Who) = 4.585, so you really should win within 5 questions... Demonstration is left to the audience.
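As a minimal illustration of the entropy formula above (a sketch of my own in Python/NumPy, not part of the original material), the 24 equiprobable Guess Who faces give H ≈ 4.585 bits:

```python
# Minimal sketch: plug-in entropy H(S) = -sum_s P(s) log2 P(s), in bits.
import numpy as np

def entropy(p):
    """Entropy (bits) of a probability vector; zero-probability entries are ignored."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_faces = np.full(24, 1 / 24)    # uniform prior over the 24 Guess Who faces
print(entropy(p_faces))          # ~4.585 bits -> about 5 yes/no questions
```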
Entropy of the stimulus distribution

H(S) = -\sum_s P(s) \log_2 P(s)

H(S) expresses the uncertainty about the stimulus (or, equivalently, the information needed to specify the stimulus) prior to the observation of a neural response.

Residual entropy (equivocation) of the stimulus after observing the neural response

H(S|R) = -\sum_r P(r) \sum_s P(s|r) \log_2 P(s|r)

H(S|R) expresses the residual uncertainty about the stimulus after the observation of a neural response.
Guess Who example: "Are you a woman?" P(yes) = 19/24; P(no) = 5/24; P(face|yes) = 0 or 1/19; P(face|no) = 0 or 1/5.
H(Guess Who|woman) = 3.97, so under the optimal strategy at most 4 questions are left.

Shannon's Information

I(S;R) = H(S) - H(S|R) = H(R) - H(R|S)

Shannon's information quantifies the average reduction of uncertainty (= information gained) after a single-trial observation of the neural response. It is measured in bits: 1 bit = a reduction of uncertainty by a factor of 2 (like a correct answer to a yes/no question).
I(Guess Who|woman) = 4.58 - 3.97 = 0.61 bits; I(Real World|woman) = 1 bit.

Information: key equations

I(S;R) = H(S) - H(S|R) = H(R) - H(R|S)

I(S;R) = \sum_s P(s) \sum_r P(r|s) \log_2 \frac{P(r|s)}{P(r)}

Example: spike counts R over four trials of two stimuli.

R         Stim 1   Stim 2
Trial 1     10       5
Trial 2      9       3
Trial 3      8       5
Trial 4     10       4
I(S;R) = 1 bit

R         Stim 1   Stim 2
Trial 1      7       5
Trial 2      5       3
Trial 3      8       5
Trial 4      4       4
I(S;R) = 0.045 bit

Information analysis with dynamic stimuli
[Figure: a long dynamic stimulus divided into 1-s sections s_A, s_B, ...; the distributions P(r|s_A), P(r|s_B) and P(r) are estimated across trials]

I(S;R) = \sum_s P(s) \sum_r P(r|s) \log_2 \frac{P(r|s)}{P(r)}

This quantifies information about which section of the dynamic stimulus elicited the considered neural response. Since this procedure does not make assumptions about which movie feature is being encoded, it quantifies information about all possible visual attributes in the movie.

Data processing inequality

I(S; f(R)) \le I(S; R)

This means that you will not increase information by decimating, squaring, or taking the logarithm of your data. It does not mean, however, that you cannot filter off noise (or extract spikes from raw data): the comparison between I(S; R1 + R2) and I(S; R1) is not constrained by the inequality, because R1 + R2 is not a deterministic function f(R1).

Combining different responses

I(S; R_1, R_2) = \sum_s P(s) \sum_{r_1, r_2} P(r_1, r_2|s) \log_2 \frac{P(r_1, r_2|s)}{P(r_1, r_2)}

R1        Stim 1   Stim 2   Stim 3
Trial 1     10       5        4
Trial 2      9       3        5
Trial 3      8       5        5
Trial 4     10       4        3

R2        Stim 1   Stim 2   Stim 3
Trial 1      3       1        9
Trial 2      1       0        8
Trial 3      0       3        8
Trial 4      2       2        9

I(S; R_1, R_2) \ge \max( I(S; R_1), I(S; R_2) )

Redundancy

RED(S; R_1, R_2) = I(S; R_1) + I(S; R_2) - I(S; R_1, R_2)

R1        Stim 1   Stim 2   Stim 3
Trial 1     10       5        7
Trial 2      9       3        5
Trial 3      8       5        6
Trial 4     10       4        7

R2        Stim 1   Stim 2   Stim 3
Trial 1      3       1        9
Trial 2      1       0        8
Trial 3      0       3        8
Trial 4      2       2        9

In this example, Redundancy > 0.
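As a concrete illustration of the key equations above, here is a minimal sketch (my own, not from the lecture) of the plug-in estimate of I(S;R), applied to the first two-stimulus toy table shown earlier, which indeed yields 1 bit. The helper name `plugin_information` is my own choice; the sketch assumes equiprobable stimuli with equal numbers of trials.

```python
# Minimal sketch of the plug-in estimate of I(S;R) from single-trial spike counts,
# assuming equiprobable stimuli with equal numbers of trials per stimulus.
import numpy as np
from collections import Counter

def plugin_information(responses_by_stim):
    """responses_by_stim: dict mapping each stimulus to its list of single-trial responses."""
    p_s = 1.0 / len(responses_by_stim)                         # equiprobable stimuli
    all_r = [r for rs in responses_by_stim.values() for r in rs]
    p_r = {r: c / len(all_r) for r, c in Counter(all_r).items()}
    info = 0.0
    for rs in responses_by_stim.values():
        p_r_given_s = {r: c / len(rs) for r, c in Counter(rs).items()}
        info += sum(p_s * p * np.log2(p / p_r[r]) for r, p in p_r_given_s.items())
    return info

table_left = {"stim1": [10, 9, 8, 10], "stim2": [5, 3, 5, 4]}
print(plugin_information(table_left))                          # 1.0 bit: responses never overlap
```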
Decoding (stimulus reconstruction)
[Figure: a presented stimulus s elicits a response r, which is decoded into a predicted stimulus s_p]
Quian Quiroga & Panzeri (2009) Nature Reviews Neuroscience

Decoding algorithm using Bayes' rule

P(s'|r) = \frac{P(s') P(r|s')}{P(r)}

To predict the stimulus s_p that generated a given response r, we can choose the stimulus that maximizes the posterior probability of having caused r:

s_p = \arg\max_{s'} P(s'|r)

Decoding using clustering algorithms
Divide the response space into regions; when the response falls in a given region, assign it to a stimulus. Optimal boundaries are difficult to find, ask Iannis....

Decoding using neural networks
Kjaer et al (1994) J Comput Neurosci

Information in the confusion matrix

I(S; S^p) = \sum_{s, s^p} P(s, s^p) \log_2 \frac{P(s, s^p)}{P(s) P(s^p)}

I(S;S^p) quantifies the information that can be gained by a receiver that knows the true P(s|r) but only uses the information about which is the most likely stimulus (rows of the confusion matrix: presented stimulus s; columns: predicted stimulus s^p). Because of the data processing inequality:

I(S; S^p) \le I(S; R)

Advantages of Shannon's information
• It takes into account all possible ways in which a variable can carry information about another variable.
• It is more complete than simply decoding (it quantifies all sources of knowledge).
• It makes no assumption about the relation between stimulus and response (we do not need to specify which features of the stimulus activate the neuron and how they activate it).
Quian Quiroga & Panzeri (2009) Nature Reviews Neuroscience

Neurons may convey information in ways other than reporting the most likely stimulus
If the receiver only uses the information about which is the most likely stimulus, it may miss out on important information even if it operates on the true P(s|r).
Quian Quiroga & Panzeri (2009) Nature Reviews Neuroscience

Advantages of information theory
• It quantifies single-trial stimulus discriminability on a meaningful scale (bits).
• It is a principled measure of the correlation between neuronal responses and stimuli on a single-trial basis.
• It works even in non-linear situations.
• It makes no assumption about the relation between stimulus and response – we do not need to specify which features of the stimulus activate the neuron and how they activate it.
• Since it takes into account all ways in which a neuron can reduce ignorance, it bounds the performance of any biological decoder. Thus it can be used to explore which spike-train features are best for stimulus decoding, without the limitations that come from committing to a specific decoding algorithm.

Advantages of decoding techniques
• Decoding algorithms are much easier to implement and compute.
• Calculations are robust even with limited amounts of data.
• They allow testing specific hypotheses on how downstream systems may interpret the messages of other neurons.
• They have a more direct application to BMI.

Neural Coding 3: BIAS

The "plug-in" information estimation
The most direct way to compute information and entropies is the "plug-in" method: estimate the response probabilities P(r|s) and P(r) as the experimental histograms of the frequency of each response across the available trials, and then plug these empirical probability estimates into the entropy equations, e.g.

H(S|R) = -\sum_r P(r) \sum_s P(s|r) \log_2 P(s|r)

The limited sampling bias
The problem with the plug-in estimation is that these probabilities are not known but have to be measured experimentally from the available data. The estimated probabilities are subject to statistical error and necessarily fluctuate around their true values. These finite-sampling fluctuations lead to both a systematic error (bias) and a statistical error (variance) in the estimates of entropies and information. These errors, particularly the bias, constitute the key practical problem in the use of information with neural data. If not corrected, the bias can lead to serious misinterpretations of neural coding data.
Panzeri, S., Senatore, R., Montemurro, M. A., and Petersen, R. S. (2007). Correcting for the sampling bias problem in spike train information measures. J Neurophysiol 98, 1064-1072.
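To see the limited sampling bias at work, here is a minimal sketch (my own illustration, not from the lecture) that reuses the toy `plugin_information` helper defined in the earlier sketch: the two "stimuli" produce spike counts drawn from exactly the same distribution, so the true information is 0 bits, yet the plug-in estimate computed on a few tens of trials comes out clearly positive.

```python
# Minimal sketch: the limited-sampling bias of the plug-in estimator.
# Both stimuli produce counts from the SAME Poisson distribution, so the true
# I(S;R) is exactly 0 bits; the plug-in estimate is nonetheless > 0.
# Reuses the toy `plugin_information` function defined in the earlier sketch.
import numpy as np

rng = np.random.default_rng(1)
n_trials = 20                                           # trials per stimulus
fake_data = {"stim1": list(rng.poisson(5, n_trials)),
             "stim2": list(rng.poisson(5, n_trials))}   # identical rates: no real coding
print(plugin_information(fake_data))                    # positive value = pure sampling bias
```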
Review paper
S. Panzeri, C. Magri, L. Carraro (2008). Sampling bias. Scholarpedia article (freely accessible).

Statistics of information values: bias and variance due to limited sampling
The bias of a given functional F of a probability distribution P is defined as the difference between the trial-averaged value of F when the probability distributions are computed from N trials only and the value of F computed with the true probability distributions (obtained from an infinite number of trials).
Ince et al (2010) Frontiers Neurosci; Panzeri et al (2007) J Neurophysiol

The limited sampling bias
[Figure: two panels illustrating the limited sampling bias]

The information bias: asymptotic expansion

\mathrm{Bias}(I) = \langle I(P_N) \rangle_N - I(P) = \frac{C_1}{N} + \frac{C_2}{N^2} + \ldots

This expansion converges only if N is large (large enough that each response with non-zero probability is observed several times). Note: all coefficients are non-negative.
Miller 1955; Treves & Panzeri 1995; Panzeri & Treves, Network 1996; Victor, Neural Comp. 2000; Paninski, Neural Comp. 2003

The information bias: leading coefficient
For the entropies

H(R|S) = -\sum_s P(s) \sum_r P(r|s) \log_2 P(r|s)    and    H(R) = -\sum_r P(r) \log_2 P(r)

the leading-order biases are

\mathrm{Bias}(H(R|S)) = -\frac{1}{2N \ln 2} \sum_s (R_s - 1)        \mathrm{Bias}(H(R)) = -\frac{R - 1}{2N \ln 2}

where R_s is the size of the support of P(r|s), i.e. the number of responses with probability > 0 of being observed upon presentation of s, and R is the number of responses with probability > 0 of being observed upon presentation of any stimulus. Entropies are negatively biased, and H(R|S) is far more biased than H(R).

Procedures to alleviate or eliminate the bias
• Bootstrapping and subtracting
• Quadratic extrapolation
• Panzeri-Treves correction
• Shuffling (for 2 dimensions or more)

The bootstrapping-and-subtracting method
One way to estimate the bias is to randomly "bootstrap" stimulus-response data pairs, i.e. to re-pair the responses with randomly chosen stimulus labels. The resulting information I_boot should be zero; in practice, because of finite sampling, we find I_boot > 0. The mean of the distribution of I_boot is equal to the bias of I_boot.

Original data:
          Stim 1   Stim 2   Stim 3
Trial 1      4        3        1
Trial 2      3        2        1
Trial 3      3        2        1
Trial 4      3        3        2
Trial 5      5        2        1
Tot         18       12        6

Bootstrapped data:
          Stim 1   Stim 2   Stim 3
Trial 1      1        3        4
Trial 2      3        1        2
Trial 3      1        3        2
Trial 4      3        3        2
Trial 5      1        5        2
Tot          9       15       12

However, the leading-order bias of I_boot differs from that of I:

\mathrm{Bias}(I) = \frac{1}{2N \ln 2} \left[ \sum_s (R_s - 1) - (R - 1) \right]

\mathrm{Bias}(I_{boot}) = \frac{1}{2N \ln 2} \left[ \sum_s (R_s^{boot} - 1) - (R_{boot} - 1) \right]

Bootstrapping increases the support of P(r,s), so the bias of I_boot is typically higher than the bias of I.
[Figure: support of P(r,s) for the real and the bootstrapped data]

The quadratic extrapolation

I_N = I_{true} + \frac{C_1}{N} + \frac{C_2}{N^2} + \ldots

-) Collect N data.
-) Divide them into blocks of N/2, N/4, ...
-) Compute the average information for N, N/2, N/4, ... data, and fit the dependence of I on N to the above quadratic expression.
-) Estimate the true (N = ∞) value from the best-fit expression (a sketch follows below).
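Below is a minimal sketch of the quadratic-extrapolation recipe just listed. It is my own illustration, not the toolbox implementation; it reuses the toy `plugin_information` helper from the earlier sketch and, as a usage example, the zero-information `fake_data` from the previous sketch.

```python
# Minimal sketch of the quadratic-extrapolation recipe: fit the plug-in
# information computed on the full data and on random halves and quarters to
# I_N = I_true + C1/N + C2/N^2, and read off the N -> infinity intercept.
# Reuses the toy `plugin_information` function defined in the earlier sketch.
import numpy as np

def quadratic_extrapolation(responses_by_stim, seed=0):
    rng = np.random.default_rng(seed)
    n_total = sum(len(v) for v in responses_by_stim.values())
    inv_n, info = [], []
    for f in (1, 2, 4):                                    # full data, halves, quarters
        perms = {s: rng.permutation(rs) for s, rs in responses_by_stim.items()}
        blocks = [{s: list(p[b::f]) for s, p in perms.items()} for b in range(f)]
        info.append(np.mean([plugin_information(b) for b in blocks]))
        inv_n.append(f / n_total)                          # effective 1/N for this block size
    coeffs = np.polyfit(inv_n, info, 2)                    # I = C2*(1/N)^2 + C1*(1/N) + I_true
    return coeffs[-1]                                      # constant term: extrapolated I_true

print(quadratic_extrapolation(fake_data))                  # typically much closer to 0 bits
```

With only three block sizes the quadratic fit is exact, so the quality of the correction rests entirely on how well the bias actually follows the 1/N expansion at these sample sizes.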
The Panzeri-Treves bias estimate

Correlation and bias
With multi-dimensional codes, there is extra noise added by the correlations between the responses, plus very bad sampling, because the number of parameters scales exponentially with the number of response dimensions L.

The shuffling procedure:
1) Compute H_ind(R|S), using probabilities that treat each response dimension as independent.
2) Compute H_sh(R|S): shuffle the trial order randomly and independently in each dimension, at fixed time and stimulus condition, and then compute the entropy of the shuffled distribution.

[Tables: example two-dimensional responses (R1, R2) for Stim 1 and Stim 2, before and after shuffling the trial order within each stimulus]

The shuffled information estimate is

I_{sh}(S;R) = H(R) - H_{ind}(R|S) - \left[ H_{sh}(R|S) - H(R|S) \right]

The shuffling subtraction method

\mathrm{Bias}(I) = \frac{1}{2N \ln 2} \left[ \sum_s (R_s - 1) - (R - 1) \right]

\mathrm{Bias}(I_{sh}) = \frac{1}{2N \ln 2} \left[ \sum_s (R_s^{sh} - 1) - (R_{sh} - 1) \right]

Shuffling increases the support of P(r,s), so the bias of I_sh is typically higher than the bias of I.
[Figure: support of P(r,s) for the real and the shuffled data]

Sampling rules of thumb for I(S;R)
The previous results suggest rules of thumb for the data sampling (i.e. the number of trials per stimulus N_s) necessary to obtain unbiased information estimates:
• If no bias correction is used: N_s >> number of possible responses R.
• If an appropriate bias correction is used, the constraint on the data becomes much milder: N_s ≥ number of possible responses R.

Submitted last week... we are always trying to improve!

Toolbox: www.ibtb.org
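As a closing illustration, here is a minimal sketch of the shuffled estimator I_sh defined above, for a two-dimensional response. This is my own illustration under simplifying assumptions (equiprobable stimuli, equal trial counts, made-up data), not code from the ibtb toolbox.

```python
# Minimal sketch of the shuffled estimator for a two-dimensional response:
#   I_sh(S;R) = H(R) - H_ind(R|S) - [ H_sh(R|S) - H(R|S) ]
# assuming equiprobable stimuli with equal numbers of trials; data are made-up.
import numpy as np
from collections import Counter

def H(samples):
    """Plug-in entropy (bits) of a list of hashable samples."""
    n = len(samples)
    return -sum((c / n) * np.log2(c / n) for c in Counter(samples).values())

def shuffled_information(data, seed=0):
    """data: dict mapping each stimulus to a list of (r1, r2) response pairs."""
    rng = np.random.default_rng(seed)
    H_R = H([r for rs in data.values() for r in rs])                    # H(R)
    H_R_S = np.mean([H(rs) for rs in data.values()])                    # H(R|S)
    H_ind = np.mean([H([r1 for r1, _ in rs]) + H([r2 for _, r2 in rs])
                     for rs in data.values()])                          # H_ind(R|S)
    H_sh = np.mean([H(list(zip(rng.permutation([r1 for r1, _ in rs]),
                               rng.permutation([r2 for _, r2 in rs]))))
                    for rs in data.values()])                           # H_sh(R|S)
    return H_R - H_ind - (H_sh - H_R_S)

data = {"stim1": [(10, 3), (9, 4), (8, 5)],
        "stim2": [(7, 4), (6, 3), (8, 5)]}
print(shuffled_information(data))
```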