Neural encoding of sound stimuli
A mutual information analysis of EEG signals

Date: 06/08/2012
Author: Adrian Radillo, Intern at the Institute of Digital Healthcare
Supervisor: Dr. James Harte

IDH – Project Report: Measuring the brain's response to sound stimuli

Contents

Introduction
1 Computing mutual information from EEG signals
  1.1 Mutual information
  1.2 EEG signal recorded in response to sound stimuli
  1.3 Common meanings of the mutual information in neuroscience
  1.4 Formalism
    1.4.1 Stimuli
    1.4.2 Response
    1.4.3 Empirical data and probability estimates
    1.4.4 Calculations
2 Two examples
  2.1 Magri et al., 2009
  2.2 Cogan & Poeppel, 2011
3 Workflow
4 Interpreting mutual information for sensory neuroscience
5 Another use of mutual information: consonant perception
  5.1 Variables
  5.2 Experiment
  5.3 Data analysis
  5.4 Results
  5.5 Personal remarks
Conclusion
Bibliography
Appendix A: Information Theory
Appendix B: Information Breakdown toolbox
Appendix C: Matlab codes
Introduction

Our project's aim is to answer the following question: how can the notion of mutual information be used to interpret EEG data in studies of the auditory system? After defining the notion of mutual information and providing some background on EEG signals, we present the accepted theoretical framework that makes use of mutual information in auditory neuroscience. In particular, we present two studies (Magri, Whittingstall, Singh, Logothetis, & Panzeri, 2009; Cogan & Poeppel, 2011) and their common workflow. Then we look at another use of mutual information, concerning consonant perception. Finally, we discuss substantive questions about mutual information, such as its scope and limitations in the neuroscience of the auditory system.

1 Computing mutual information from EEG signals

1.1 Mutual information

This section gives only a condensed account of mutual information; the reader is invited to consult Appendix A for a detailed treatment. Let $X$ and $Y$ be two discrete random variables with finite range. The mutual information between these two random variables is denoted $I(X, Y)$ and represents "the decrease of uncertainty [of $Y$] due to the knowledge of $X$, or as the information about $Y$ which can be gained from the value of $X$" (Rényi, 2007, p. 558). Moreover, mutual information is a symmetric quantity: "the value of $Y$ gives the same amount of information about $X$ as the value of $X$ gives about $Y$" (Rényi, 2007, p. 558).

1.2 EEG signal recorded in response to sound stimuli

An EEG signal typically consists of time-varying voltages recorded at several electrodes (or channels) in parallel. Several discrete and finite quantities can be extracted from each of these channels; common variables are the amplitude, the frequency and the phase of the signal. As far as frequency is concerned, it is common to distinguish the following bands: delta (0.5–4 Hz), theta (4–7.5 Hz), alpha (8–13 Hz) and beta (14–26 Hz) (Sanei & Chambers, 2007, pp. 10–11). In what follows, we pay special attention to the phase in the delta and theta bands, as it has been shown to "play an important role in auditory perception" (Cogan & Poeppel, 2011, p. 554).

1.3 Common meanings of the mutual information in neuroscience

For a finite, discrete-valued stimulus S and a corresponding finite, discrete-valued neural response R, the quantity I(S,R), the mutual information of the stimulus and the response, is commonly given the following meanings in the literature:

- "Responses that are informative about the identity of the stimulus should exhibit larger variability for trials involving different stimuli than for trials that use the same stimulus repetitively. Mutual information is an entropy-based measure related to this idea." (Dayan & Abbott, 2001, p. 126)
- "The mutual information measures how tightly neural responses correspond to stimuli and gives an upper bound on the number of stimulus patterns that can be discriminated by observing the neural responses." (Schneidman, Bialek, & Berry II, 2003, p. 11540)

- "Mutual information quantifies how much of the information capacity provided by stimulus-evoked differences in neural activity is robust to the presence of trial-by-trial response variability. Alternatively, it quantifies the reduction of uncertainty about the stimulus that can be gained from observation of a single trial of the neural response." (Magri et al., 2009, p. 3)

We discuss these interpretations in section 4. For the moment, we describe the framework and formalism used in the aforementioned studies.

1.4 Formalism

In what follows, we present, with our own notation, the formalism used in (Cogan & Poeppel, 2011; Dayan & Abbott, 2001; Magri et al., 2009; Schneidman et al., 2003). The formalism is applied here to a hypothetical experiment in which a single subject is presented with the same sequence of stimuli over a fixed number of trials, all other parameters being controlled. The response is recorded, for each trial, over a fixed time window locked to the stimulus presentation.

1.4.1 Stimuli

The stimuli are the same at each trial and take on $n$ distinct values over $n$ timesteps. This is a strong assumption made by (Cogan & Poeppel, 2011; Magri et al., 2009) which allows them (and us, in this section) to talk interchangeably about timesteps, samples or stimuli. We therefore consider the uniformly distributed discrete random variable $S$ with finite range $\{s_1, s_2, \ldots, s_n\}$ ($n \in \mathbb{N}$). For a given trial we have[1]:

$$p_t := P(S = s_t) = \frac{1}{n} \qquad (t = 1, 2, \ldots, n)$$

[1] The symbol $:=$ denotes a definition.

1.4.2 Response

If there are $tr$ trials, we consider the response for the $i$-th trial at the $t$-th timestep ($t = 1, 2, \ldots, n$) to be an EEG signal recorded over $ch$ channels (or electrodes) in parallel. We write:

$$R^i(t) = \big(R_1^i(t), R_2^i(t), \ldots, R_{ch}^i(t)\big) \qquad (i = 1, 2, \ldots, tr)$$

Since the response is assumed to be discrete, each response value at a given trial, for a given electrode and a given sample, takes on one of $b$ distinct values. In mathematical notation:

$$\exists\, \mathfrak{R} := \{\alpha_1, \alpha_2, \ldots, \alpha_b\} \text{ such that } R_c^i(t) \in \mathfrak{R} \quad \forall i \in \{1, \ldots, tr\},\ \forall c \in \{1, \ldots, ch\},\ \forall t \in \{1, \ldots, n\}$$

The responses are recorded repeatedly over the $tr$ trials in order to estimate the probabilities of the ideal response $R$. We explain below how the probability distribution of this $ch$-dimensional random variable can be estimated from the empirical responses $R^i(t)$.

1.4.3 Empirical data and probability estimates

The empirical response recorded at trial $i$ and timestep $t$ is in fact a response to the presentation of the stimulus $s_t$, which makes the empirical data ideally suited to estimating conditional probabilities. More precisely, for any vector $C \in \mathfrak{R}^{ch}$ in the response's range, we define $ctr^C(t)$ to be the number of trials in which the value $C$ is observed at time $t$.
The estimated[2] conditional probability of the neural response $R$ taking on the value $C$ given the stimulus $s_t$ is:

$$P(R = C \mid s_t) := p_{C|s_t} := \frac{ctr^C(t)}{tr}$$

The probability of the response taking on the value $C$ is then deduced from basic probability rules:

$$\forall C \in \mathfrak{R}^{ch}: \quad P(R = C) := p_C := \sum_{t=1}^{n} p_{C|s_t}\, p_t$$

Assuming that the range of the response $R$ has $z$ distinct elements, the last formula allows us to compute the probability distribution of the response, $\mathcal{P} := \{p_1, p_2, \ldots, p_z\}$. We also define the conditional distributions $\mathcal{P}_t := \{p_{1|s_t}, p_{2|s_t}, \ldots, p_{z|s_t}\}$, for $t = 1, 2, \ldots, n$, which are used in the next section.

1.4.4 Calculations

It is now straightforward to compute the entropy of the response, the entropy of the response for a fixed stimulus, and the mutual information[3] between the stimulus and the response. The formulas are, respectively:

$$H(R) = I(\mathcal{P}) = -\sum_{j=1}^{z} p_j \log_2 p_j$$

$$H(R|S) = \sum_{t=1}^{n} p_t\, I(\mathcal{P}_t) = -\frac{1}{n} \sum_{t=1}^{n} \sum_{j=1}^{z} p_{j|s_t} \log_2\!\big(p_{j|s_t}\big)$$

$$I(S, R) = H(R) - H(R|S)$$

[2] See Appendix B for more details about the bias involved in this estimation and the known correction techniques.
[3] See (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997, note 20) for computing error margins on the mutual information estimate.
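To make these formulas concrete, here is a minimal Matlab sketch of this "plug-in" (naïve) estimator for a single channel ($ch = 1$). The variable names (Rmat, Pcond, etc.) are ours, and no bias correction is applied; for that, see Appendix B.

% Plug-in estimate of I(S,R) for one channel.
% Rmat is a tr x n matrix: Rmat(i,t) is the binned response (an integer
% in 1..b) on trial i at timestep t; timestep t plays the role of s_t.
[tr, n] = size(Rmat);
b = max(Rmat(:));                        % number of response values used

Pcond = zeros(b, n);                     % Pcond(j,t) estimates P(R = alpha_j | s_t)
for t = 1:n
    Pcond(:, t) = histc(Rmat(:, t), 1:b) / tr;
end
Pr = mean(Pcond, 2);                     % p_C, since p_t = 1/n for all t

h      = @(p) -sum(p(p > 0) .* log2(p(p > 0)));     % entropy of a distribution
HR     = h(Pr);                                     % H(R)
HRgivS = mean(arrayfun(@(t) h(Pcond(:, t)), 1:n));  % H(R|S), uniform stimuli
I_SR   = HR - HRgivS;                               % I(S,R), in bits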
2 Two examples

2.1 Magri et al., 2009

Considering "EEGs recorded from a male volunteer with a 64-channel electrode cap while the subject was fixating the screen during the repeated presentation of a 5 s-long naturalistic color movie presented binocularly"[4] (Magri et al., 2009, p. 17), the authors investigated "which frequency bands, which signal features (phase or amplitude), and which electrode locations better encoded the visual features present in movies with naturalistic dynamics" (Magri et al., 2009, p. 17). With the notation introduced in section 1.4, the authors defined the variables as follows:

- $n = 625$; $tr = 31$; $ch = 1$[5]; $b = 4$
- $\mathfrak{R}_{phase} = \{ [-\pi, -\pi/2),\ [-\pi/2, 0),\ [0, \pi/2),\ [\pi/2, \pi] \}$, in radians
- $\mathfrak{R}_{power} = \{ [0, 18),\ [18, 44),\ [44, 84),\ [84, 3097] \}$, in $\mu V^2$

Given this, they computed $I(S, R)$ for four frequency bands of the signal (delta, theta, alpha and beta), for the 64 electrodes, and for the two signal features (phase and power), looking for the frequency bands, channels and signal feature with the highest mutual information values. They "found that only low frequency ranges (delta) were informative, and that phase was far more informative than power at all electrodes" (Magri et al., 2009, pp. 18–20). Moreover, the most informative channel was electrode PO8.

[4] See (Magri, Whittingstall, Singh, Logothetis, & Panzeri, 2009) for a full description of the experiment.
[5] Note that the authors perform a mutual information analysis on each channel individually.

2.2 Cogan & Poeppel, 2011

We present here only an outline of the study by (Cogan & Poeppel, 2011). The authors were interested in determining:

- whether the Delta, Theta-low and Theta-high subbands of an MEG[6] signal of a subject listening to spoken sentences carry more information about the stimulus (mutual information) than higher frequency bands;
- whether the speech information processed in these three subbands is independent or not (in the latter case it would be redundant).

We present below only the part of the study dealing with the first point. The authors presented three different recorded English sentences, 32 times each, to 9 subjects undergoing MEG recording through 157 channels[7]. The first 11 seconds of each presentation were analysed. With the notation introduced in section 1.4, the authors defined the variables as follows:

- $n = 2{,}750$; $tr = 32$; $ch = 1$[8]; $b = 4$
- $\mathfrak{R} = \{ [-\pi, -\pi/2),\ [-\pi/2, 0),\ [0, \pi/2),\ [\pi/2, \pi] \}$, in radians

Given this, they computed $I(S, R)$ for 24 different frequency bands of the signal (ranging from 0 Hz to 50 Hz), for the 157 electrodes, and for the 3 sentences. A general workflow similar to the one they applied to the MEG signal is described in section 3. Among the results of this study:

- Across all subjects, trials and sentences, the mutual information is higher for the three subbands Delta, Theta-low and Theta-high than for higher frequency bands.
- The highest mutual information values in these three subbands appeared on electrodes corresponding to "auditory regions in superior temporal cortex" (Cogan & Poeppel, 2011, p. 558).

[6] Although this study was not performed with EEG data, the methodology is still relevant for it.
[7] See (Cogan & Poeppel, 2011) for full details of the experimental setting. See also section 3 for the schematic workflow that the authors followed to process their data.
[8] Note that the authors perform a mutual information analysis on each channel individually.

3 Workflow

Beyond the theoretical results of the previous studies, we analysed their workflow for processing the EEG/MEG signals. Since the two workflows are similar in many respects, Figure 1 presents a standard workflow whose steps are common to both studies. The reader is invited to consult Appendix C for a detailed Matlab script implementing this workflow. We now comment on its steps.

0 – An EEG signal $x(t)$ is recorded at a fixed sampling frequency over a time window $\Delta t$ following the presentation of a stimulus[9], over $ch$ channels, $tr$ trials and $N$ subjects. The workflow of Figure 1 is applied repeatedly over all channels, trials and subjects.

1 – The signal $x(t)$ is filtered with Finite Impulse Response (FIR) filters into the frequency bands of interest: $x_{band1}(t)$, $x_{band2}(t)$, etc. It is important that the filtering not affect the instantaneous phase of the signal, since the final computations involve phase values.

2 – If necessary, the signal is down-sampled. The main effect is to reduce the size of the data and thereby increase computational speed.

3 – It is also likely that only a specific time window, smaller than $\Delta t$, is of interest for the analysis; the unnecessary data is cut at this stage.

4 – The instantaneous phase (between $-\pi$ and $\pi$) is extracted from the analytic signal $\tilde{x}(t)$:

$$\varphi_{band}(t) = \arg\big(\tilde{x}_{band}(t)\big)$$

where $\arg$ is the argument of the complex analytic signal[10].

5 – Using the ibTB toolbox (see Appendix B), the phases of each frequency band are binned into 4 equally spaced intervals from $-\pi$ to $\pi$. A Matlab sketch of steps 1–5 is given below.

[9] Recall that in both studies (Cogan & Poeppel, 2011; Magri et al., 2009) each timestep is (assumed to be) associated with a different stimulus.
[10] The argument of the analytic signal obtained, e.g., via the Hilbert transform.
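The signal-processing half of the workflow (steps 1–5) can be sketched in a few lines of Matlab for one channel and one trial. All numerical values below (sampling frequency, band edges, filter order, down-sampling factor, analysis window) are illustrative placeholders, not the parameters of the cited studies.

% Steps 1-5 for a signal x sampled at fs Hz (values are placeholders).
fs   = 1000;                            % sampling frequency (Hz)
band = [4 7.5];                         % e.g. the theta band (Hz)

% 1 - zero-phase FIR band-pass: filtfilt runs the filter forwards and
%     backwards, so the instantaneous phase is not shifted.
bc    = fir1(512, band / (fs/2));
xband = filtfilt(bc, 1, x);

% 2 - down-sampling for computational speed
xband = xband(1:4:end);

% 3 - keep only the time window of interest (indices are placeholders)
xband = xband(251:1500);

% 4 - instantaneous phase of the analytic signal, in (-pi, pi]
phi = angle(hilbert(xband));

% 5 - bin the phase into 4 equally spaced intervals over [-pi, pi]
edges = linspace(-pi, pi, 5);
[~, phibin] = histc(phi, edges);
phibin(phibin == 5) = 4;                % phase exactly equal to pi -> last bin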
Figure 1 – Standard workflow: (0) EEG signal from one subject, one channel and one trial, $x(t)$; (1) band-pass zero-phase-shift FIR filters; (2) down-sampling for computational speed; (3) selection of the part to analyse; (4) phase extraction; (5) binning (Information Breakdown Toolbox); (6) mutual information computation with bias correction techniques; (7) mutual information values for each frequency band.

6/7 – The mutual information values (and any other desired information-theoretic quantities) are computed, using bias correction techniques, with the ibTB toolbox (see Appendix B for more details).

4 Interpreting mutual information for sensory neuroscience

In this section we ask some naïve questions about the meaning that can safely be attached to the mutual information as described above, with no pretension of providing definitive answers.

The first point we want to question is the assumption that the stimulus is different at each timestep. Citing (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997), (Kayser, Montemurro, Logothetis, & Panzeri, 2009) write[11]: "This formalism makes no assumptions about which features of the stimulus elicit the response and hence is particularly suited to the analysis of complex naturalistic stimuli." However, we notice at least two potential issues with this way of proceeding.

- Firstly, it ties the information contained in the stimulus to the sampling frequency at which the response is recorded: the higher the sampling frequency, the larger the number of timesteps $n$, and hence the larger the stimulus entropy $\log_2 n$. Theoretically, as the sampling frequency tends to infinity, so does the amount of information contained in the stimulus. This makes little sense in our opinion, since for a given presentation of a stimulus in an experiment, the amount of information it contains should be a fixed number. Notice also that resampling the response signal in the workflow presented above changes the stimulus information. Finally, the information theory of continuous random variables (Cover & Thomas, 1991; Dayan & Abbott, 2001) might provide some solutions to this theoretical problem of the information tending to infinity.

- Secondly, the stimulus feature(s) that actually convey the information encoded in the neural response might take the same value several times over the window in which they are presented. In that case, their probability distribution over this window is no longer uniform, and not specifying the stimulus further can therefore lead to a miscalculation of its information.

Our second topic of interest in this section is the discretization of the neural response; in particular, we wonder about the impact of the number of bins into which the (phase) response is binned. Using the phase data provided by (Magri et al., 2009), we computed the mutual information value[12] of the most informative electrode (PO8) for 100 different binnings[13]. We always binned the response over the interval $[-\pi, \pi]$, changing only the number of equally spaced bins, from 2 to 103. The results are shown in Figure 2 below.

[11] At page 4 of their "supplemental data" file, available at: http://www.cell.com/neuron/supplemental/S0896-6273(09)00075-0
[12] Using the ibTB toolbox and the QE bias correction method. The Matlab script is available in Appendix C.
[13] This data consists of phase values between $-\pi$ and $\pi$ over 625 timesteps.
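The shape of this computation is sketched below with the naïve estimator of section 1.4.4 (the actual analysis, reported in Appendix C, used the ibTB toolbox with QE bias correction). Here phase is assumed to be a tr x n matrix of phase values, and plugin_mi is a hypothetical helper wrapping the plug-in estimator sketched earlier.

% Sweep over the number of phase bins (naive-estimator sketch).
nbinsList = 2:103;
Ivals = zeros(size(nbinsList));
for k = 1:numel(nbinsList)
    nb    = nbinsList(k);
    edges = linspace(-pi, pi, nb + 1);
    [~, Rmat] = histc(phase, edges);    % bin index of every phase value
    Rmat(Rmat == nb + 1) = nb;          % phase exactly equal to pi -> last bin
    Ivals(k) = plugin_mi(Rmat);         % hypothetical helper, see section 1.4.4
end
plot(nbinsList, Ivals), xlabel('number of bins'), ylabel('I(S,R) (bits)')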
Figure 2 – Influence of the number of bins on the mutual information value

It clearly appears that the mutual information is an increasing function of the number of bins. This leads directly to the question: what is the correct number of bins? Although we did not investigate this question further, doing so seems a necessity for further research.

5 Another use of mutual information: consonant perception

The notion of mutual information has also been used in other contexts related to hearing research. In particular, (Christiansen & Greenberg, 2012; Miller & Nicely, 1955) applied it to behavioural research on consonant perception. We describe below[14] a simplified version of the methodology these studies have in common, using our own notation to make them comparable, and then summarize their main results.

5.1 Variables

Both studies investigate the perception of a finite number of distinct consonants: 16 English consonants in (Miller & Nicely, 1955) and 11 Danish ones in (Christiansen & Greenberg, 2012). We denote this set of $n$ consonants by $C := \{c_1, c_2, \ldots, c_n\}$. In addition, the authors define several partitions of $C$ according to phonetic properties of the consonants. These properties were Voicing, Nasality, Affrication, Duration and Place of Articulation in (Miller & Nicely, 1955), and Voicing, Manner and Place in (Christiansen & Greenberg, 2012)[15]. We denote the $m$ partitions by $D_1, D_2, \ldots, D_m$. For instance, if $D_1$ is the partition corresponding to the Voicing property, and if $\{c_1, c_5, c_8, c_9, c_n\}$ are voiced and the rest voiceless, then we define $D_1^1 := \{c_1, c_5, c_8, c_9, c_n\}$ and $D_1^2 := C \setminus D_1^1$, so that $D_1 = \{D_1^1, D_1^2\}$.

The perception of the consonants (and of their features) is investigated under different conditions; in particular, the authors vary the frequency band of the auditory stimulus[16]. The physical stimuli are always the consonants, but the theoretical stimuli considered can be either the consonants or their features. The probabilities of the stimuli are fixed by the experimenters: for instance, the probability of any consonant appearing across all subjects, for a given condition, is $1/n$, and from this the probabilities of any element of the partitions $D_1, \ldots, D_m$ can be deduced[17].

[14] The reader can consult Appendix A if some details of this section remain unclear.
[15] See the cited articles for detailed explanations of these properties.
[16] (Miller & Nicely, 1955) also vary the signal-to-noise ratio of the stimulus.
[17] If $D_1 = \{D_1^1, D_1^2\}$ and $(|D_1^1|, |D_1^2|) = (\alpha, n - \alpha)$, then $P(D_1^1) = \alpha/n$ and $P(D_1^2) = 1 - \alpha/n$.

5.2 Experiment

In the experiment, the consonants are spoken[18] a fixed number of times to the subjects under each condition. At each presentation, the subjects must tick one of $n$ boxes to indicate which consonant they heard. For each condition, these answers are summarized, across subjects, in "confusion matrices", which can be described as follows:

- They are square matrices.
- Each row corresponds to a different stimulus. The stimuli can be either the $n$ consonants or the elements of any of the $m$ partitions; therefore, $m + 1$ confusion matrices are generated for each condition.
- Each column corresponds to the perceived element that the subjects reported.
- The entry $(j, k)$ of the matrix is the number of times that element $k$ was reported when stimulus $j$ was presented, across all subjects. The main diagonal entries therefore represent the frequencies of correct perceptions.

5.3 Data analysis

Summing the entries of the main diagonal of a confusion matrix and dividing the result by the sum of all its entries gives a "recognition score" between 0 and 1 for the feature and condition to which the matrix corresponds. Both studies claim that an information-theoretic analysis involving mutual information is finer than an analysis of recognition scores, because the former takes into account the "error patterns" made by the subjects, as we explain in 5.4.

Consider the stimuli to be represented by the random variable $S$ with range $\{s_1, s_2, \ldots, s_n\}$, and the response by $\mathcal{R}$ with range $\{r_1, r_2, \ldots, r_n\}$. We said previously how the probability distribution $\mathcal{S} := \{q_1, q_2, \ldots, q_n\}$ of the stimulus is known. The joint distribution of the response and the stimulus, $\mathcal{W} := \{w_{11}, w_{12}, \ldots, w_{jk}, \ldots, w_{nn}\}$, where $w_{jk} = P(S = s_j, \mathcal{R} = r_k)$, is found by dividing the confusion matrix for a given condition by the sum of all its entries[19]. From this we deduce the probability distribution of the response, $\mathcal{P} = \{p_1, p_2, \ldots, p_n\}$, where $p_k = \sum_{j=1}^{n} w_{jk}$.

The mutual information finally comes as:

$$I(\mathcal{R}, S) = H(\mathcal{P}) + H(\mathcal{S}) - H(\mathcal{W})$$

In order to compare the different mutual information values they computed, the authors systematically normalized them by the entropy of the stimulus; that is, they considered the ratio $\hat{I} := I(\mathcal{R}, S) / H(\mathcal{S})$. We assume the authors divided by $H(\mathcal{S})$ (and not by $\min(H(\mathcal{P}), H(\mathcal{S}))$, as would mathematically be expected[20]) because they interpreted $I(\mathcal{R}, S)$ as the amount of information transmitted during the perception task; $\hat{I}$ then becomes the fraction of the stimulus information transmitted during the task.

To sum up, for each condition (frequency band or signal-to-noise ratio), $m + 1$ normalized mutual information values are computed (one per confusion matrix), corresponding to the perception of the consonants and to the decoding of the $m$ phonetic features.

[18] In lists of nonsense CV syllables in (Miller & Nicely, 1955), and in CV syllables embedded in carrier sentences in (Christiansen & Greenberg, 2012).
[19] Notice that this probability distribution and the following one are only estimates, since they are computed from a finite number of samples.
[20] Indeed, since $0 \le I(\mathcal{R}, S) \le \min(H(\mathcal{R}), H(S))$, it is possible that $H(\mathcal{R}) < H(S)$, in which case the ratio $\hat{I}$ is bounded above by the ratio $H(\mathcal{R})/H(S) < 1$.
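For a given confusion matrix, these estimates take only a few lines of Matlab. The following is a naïve (plug-in) sketch with our own variable names, without the bias corrections discussed in 5.5:

% Normalized mutual information from a confusion matrix Cmat, where
% Cmat(j,k) = number of times element k was reported for stimulus j.
h  = @(p) -sum(p(p > 0) .* log2(p(p > 0)));  % entropy of a probability vector
W  = Cmat / sum(Cmat(:));                    % joint distribution W
Ps = sum(W, 2);                              % stimulus distribution (row sums)
Pr = sum(W, 1);                              % response distribution (column sums)
I    = h(Ps) + h(Pr) - h(W(:));              % I = H(P) + H(S) - H(W)
Ihat = I / h(Ps);                            % normalization by H(S)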
5.4 Results

Considering the confusion matrix for the stimuli $C$ presented under a given condition, we provide some clues about the meaning of $\hat{I}$. It lies between 0 and 1. When $\hat{I} = \alpha$ ($0 \le \alpha \le 1$), the representative subject's responses contain a fraction $\alpha$ of the stimulus information. But the reader must be aware that this is not equivalent to saying that the representative subject has correctly perceived $100\alpha\%$ of the stimuli (i.e., that the recognition score is $100\alpha\%$). The reason is that the calculation of $\hat{I}$ involves the probabilities of all responses, correct and incorrect; $\hat{I}$ therefore captures the systematicity of the errors in addition to the correct answers.

For instance, imagine that all the subjects in the experiment systematically confuse "t" with "d", but give correct answers for all the other consonants. Then the recognition score will not be 100%, but the normalized mutual information value will be equal to 1. This idea generalises to less systematic errors: if, when hearing a "t", the subjects are more likely to confuse it with a "d" than with a "k", all other characteristics of the response being equal, then the normalized mutual information value multiplied by 100 will be higher than the recognition score. This is what leads the authors of both articles to say that $\hat{I}$ is sensitive to error patterns. However, knowledge of $\hat{I}$ on its own does not allow a qualitative analysis of these patterns; a direct analysis of the confusion-matrix entries is necessary for that.
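A minimal numeric check of the first claim, on a hypothetical 4-consonant confusion matrix: the systematic "t"/"d" swap makes the response a deterministic, invertible function of the stimulus, so $\hat{I} = 1$ even though the recognition score is only 0.5.

% Hypothetical example: "t" and "d" always swapped, "k" and "p" correct.
% Rows = stimuli, columns = reported consonants (t, d, k, p).
Cmat = [  0 100   0   0 ;    % stimulus "t"
        100   0   0   0 ;    % stimulus "d"
          0   0 100   0 ;    % stimulus "k"
          0   0   0 100 ];   % stimulus "p"
score = trace(Cmat) / sum(Cmat(:));   % recognition score = 0.5
% feeding Cmat to the sketch of section 5.3 gives Ihat = 1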
Beyond these theoretical remarks, the form of the results of the two cited studies is as follows.

(Miller & Nicely, 1955) consider each feature $D_j$ as a communication channel. A high (normalized) mutual information value for the channel $D_j$ across different conditions of noise or frequency distortion indicates robustness of this communication channel to noise or frequency distortion; in this case the authors also say that the feature $D_j$ is "discriminable". A low mutual information value indicates vulnerability of the feature to noise or frequency distortion, and the authors say the feature is "non-discriminable".

The conditions in (Christiansen & Greenberg, 2012) consisted of seven frequency bands ($F_1$; $F_2$; $F_3$; $F_1+F_2$; $F_1+F_3$; $F_2+F_3$; $F_1+F_2+F_3$). Since the stimuli were the same in those seven conditions, the authors compared the mutual information values of the last four conditions with those of the first three (multi-band vs. single-band). They were investigating whether $\hat{I}_{\sum F_i} = \sum \hat{I}_{F_i}$, a property they define as the linearity of the cross-spectral integration of phonetic information. They mainly found that the integration of the information corresponding to the feature Place is highly non-linear.

5.5 Personal remarks

In computing $\hat{I}$, neither of the above studies applied bias correction techniques. (Miller & Nicely, 1955, p. 348) state explicitly: "Like most maximum likelihood estimates, this estimate will be biased to overestimate [$I(\mathcal{R}, S)$] for small samples; in the present case, however, the sample is large enough that the bias can safely be ignored." The authors provide no further justification for this statement, and although we have not investigated the matter in detail, it would be prudent to seek stronger justification. The ibTB toolbox allows the user to apply bias correction techniques to the calculation of $\hat{I}$. A Matlab script implementing this calculation[21] for the consonant-confusion matrix provided in (Christiansen & Greenberg, 2012), Table II (a), has been written (see Appendix C). We found that the "naïve" $\hat{I}$ had the value 0.2898, whereas the bias-corrected one was 0.2195.

Another criticism we want to make of (Christiansen & Greenberg, 2012) is the following. When comparing the $\hat{I}$ from different conditions (frequency-band combinations), they do not provide the statistical significance of the differences they observe. They use expressions as vague as "close to", "not quite linear", "increases only slightly", "only slightly less compressive", etc. (Christiansen & Greenberg, 2012, p. 154). We consider that a thorough investigation of the significance of these differences is required, and it therefore constitutes a path for further studies.

Conclusion

We have seen how mutual information is beginning to be used to interpret EEG/MEG signals in sensory neuroscience, within a theoretical framework stemming from spike-train analysis (de Ruyter van Steveninck et al., 1997). We presented this framework with our own notation in order to capture the mathematical details that might not appear in the cited articles. Concerning EEG/MEG responses, the low-frequency phase response is deemed to carry more information about sensory stimuli (visual or auditory, depending on the experiment) than higher-frequency responses. We implemented in Matlab the standard workflow that leads to those results. However, we noticed several potential weaknesses in the theoretical interpretation of the mutual information values, which encourage further investigation of this field of research. Finally, we acknowledged that information theory, and in particular mutual information analysis, is also used in the cognitive/behavioural study of speech perception. This diverse applicability of the notion of mutual information suggests that further research on the link between sensory stimulation, neurophysiological recordings and behavioural response (perception) is attainable.

[21] Note, however, that since we have not studied in detail the effect of the bias correction technique we used (called quadratic extrapolation), our calculations might be flawed; further verification is therefore required.

Bibliography

Christiansen, T. U., & Greenberg, S. (2012). Perceptual confusions among consonants, revisited: Cross-spectral integration of phonetic-feature information and consonant recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 147–161.

Cogan, G. B., & Poeppel, D. (2011). A mutual information analysis of neural coding of speech by low-frequency MEG phase information. Journal of Neurophysiology, 106(2), 554–563.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. Wiley-Interscience.

Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.

de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.

Kayser, C., Montemurro, M. A., Logothetis, N. K., & Panzeri, S. (2009). Spike-phase coding boosts and stabilizes information carried by spatial and temporal spike patterns. Neuron, 61(4), 597–608. doi:10.1016/j.neuron.2009.01.008

Magri, C., Whittingstall, K., Singh, V., Logothetis, N. K., & Panzeri, S. (2009). A toolbox for the fast information analysis of multiple-site LFP, EEG and spike train recordings. BMC Neuroscience, 10(1), 81.

Miller, G. A., & Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. The Journal of the Acoustical Society of America, 27(2), 338–352.
Nemenman, I., Bialek, W., & de Ruyter van Steveninck, R. (2004). Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5).

Panzeri, S., Senatore, R., Montemurro, M. A., & Petersen, R. S. (2007). Correcting for the sampling bias problem in spike train information measures. Journal of Neurophysiology, 98(3), 1064–1072.

Pola, G., Thiele, A., Hoffmann, K.-P., & Panzeri, S. (2003). An exact method to quantify the information transmitted by different mechanisms of correlational coding. Network: Computation in Neural Systems, (14), 35–60.

Rényi, A. (2007). Probability theory. Mineola, NY: Dover Publications.

Sanei, S., & Chambers, J. (2007). EEG signal processing. Chichester, England; Hoboken, NJ: John Wiley & Sons.

Schneidman, E., Bialek, W., & Berry II, M. J. (2003). Synergy, redundancy, and independence in population codes. The Journal of Neuroscience, 23(37), 11539–11553.

Strong, S., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80(1), 197–200.

Appendix A: Information Theory

Contents

Introduction
1 Discrete random variable with finite range
2 Information
  2.1 Elucidation of key terms
  2.2 Mathematical definition of information
    2.2.1 Abstract examples
    2.2.2 Practical examples
3 Study of two random variables
  3.1 Joint distribution of two random variables
    3.1.1 Experiment 1
    3.1.2 Experiment 2
  3.2 Additivity of information
  3.3 Joint entropy
  3.4 Conditional probability and conditional entropy
  3.5 Definition of mutual information
    3.5.1 Definition involving the joint distribution
    3.5.2 Definition involving the conditional entropy
  3.6 Examples
Introduction

The aim of this appendix is to explain in detail the notions of "information" and "mutual information". We try to give the most basic explanation of every concept, and the document includes a variety of examples designed to help the reader grasp the notions introduced. We first define the concept of a discrete random variable with finite range and illustrate the related notions of probability distribution, mean and variance. Then we introduce the concept of information, as it is understood in information theory, and discuss its meaning in comparison with that of the variance. Finally, we consider the notion of mutual information between two random variables, discuss its possible interpretations and compare it with the correlation coefficient.

1 Discrete random variable with finite range

A discrete random variable with finite range has three features. Firstly, the variable has a name; here, and in the rest of this appendix, we denote an arbitrary random variable by $X$. Secondly, the variable has a range, that is, a finite number of (real) values that it can take on. For instance, if the range of $X$, denoted $\mathrm{Range}(X)$, is the set of values $(x_1, x_2, \ldots, x_N)$, then only equalities of the type $X = x_k$ ($k = 1, 2, \ldots, N$) can hold. Thirdly, a random variable has a probability distribution, or more simply a distribution. A distribution, usually denoted $\mathcal{P}$, is a set of probabilities associated with the range of the random variable. Saying that $X$ has distribution $\mathcal{P} = (p_1, p_2, \ldots, p_N)$ means that $X$ takes on the value $x_k$ with probability $p_k$ ($k = 1, 2, \ldots, N$). For $\mathcal{P} = (p_1, p_2, \ldots, p_N)$ to be a probability distribution, it must fulfill the following conditions:

$$0 \le p_k \le 1 \quad \forall k \in \{1, 2, \ldots, N\}, \qquad \sum_{k=1}^{N} p_k = 1$$

A random variable can be considered as a random experiment having a finite number of possible outcomes $(A_1, A_2, \ldots, A_N)$, each of which occurs, each time the experiment is run, with a fixed probability, i.e. $P(A_k) = p_k$ ($k = 1, 2, \ldots, N$). When $P(A_k) = 1$, $A_k$ is called a "sure event"; when $P(A_k) = 0$, $A_k$ is called an impossible event. We now provide several examples of random variables associated with random experiments, using the notation introduced above.

- When tossing a fair coin, the outcome can be heads or tails, with probability .5 each. Hence the distribution shown in Figure 3.

Figure 3 – Probability distribution of a coin toss

Here, we can consider that the event "Heads" is $A_1$, "Tails" is $A_2$, and $p_1 = p_2 = 1/2$.

- When throwing a fair die, there are 6 possible outcomes with probability 1/6 each. Hence the distribution shown in Figure 4.

Figure 4 – Probability distribution of a fair die

Here, we can consider that the event "1" is $A_1$, "2" is $A_2$, ..., "6" is $A_6$, and $p_1 = p_2 = \cdots = p_6 = 1/6$.
- In an English game of Scrabble, the 26 letters of the alphabet and the blanks are present in the shuffling bag in the following quantities:

A 9, B 2, C 2, D 4, E 12, F 2, G 3, H 2, I 9, J 1, K 1, L 4, M 2, N 6, O 8, P 2, Q 1, R 6, S 4, T 6, U 4, V 2, W 2, X 1, Y 2, Z 1, blank 2

Since there are 100 pieces in total, dividing each of these numbers by one hundred gives the probability of each letter being picked at the first draw. We thus deduce the probability distribution of Figure 5.

Figure 5 – Probability distribution of the Scrabble letters

As in the previous examples, each letter constitutes an event with respect to the random experiment of "drawing the first piece".

- One can also create arbitrary random variables, such as the following: we define the range as $(-12, -7, 2, 6, 11)$ and the probability distribution as $(.5, .1, .2, .1, .1)$, as shown in Figure 6.

Figure 6 – Probability distribution of an arbitrary random variable

2 Information

2.1 Elucidation of key terms

The use of the word "information" can be misleading. In lay terms, information is identified with the content of a message: an individual with no knowledge of information theory would happily agree that the words "tree" and "star" do not carry the same information. Yet from the point of view of information theory, these two words can be considered to carry the same amount of information, because they both have four letters. The reader will have noticed how much the addition of the word "amount" changes the meaning of the sentence. In information theory, and in this document, we will from now on use the word information to mean amount of information, both to save space and to follow the convention in this field.

Another semantic difficulty in information theory is that the notion of information is also called entropy and uncertainty. We will treat these three words as perfectly interchangeable in what follows; in reality, they reflect three different ways of considering the same mathematical[22] quantity.

2.2 Mathematical definition of information

At our level, information is a quantity associated to a discrete random variable with finite range, and it depends only on the variable's distribution. Let $X$ be a random variable with distribution $(p_1, p_2, \ldots, p_N)$. The information of $X$ is defined by the formula[23]:

$$H(X) = \sum_{k=1}^{N} p_k \log_2 \frac{1}{p_k}$$

This can be thought of as a measure of "how unpredictable" the variable is. Notice that for random variables with finite range, the entropy is bounded above. It is maximal when $p_1 = p_2 = \cdots = p_N = 1/N$, i.e., when every outcome has the same probability of occurring; one intuitively recognises this case as the one in which the outcome of a single trial of the experiment is the most "unpredictable". In this case the entropy is $H(X) = \log_2 N$. On the other hand, the definition implies that the entropy is always non-negative, and it is minimal ($H(X) = 0$) when the outcome of a single trial is totally predictable, i.e., when one of the $p_k$'s is 1 and all the others are 0. In a nutshell, the entropy of a random variable with a range of $N$ values always lies between 0 and $\log_2 N$.

[22] In fact, this quantity can even be considered a physical one, since the term entropy was introduced by Boltzmann, who first applied information theory to physics.
[23] We may assume that none of the $p_k$'s is equal to zero. $\log_2$ is the base-2 logarithm.
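As a numerical companion to this definition, the following Matlab one-liner (the handle name entropy_bits is ours) reproduces the entropies of the examples used in this appendix:

% Entropy in bits of a distribution given as a vector of probabilities;
% terms with p = 0 contribute nothing to the sum.
entropy_bits = @(p) -sum(p(p > 0) .* log2(p(p > 0)));

entropy_bits([.5 .5])            % fair coin:  1 bit
entropy_bits(ones(1, 6) / 6)     % fair die:   log2(6) = 2.585 bits
entropy_bits([.5 .1 .2 .1 .1])   % variable of Figure 6: 1.961 bits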
The curve of Figure 7 represents the maximal entropy that a discrete random variable can possess, as a function of the number of values the variable can take on.

Figure 7 – Maximal entropy of a system as a function of its number of possible states

Information is traditionally expressed in "bits". One bit (a contraction of "binary digit") is the amount of information necessary to distinguish any two distinct objects: one bit of information is required to tell apart 0 and 1, A and J, or "John" and "my house". Because it depends only on the number of possible states and their respective weights, information is a dimensionless quantity. We come back to this point in 2.2.2.

2.2.1 Abstract examples

We now show nine arbitrary distributions[24] with their respective entropies:

Figure 8 – Nine examples of distributions with their entropy

The first, second and third rows of graphs above concern variables with ranges of two, five and twelve values, respectively. We would like to make a number of comments on these graphs:

- As we have said, for a given range the maximal entropy is attained when all the probabilities are equal; the first graphs of rows 1 and 2 fulfill this condition. It would be tempting to think that the more outcomes a random experiment has, the more entropy it shows. However, this intuition is wrong, as we explain immediately below.

- The first graph of the third row constitutes a counter-example to that intuition: a system with 12 possible states can show a lower entropy than a system with five possible states. When this is the case, we can interpret the former as "more predictable" than the latter. The shapes of the distributions in the first graphs of rows 2 and 3 give a visual representation of this idea: whereas each outcome has the same probability of occurring in the five-state system, the twelve-state system shows a very high probability of adopting state 3 and a very low probability of being in any other state.

- The reader might have noticed that the second and third graphs of the second row have the same entropy. This is because they have the same distribution up to a permutation of their values. As already mentioned, this emphasises that the entropy of a random variable is a property of its distribution only and does not depend on its particular range.

[24] Since the specific values in the range do not affect the entropy, we present all the distributions with standardised x-coordinates, to ease the visualisation of the distribution.

2.2.2 Practical examples

Entropy is a dimensionless quantity, which makes it possible to compare the entropies of totally heterogeneous systems, provided their states follow a probability distribution. It should be clear to the reader by now that a fair coin toss has an entropy of 1 bit and that a fair die throw has an entropy of $\log_2 6 = 2.585$ bits.
We might interpret this difference of entropy by saying that "there is more uncertainty in the result of throwing a die than in the result of a coin toss", which fits intuition quite well. Returning to our example of the game of Scrabble, the experiment of drawing the first letter at the beginning of the game possesses an entropy of 4.3702 bits. Computing the same quantity for a game of Finnish Scrabble, in which the distribution of the letters is different, gives 4.2404 bits: there is more uncertainty in drawing the first letter of an English Scrabble game than of a Finnish one.

A simplified version of the EuroMillions lottery game is the following. Five "main numbers" are drawn at random from a box containing fifty balls, numbered from 1 to 50. Since the order in which they are drawn does not matter, the number of possible drawings is $\binom{50}{5} = 2{,}118{,}760$. Assuming that each final draw has the same probability of occurrence, the entropy of such an experiment is $\log_2(2{,}118{,}760) = 21.0148$ bits.

Before we go on to define a new concept, let us summarise this first part of the appendix. So far, we have seen three quantities that can be extracted from a discrete random variable with finite range: the mean, the variance and the information. The mean depends heavily on the values that the variable can take on, since it represents its average value. The variance does not depend directly on the values of the range but rather on their scattering around the mean; it is therefore particularly representative of the distance between the most weighted values. The information, finally, does not depend at all on the values of the range, although it is bounded by the number of elements in the range.

It is worth answering the following question: what can be said about a random variable when we are given its information value?

1- If the information is equal to zero, then there is one value that the random variable takes on with probability 1, but we have no means of telling which value it is.

2- If there is a positive integer $N$ such that $\log_2 N$ is less than or equal to the information, then the range of the random variable contains at least $N$ distinct values with probabilities greater than 0.

3- There is no way of finding an upper bound for the number of elements in the range of the random variable: it is possible for two random variables to have the same information value but ranges with different cardinalities[25]. Moreover, if the information value is so low that the variable can be considered quite "predictable", it is still not possible to tell which values the variable is highly likely to take on.

So let us insist once again: knowing the amount of information of a random variable does not entitle us to know anything about the values in its range, nor about their respective probabilities. The second and third graphs of the second row in section 2.2.1 help to make this point clear.

[25] The cardinality of the range is the number of distinct elements that it contains.

3 Study of two random variables

3.1 Joint distribution of two random variables

Let $X$ and $Y$ be random variables with respective ranges $(x_1, x_2, \ldots, x_n)$ and $(y_1, y_2, \ldots, y_m)$ and respective distributions $(p_1, p_2, \ldots, p_n)$ and $(q_1, q_2, \ldots, q_m)$.
The joint distribution of $(X, Y)$ is the set of probabilities $(r_{11}, r_{12}, \ldots, r_{nm})$, where $r_{jk}$ ($j = 1, 2, \ldots, n$; $k = 1, 2, \ldots, m$) is the probability that "$X = x_j$ AND $Y = y_k$". Considering $X$ and $Y$ as two random experiments run in parallel, with respective possible outcomes $(x_1, x_2, \ldots, x_n)$ and $(y_1, y_2, \ldots, y_m)$, $r_{jk}$ becomes the probability that the pair $(x_j, y_k)$ is obtained as the joint outcome. We give below several examples of this notion.

3.1.1 Experiment 1

3.1.1.1 One die throw

A fair die is thrown. We associate the random variable $X$ with the events ("even number", "odd number") and the random variable $Y$ with ("1", "2", "3", "4", "5", "6"). Their respective distributions are clearly $(1/2, 1/2)$ and $(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)$. Since $X$ and $Y$ refer to the same die throw, the pairs of events ("1", "even"), ("3", "even"), ("5", "even"), ("2", "odd"), ("4", "odd"), ("6", "odd") are impossible. Each of the 6 remaining possible pairs has the same probability of occurrence, namely 1/6. A graphic representation of this joint distribution is given at the end of section 3.1.1.2.

3.1.1.2 Two-dice throw

Two distinguishable fair dice are thrown together; we assume that they show their results independently of one another. We associate the random variable $X$ with the events ("Die 1 shows an even number", "Die 1 shows an odd number") and the random variable $Y$ with ("Die 2 shows 1", ..., "Die 2 shows 6"). In this setting, all 12 pairs of values can occur, with equal probability 1/12. The joint distributions of experiments 1 and 2 are represented in Figure 9.

Figure 9 – Joint distributions for experiment 1

3.1.2 Experiment 2

A taxi driver picks up one person at random among 10 people waiting at a taxi station. Denoting by $K_i$ ($i = 1, 2, \ldots, 10$) the event "person $i$ is picked up", we have $P(K_i) = 1/10$. The taxi driver can only take people to four areas in town: a poor area, a touristic area, a business area and a shopping area. Denoting by $A_j$ ($j = 1, 2, 3, 4$) the event "a random person wants to go to area $j$", we assume that the following probabilities are known:

$$P(A_1) = 0.1, \quad P(A_2) = 0.4, \quad P(A_3) = 0.3, \quad P(A_4) = 0.2$$

In what we will call the Independent Case, the ten people constitute a perfectly random sample of the population of the town. In this case, the joint probability $P(A_j, K_i)$ ($j = 1, 2, 3, 4$; $i = 1, 2, \ldots, 10$) that person $i$ wants to go to area $j$ is $P(A_j)P(K_i)$[26]. On the other hand, if six of the ten people are tourists who definitely want to go to the touristic area, the joint distribution will change[27]. The joint distributions in these two cases are shown in Figure 10, and a sketch of the independent case follows below.

[26] As we explain further in the text, this corresponds to the mathematical definition of independence of the two random variables.
[27] We intentionally omit the details of the calculation, which would take us too far from our main purpose. Notice, however, that in this last case we have implicitly changed the distribution of the variable referring to the 4 areas, which becomes $P(A_1) = 0.04$, $P(A_2) = 0.76$, $P(A_3) = 0.12$, $P(A_4) = 0.08$. This reduces the entropy of this variable; see the examples of section 3.6 for a visual representation.
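In the independent case, the whole joint distribution is simply the outer product of the two marginal distributions, as the following Matlab sketch makes explicit:

% Independent case of experiment 2: joint distribution as an outer product.
pA = [0.1 0.4 0.3 0.2];     % P(A_j), the four areas
pK = ones(1, 10) / 10;      % P(K_i), the ten clients
Rjoint = pA' * pK;          % Rjoint(j,i) = P(A_j)P(K_i), a 4 x 10 matrix
sum(Rjoint(:))              % sanity check: the entries sum to 1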
Figure 10 – Joint distributions for experiment 2

3.1.2.1 Notion of independence

Experiments 1 and 2 above were designed to give the reader an intuitive idea of what it means for two random variables to be independent versus dependent. In experiment 2 we gave explicit names to both subcases; in experiment 1, the independent case is the two-dice throw and the dependent case is the one-die throw. The reader will have noticed that the dependence of two random variables is related to their joint distribution. The formal definition of independence for two discrete and finite random variables is the following: the random variables $X$ and $Y$, with respective ranges $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_m)$, respective distributions $(p_1, \ldots, p_n)$ and $(q_1, \ldots, q_m)$, and joint distribution $(r_{11}, r_{12}, \ldots, r_{nm})$ with $r_{jk} = P(X = x_j, Y = y_k)$, are said to be independent if for all $j$'s and all $k$'s:

$$r_{jk} = P(X = x_j)P(Y = y_k) = p_j q_k \qquad (j = 1, \ldots, n;\ k = 1, \ldots, m)$$

3.2 Additivity of information

The very mathematical definition of information endows it with an important property called additivity. This property means that if two random variables $X$ and $Y$ are independent, the information of the random variable created from all the possible pairs of values $(x, y)$ ($x \in \mathrm{Range}(X)$, $y \in \mathrm{Range}(Y)$) is equal to the sum of the information of $X$ and the information of $Y$:

$$H((X, Y)) = H(X) + H(Y)$$

3.3 Joint entropy

The quantity $H((X, Y))$ is sometimes called the "joint entropy of $X$ and $Y$"[28]. With our previous notation (3.1.2.1), $H((X, Y))$ corresponds to the information of $(X, Y)$ with distribution $(r_{11}, r_{12}, \ldots, r_{nm})$; everything that has been said about the concept of information in section 2 therefore applies here. $H((X, Y))$ is a measure of the "unpredictability" of the pairs of simultaneous values $(x_j, y_k)$ ($j = 1, \ldots, n$; $k = 1, \ldots, m$): the more "unpredictable" the pairs, the higher the joint entropy. As intuition suggests, the joint entropy is maximal when the two variables are independent (3.1.2.1), and minimal (0) when $X$ and $Y$ are both totally predictable. This can be formally proven (Rényi, 2007, pp. 557–558) and is summarised in the following inequality, which always holds:

$$0 \le H((X, Y)) \le H(X) + H(Y)$$

We might interpret it as follows: "the uncertainty of a pair of values $(x_j, y_k)$ of $(X, Y)$ cannot exceed the sum of the uncertainties of $X$ and $Y$".

3.4 Conditional probability and conditional entropy

Consider again two random variables $X$ and $Y$ with the same ranges as above. We now look at the probability that $X$ takes on a particular value, given that $Y$ takes on a specific value too. This kind of probability is called a conditional probability. Here is the notation and definition we use for the "probability that $X$ takes on the value $x_j$ given that $Y$ takes on the value $y_k$"[29]:

$$p_{j|k} := P(X = x_j \mid Y = y_k) := \frac{P(X = x_j, Y = y_k)}{P(Y = y_k)}$$

It can be proven[30] that $\mathcal{P}_k := (p_{1|k}, p_{2|k}, \ldots, p_{n|k})$ ($k = 1, 2, \ldots, m$) is a probability distribution per se. It therefore has an associated information value, $H(\mathcal{P}_k)$, which can be considered as the entropy of $X$ when $Y$ is fixed to the value $y_k$. Weighting these values by the distribution $(q_1, \ldots, q_m)$ of $Y$ leads us to the notion of conditional entropy:

$$H(X|Y) := \sum_{k=1}^{m} q_k H(\mathcal{P}_k)$$

This quantity represents the average entropy of the variable $X$ once $Y$ has taken on a specific value. Notice that the following holds[31]:

$$H((X, Y)) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$

[28] See (Cover & Thomas, 1991, p. 15).
[29] This formula applies only when $P(Y = y_k) \neq 0$; the notion of conditional probability does not make sense otherwise.
[30] See (Rényi, 2007, p. 56).
[31] See (Cover & Thomas, 1991, p. 16) for a formal proof.

3.5 Definition of mutual information

The mutual information of two random variables depends on the respective distributions of the two variables and on their joint distribution. It has been defined and interpreted in several ways:

- the relative information given by one variable about the other (Rényi, 2007, p. 558);
- the relative entropy between the joint distribution $(r_{11}, r_{12}, \ldots, r_{nm})$ and the product distribution $(p_1 q_1, p_2 q_1, \ldots, p_j q_k, \ldots, p_n q_m)$[32] (Cover & Thomas, 1991, p. 18);
- the reduction in the uncertainty of one variable due to the knowledge of the other one (Cover & Thomas, 1991, p. 20).

3.5.1 Definition involving the joint distribution

One formal definition of mutual information is the following. Let $X$ and $Y$ be defined as in 3.1.2.1. Then the mutual information of $X$ and $Y$ is

$$I(X, Y) := H(X) + H(Y) - H((X, Y))$$

If we recall that $H(X) + H(Y)$ is the joint entropy of $X$ and $Y$ in a "hypothetical case" where they are independent, then the mutual information of $X$ and $Y$ is the difference between this "hypothetical joint entropy" and the "real" joint entropy. At least two perspectives can be drawn from there:

- Since the joint entropy is strongly related to the independence of the two variables, the mutual information becomes a measure of their relative dependence. More specifically, if the joint entropy is close to its maximal possible value (i.e., the case where $X$ and $Y$ are independent), then the mutual information is low. On the contrary, if the joint entropy is far from its maximal value, then the variables $X$ and $Y$ are strongly related and the mutual information is high. This is why mutual information "can be considered as a measure of the stochastic dependence between the random variables $X$ and $Y$" (Rényi, 2007, p. 559).

- If we consider the joint entropy as a measure of the "unpredictability" of the pairs $(x_j, y_k)$, with maximum value $H(X) + H(Y)$, then the difference between these two quantities becomes the relative "loss of uncertainty" of the pairs $(x_j, y_k)$ between the case where the variables are independent and the "real" case. Hence, this difference can also be seen as the "predictability"[33] of the pairs $(x_j, y_k)$.

Although we will not prove it here[34], the mutual information cannot exceed the minimum of the entropies of $X$ and $Y$: $I(X, Y) \le \min[H(X), H(Y)]$. Notice also that $I(X, Y) = I(Y, X)$; that is to say, $X$ carries as much information about $Y$ as $Y$ does about $X$.

3.5.2 Definition involving the conditional entropy

Another equivalent, and perhaps more frequently used, definition of mutual information is:

$$I(X, Y) := H(X) - H(X|Y)$$

Given what has been said above, this quantity can be interpreted in at least two ways:

- it is the entropy of $X$ minus the average entropy of $X$ when $Y$ is fixed;
- it is the variability of $X$ which is linked to the variability of $Y$.

[32] Where we still use the same notation as in the preceding sections.
[33] Here, we implicitly consider that a loss of uncertainty is equivalent to a gain of predictability.
[34] Again, see (Rényi, 2007) for the formal proof.
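As a worked check of both definitions, consider the one-die throw of experiment 1, with $X$ the parity and $Y$ the face shown. The six equiprobable joint outcomes give $H((X, Y)) = \log_2 6$, and the parity is fully determined once the face is known, so $H(X|Y) = 0$. Hence:

$$I(X, Y) = H(X) + H(Y) - H((X, Y)) = 1 + \log_2 6 - \log_2 6 = 1 \text{ bit}$$

$$I(X, Y) = H(X) - H(X|Y) = 1 - 0 = 1 \text{ bit}$$

consistent with the "dependent case" of Figure 11 below.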
3.5 Definition of Mutual Information
The mutual information of two random variables depends on the respective distributions of the two variables and on their joint distribution. It has been defined and interpreted in several ways:
- the relative information given by one variable about the other (Rényi, 2007, p. 558);
- the relative entropy between the joint distribution $(p_{11}, p_{12}, \dots, p_{nm})$ and the product distribution $(q_1 r_1, q_2 r_1, \dots, q_j r_k, \dots, q_n r_m)$ (32) (Cover & Thomas, 1991, p. 18);
- the reduction in the uncertainty of one variable due to the knowledge of the other one (Cover & Thomas, 1991, p. 20).

3.5.1 Definition involving the joint distribution
One formal definition of mutual information is the following. Let $X$ and $Y$ be defined as in 3.1.2.1. Then the mutual information of $X$ and $Y$ is
$I(X, Y) \triangleq H(X) + H(Y) - H((X, Y))$
If we recall that $H(X) + H(Y)$ is the joint entropy of $X$ and $Y$ in a "hypothetical case" where they are independent, then the mutual information of $X$ and $Y$ is the difference between this "hypothetical joint entropy" and the "real" joint entropy. At least two perspectives can be drawn from there:
- Since the joint entropy is strongly related to the independence of the two variables, the mutual information becomes a measure of their relative dependence. More specifically, if the joint entropy is close to its maximum possible value (i.e. the case where $X$ and $Y$ are independent), then the mutual information is low. On the contrary, if the joint entropy is far from its maximum value, then the variables $X$ and $Y$ are strongly related and the mutual information is high. This is the reason why the mutual information "can be considered as a measure of the stochastic dependence between the random variables" $X$ and $Y$ (Rényi, 2007, p. 559).
- If we consider that the joint entropy is a measure of the "unpredictability" of the pairs $(x_j, y_k)$, and that the maximum value for this "unpredictability" is $H(X) + H(Y)$, then the difference between these two quantities becomes the relative "loss of uncertainty" of the pairs $(x_j, y_k)$ between the case where the variables are independent and the "real" case. Hence, this difference can also be seen as the "predictability" (33) of the pairs $(x_j, y_k)$.
Although we will not prove it here (34), the mutual information cannot exceed the minimum of the entropies of $X$ and $Y$: $I(X, Y) \le \min[H(X), H(Y)]$. Notice also that $I(X, Y) = I(Y, X)$; that is to say, $X$ carries as much information about $Y$ as $Y$ does about $X$.

3.5.2 Definition involving the conditional entropy
Another equivalent, and perhaps more often used, definition of mutual information is:
$I(X, Y) \triangleq H(X) - H(X|Y)$
Given what has been said above, this quantity can be interpreted in at least two ways:
- it is the entropy of $X$ minus the average entropy of $X$ when $Y$ is fixed;
- it is the variability of $X$ which is linked to the variability of $Y$.

(32) Where we still use the same notations as in the preceding sections.
(33) Here, we implicitly consider that a loss of uncertainty is equivalent to a gain of predictability.
(34) Again, see (Rényi, 2007) for the formal proof.
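Continuing the sketches above (again our own code, using entropy_perso from Appendix C and the table J_dep built in the sketch of 3.1.2), one can check numerically that the two definitions coincide:

pA = sum(J_dep,2)';   %marginal distribution of the areas
pK = sum(J_dep,1);    %marginal distribution of the persons
H_A  = entropy_perso(pA);
H_K  = entropy_perso(pK);
H_AK = entropy_perso(reshape(J_dep,1,[]));
I1 = H_A + H_K - H_AK;        %first definition: H(X)+H(Y)-H((X,Y))
%second definition: H(X)-H(X|Y)
HAgK = 0;
for k = 1:10
    Dk = J_dep(:,k)'/pK(k);   %distribution of the areas given person k
    HAgK = HAgK + pK(k)*entropy_perso(Dk);
end
I2 = H_A - HAgK;              %I1 and I2 agree up to rounding errors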
3.6 Examples
- The following are the values of $H(X)$, $H(Y)$ and $I(X, Y)$ for the two experiments of section 3.1.

Figure 11 - mutual information and entropies for experiments 1 and 2

The two graphs on the left are the "dependent cases"; the two on the right represent the "independent cases". As expected, the mutual information is equal to zero in the latter.
- We now provide a last example of the notion of mutual information. Let $S$ be a signal that can take on 4 different values, $s_1, s_2, s_3, s_4$, with probability 1/4 each. Let $R$ be a response that can also take on 4 distinct values, $r_1, r_2, r_3, r_4$. We will consider $R$ to be a "tracker" of the signal $S$, and distinguish 5 cases:
  o Case 1: $R$ fully tracks $S$; that is to say, there is an injective function $f: Range(S) \to Range(R)$ such that $R = f(S)$.
  o Case 2: $R$ tracks only three of the four outcomes of $S$ and provides a random outcome for the remaining value of $S$.
  o Case 3: $R$ tracks only two of $S$'s outcomes and provides a random outcome for the remaining values of $S$.
  o Case 4: $R$ tracks only one of $S$'s outcomes and provides a random outcome for the remaining values of $S$.
  o Case 5: $R$ tracks none of $S$'s outcomes and provides a random outcome for all the values of $S$.
After generating 100,000 joint occurrences of $R$ and $S$ with Matlab (a sketch of such a simulation is given below), here are the graphs of the estimated mutual information values obtained for the 5 cases, together with the respective entropies of the signal and of the response:

Figure 12 - Entropies and mutual information for the tracker example
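The original simulation script is not reproduced in this report; a minimal sketch (ours) of how the five cases could be generated, using the helper MI listed in Appendix C and taking the tracking function $f$ to be the identity, is:

N = 100000;
S = randi(4, 1, N);                 %equiprobable stimulus values s1..s4
for c = 1:5
    tracked = 5 - c;                %Case c: R tracks "tracked" outcomes of S
    R = randi(4, 1, N);             %random response by default
    idx = S <= tracked;
    R(idx) = S(idx);                %tracked outcomes: R = f(S), f = identity
    J = accumarray([S' R'], 1, [4 4])/N;   %empirical joint distribution
    I_hat = MI(sum(J,1), sum(J,2)', reshape(J,1,[]));
    fprintf('Case %d: I = %.3f bits\n', c, I_hat);
end
%Case 1 gives ~2 bits (= H(S)), Case 5 gives ~0 bits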
Appendix B: Information Breakdown toolbox

(Magri et al., 2009) developed the Information Breakdown Toolbox (ibTB) for the software Matlab (The MathWorks, Natick, MA). "This toolbox implements a novel computationally-optimized algorithm for estimating many of the main information theoretic quantities and bias correction techniques used in neuroscience applications." (Magri et al., 2009, p. 1) The central quantity that this toolbox enables the user to compute is the mutual information between a stimulus and its neural response.

Use of the toolbox
The toolbox is a folder containing several Matlab m-files. The main function to compute mutual information is information.m. With our notation from 1.4, this function takes as argument the set of recorded responses $\{r_i(t) \mid i = 1, 2, \dots, t_r;\ t = 1, 2, \dots, n\}$ and returns the output $I(S, R)$. In Matlab notation, the input of this function is a matrix Rmatlab of dimensions ch x tr x n, and the output is just a scalar. The function information.m also takes optional arguments which specify the computational techniques and the bias correction methods that the user wants to use. The user can also ask for additional outputs corresponding to the information breakdown quantities.

Finally, it is worth mentioning that the entries of the input matrix Rmatlab need to be "binned" into a finite number of non-negative integer values. The function binr.m does this.

Information breakdown
The following studies (Magri et al., 2009; Pola, Thiele, Hoffmann, & Panzeri, 2003; Schneidman et al., 2003) mention a way in which the mutual information can be decomposed:
$I = I_{lin} + I_{sig-sim} + I_{cor-ind} + I_{cor-dep}$
The toolbox can compute all the terms of this sum. Each one of them has a neuroscientific meaning that we briefly describe here (35).
- $I_{lin}$ is the sum of the mutual information values computed from each channel individually. It therefore corresponds to the mutual information of a hypothetical response whose channels would all be independent from each other (36). The formal definition of $I_{lin}$ is $I_{lin} \triangleq \sum_{c=1}^{ch} I(R_c, S)$ (a sketch of this computation is given below).
- $I_{sig-sim}$ is "the amount of redundancy [in the neural response] specifically due to signal correlations" (Magri et al., 2009, p. 5). This is why this quantity is negative or 0. Signal correlations are either correlations between several stimuli, or correlations due to the fact that several channels in the response are processing the same part of the stimulus (Schneidman et al., 2003, p. 11542).
- $I_{cor-ind}$ "reflects the contribution of stimulus-independent correlations" between several channels of the neural response (Magri et al., 2009, p. 5).
- $I_{cor-dep}$ has a technical meaning that goes beyond the scope of the present document.

(35) The reader is invited to consult the given references for more detail.
(36) Cf. 3.1 and 3.2 of Appendix A on the notions of independence and additivity of information.
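As an illustration of the first term only (our own sketch, not the toolbox's internal computation), $I_{lin}$ could be estimated by summing the per-channel values returned by the function perso_I listed in Appendix C; X is a hypothetical Channels-by-Trials-by-Stimuli array:

I_per_channel = perso_I(X);   %naive estimate of I(Rc, S) for each channel c
I_lin = sum(I_per_channel);   %Ilin: sum of the single-channel informations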
Bias correction techniques
Finally, as (Panzeri, Senatore, Montemurro, & Petersen, 2007) have pointed out, estimating the conditional probabilities by the empirical frequencies (37),
$\hat{p}_{C|s_t} \triangleq \frac{n_C(t)}{t_r}$
where $n_C(t)$ denotes the number of trials, out of the $t_r$ available, in which the response $C$ was observed for the stimulus $s_t$, is subject to a systematic error, since recovering the real probabilities would require an infinite number of trials:
$p_{C|s_t} = \lim_{t_r \to \infty} \frac{n_C(t)}{t_r}$
This so-called "limited sampling bias" can provoke an error "as big as the information value we wish to estimate" (Magri et al., 2009, p. 9). The toolbox enables the user to apply several bias correction techniques to reduce the impact of this bias. The reader is invited to consult (Magri et al., 2009) for further details. See also (Nemenman, Bialek, & de Ruyter van Steveninck, 2004; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998) for further discussions of the sampling bias problem.

(37) Cf. 1.4.3.

Appendix C: Matlab Codes

We present below the Matlab scripts used in our project. Each script is contained in an individual box.

Naive computation of information theoretic quantities

function logbase2 = log_2(x)
%log_2(x) computes the base-2 logarithm of x
logbase2 = log(x)/log(2);
end

function H = entropy_perso(P)
%Takes as argument a (1-by-n) probability mass function P and outputs the
%entropy of this distribution
P = P(find(P));          %removing all the zeros from P
H = -sum(P.*log_2(P));
end

function countvec = count_vectors(Data)
%Argument: an M-by-N-by-O matrix
%Returns:  an (M+1)-by-K matrix whose columns contain, in their M first
%          rows, the K distinct M-by-1 vectors contained in Data and, in
%          their last entry, the number of times that this vector occurs
%          in Data, across N and O.
[M, N, O] = size(Data);
pairs = reshape(Data, M, N*O);   %all the M-by-1 vectors, as columns
pairs_mod = pairs;
index = [];
l = size(pairs_mod, 2);          %current number of columns
i = 1;
while i <= l
    count = 1;
    A = pairs_mod(:, i);
    j = i + 1;
    while j < l
        if A == pairs_mod(:, j)   %true only when all entries are equal
            count = count + 1;
            pairs_mod = [pairs_mod(:, 1:j-1) pairs_mod(:, j+1:l)];  %remove duplicate
            l = l - 1;
        else
            j = j + 1;
        end
    end
    if j == l                     %last column still unchecked
        if A == pairs_mod(:, j)
            count = count + 1;
            pairs_mod = pairs_mod(:, 1:j-1);
            l = l - 1;
        end
    end
    index = [index count];
    i = i + 1;
end
countvec = [pairs_mod; index];
end

function C = Cond_entropy_perso(Data, Q)
%Arguments: Data is an M-by-N-by-O response matrix, where M is the number
%           of channels of the response, N is the number of trials and O
%           is the number of stimuli.
%           Q is the 1-by-O probability distribution of the stimuli.
%Returns:   the conditional entropy H(R|S)
[M, N, O] = size(Data);
F = zeros(1, O);                        %each F(k) corresponds to p(s)*H(R|s)
for k = 1:O                             %loop on stimuli
    G = count_vectors(Data(:, :, k));   %counts the distinct response vectors
                                        %across all trials for stimulus k
    P = G(M+1, :)/sum(G(M+1, :));       %builds the distribution p(R|s)
    F(k) = Q(k)*entropy_perso(P);       %term p(s)*H(R|s) of the expectation
end
C = sum(F);                             %expectation over the stimuli
end

function goal = binn_imitate(x, I)
%x is d-by-tr-by-s, where:
%   d is the number of dimensions of the response,
%   tr is the number of trials,
%   s is the number of time points (or stimuli).
%I is n-by-2, where n is the number of bins; each row contains the lower
%   and upper bounds of one bin.
%The function returns the matrix "goal", of size "size(x)", which contains
%at entry (w,y,z) the index k (0 <= k <= n-1) of the bin in which the
%value x(w,y,z) falls.
%This function is designed to imitate the function binr.m from the ibTB
%toolbox.
[d, tr, s] = size(x);
n = size(I, 1);                  %number of bins
goal = zeros(d, tr, s);          %size of the output
for m = 1:d                      %loop on dimensions
    for j = 1:tr                 %loop on trials
        for i = 1:s              %loop on time points
            %logical n-by-1 array with a 1 only where x(m,j,i) is in the bin
            logic = and(x(m,j,i) >= I(:,1), x(m,j,i) < I(:,2));
            %case where x(m,j,i) equals the upper bound of the last interval
            if sum(logic) == 0
                logic(n) = 1;
            end
            goal(m,j,i) = find(logic) - 1;
        end
    end
end
end

function I = perso_I(X)
%perso_I computes the mutual information of the response array X, which
%has dimensions Channels-by-Trials-by-Stimuli. The function returns the
%1-by-Channels vector I which contains the mutual information of each
%channel about the stimulus, assuming that each stimulus is equiprobable.
%
%This function computes the MI values without bias correction. It is
%designed to imitate the function information.m from the ibTB toolbox.
[Nch, Ntr, Nsti] = size(X);
I = zeros(1, Nch);
for ch = 1:Nch
    %binning the phases into 4 equally spaced bins over [-pi, pi]
    Rch = binn_imitate(X(ch,:,:), [-pi -pi/2; -pi/2 0; 0 pi/2; pi/2 pi]);
    Rbis = count_vectors(Rch);
    Joint = Rbis(2,:)/(Ntr*Nsti);   %marginal distribution of the responses
    I(ch) = entropy_perso(Joint) - Cond_entropy_perso(Rch, 1/Nsti*ones(1,Nsti));
end
end
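As a quick, informal sanity check of the helpers above (our own toy calls, not part of the original project):

H_die = entropy_perso(ones(1,6)/6);   %fair die: log2(6), about 2.585 bits
cv = count_vectors(cat(3, [1 1; 0 1], [1 0; 0 1]));
%cv lists the distinct 2-by-1 column vectors found in the data together
%with their counts across trials and stimuli: here [1;0] occurs twice,
%[1;1] and [0;1] once each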
function output = MI(P, Q, J)
%P is the 1-by-m probability distribution of R
%Q is the 1-by-n probability distribution of S
%J is the 1-by-mn joint distribution of (R,S)
%Computes the information contained in the stimulus and in the response,
%or equivalently their respective entropies, and the joint entropy.
H_S = entropy_perso(Q);
H_R = entropy_perso(P);
H_J = entropy_perso(J);
%computing the mutual information
output = H_R + H_S - H_J;
end

Toolbox call

function I = Toolbox_I(X)
%Toolbox_I computes the mutual information of the response array X.
%   Given a Channels-by-Trials-by-Stimuli array X, returns the
%   1-by-Channels vector I which contains the mutual information of each
%   channel about the stimulus. This function assumes that each stimulus
%   is equiprobable.
%   This function uses the ibTB toolbox's function information.m, and
%   computes the MI values without bias correction.
[Nch, Ntr, ignore] = size(X);
opts.nt = Ntr;
opts.method = 'dr';
opts.bias = 'naive';
I = zeros(1, Nch);   %empty variable for the mutual information values
for ch = 1:Nch
    Rch = binr(X(ch,:,:), Ntr, 4, 'eqspace', [-pi pi]);
    I(ch) = information(Rch, opts, 'I');
end
end
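As an informal cross-check (ours), the naive implementation and the toolbox call should agree on the same data, since both use 4 equally spaced phase bins on [-pi, pi] and no bias correction; X below is hypothetical random data:

X = 2*pi*rand(2, 50, 8) - pi;   %made-up 2-channel phase-like response array
I_naive   = perso_I(X);
I_toolbox = Toolbox_I(X);       %requires the ibTB toolbox on the Matlab path
%max(abs(I_naive - I_toolbox)) should be ~0 (floating-point differences only)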
Workflow

function ph = phase(x)
%extracts the phase ph(t) from a signal x(t) using the Hilbert transform
ph = angle(hilbert(x));
end

%%%%%%%% This script is designed to follow a workflow similar to the one
%%%%%%%% presented in the article: "A mutual information analysis of
%%%%%%%% neural coding of speech by low-frequency MEG phase information",
%%%%%%%% from Cogan & Poeppel.
%%%%%%%% It makes use of the information breakdown toolbox ibTB.

%===============================VARIABLES==================================
%creating/importing an input waveform X of dimensions M-by-N-by-O,
%corresponding to Channels-by-Trials-by-Stimuli
%------------------------------------
%%%the user needs to import or create X here%%%
%------------------------------------
[Nch, Ntr, Nsti] = size(X);   %note that in Cogan & Poeppel's article, the
                              %number of stimuli equals the number of time
                              %points

%creating the frequency variables
fs = 1000;        %given sampling frequency
f_Nyq = fs/2;     %Nyquist frequency
d = 4;            %down-sampling factor (must divide fs)
fs_new = fs/d;    %new sampling frequency (for further resampling)
Nbands = 3;       %number of frequency bands that we will analyse

%creating the time variables
t_init = (0:Nsti-1)/fs;         %initial time vector
t_rs = (0:(Nsti/d-1))/fs_new;   %new time vector for further resampling
k = 11;                         %absolute time limit (in s) of the response
                                %that we are interested in
Nsti_fin = k*fs_new;            %size of the down-sampled response (for
                                %1 channel, 1 trial)
t_fin = t_rs(1:Nsti_fin);       %final time vector

%creating empty arrays for the filtered signals and for the mutual
%information values
X_filt = zeros(Nch, Ntr, Nsti, Nbands);   %filtered signals
X_resized = zeros(Nsti_fin, Ntr);
MutInf = zeros(Nch, Nbands);    %one MI value per channel and per
                                %frequency band

%toolbox options
opts.nt = Ntr;
opts.method = 'dr';   %direct method
opts.bias = 'qe';     %quadratic extrapolation correction technique

%============================BAND PASS FILTER==============================
%we create in this section a bandpass filter with Nbands passbands
%[1-3; 3-5; 5-7] in Hz

%setting the absolute frequency bounds of the passbands (in Hz)
fmin_abs = [1 3 5];
fmax_abs = [3 5 7];
%conversion into normalised frequency bounds
fmin_norm = fmin_abs/f_Nyq;
fmax_norm = fmax_abs/f_Nyq;
%we choose an order of 814 as done in the source article:
b = [fir1(814, [fmin_norm(1) fmax_norm(1)]);
     fir1(814, [fmin_norm(2) fmax_norm(2)]);
     fir1(814, [fmin_norm(3) fmax_norm(3)])];

%=============================SIGNAL PROCESSING============================
for band = 1:Nbands
    for ch = 1:Nch
        for trial = 1:Ntr
            %FILTERING (zero-phase shift with filtfilt)
            X_filt(ch,trial,:,band) = ...
                filtfilt(b(band,:), 1, squeeze(X(ch,trial,:)));
            %RESAMPLING/SELECTING TIME-WINDOW
            ts = timeseries(X_filt(ch,trial,:,band), t_init);  %initial timeseries
            ts_rs = resample(ts, t_rs);          %down-sampled time series
            y = squeeze(ts_rs.data)';            %down-sampled data
            X_resized(:,trial) = y(1:Nsti_fin);  %dimensions [stimuli x trials]
        end
        %===================MUTUAL INFORMATION=============================
        %PHASE EXTRACTION/BINNING
        %phi has dimensions [1 x Ntr x Nsti_fin]:
        phi = reshape(phase(X_resized)', 1, Ntr, Nsti_fin);
        X_binned = binr(phi, Ntr, 4, 'eqspace', [-pi pi]);   %binned response
        %computing the mutual information, and setting negative values to 0
        MutInf(ch,band) = max(information(X_binned, opts, 'I'), 0);
    end
end

Influence of the number of bins on the phase mutual information value

% We analyse the impact of changing the number of bins for Magri et al.'s
% phase data.

% Loading the DATA matrix of size <no. points x no. trials x no. channels>
scriptDir = which('EEG_phase_example');
% Removing '/EEG_phase_example.m' (20 characters) from the returned path:
scriptDir = scriptDir(1:end-20);
temp = load(fullfile(scriptDir, 'data', 'EEG_data_phase'));
data = temp.data;
clear temp;
[time, Nt, Nch] = size(data);

% Setting the information theoretic analysis options:
opts.nt = Nt;
opts.method = 'dr';
opts.bias = 'qe';

K = 100;   %number of different bin counts we want to investigate
HR = zeros(Nch, K);
HRS = zeros(Nch, K);

%we put the response matrix in the form Channels x Trials x Stimuli
R = permute(data, [3 2 1]);
for ch = 1:Nch
    Rch = R(ch,:,:);
    Rch_new = zeros(K, Nt, time);
    for k = 1:K
        Rch_new(k,:,:) = binr(Rch, Nt, k+1, 'eqspace', [-pi pi]);
        [HR(ch,k), HRS(ch,k)] = entropy(Rch_new(k,:,:), opts, 'HR', 'HRS');
    end
end

I = zeros(Nch, K);   %mutual information variable
for k = 1:K
    I(:,k) = HR(:,k) - HRS(:,k);
end
plot(2:K+1, I(60,:), '.');   %plotting only the most informative channel,
                             %which is channel 60
title('mutual information of one channel as a function of the number of phase bins','fontsize',20);
xlabel('number of bins','fontsize',20);
ylabel('mutual information in bits','fontsize',20);
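To make the limited sampling bias discussed in Appendix B concrete, here is a minimal sketch (ours) using perso_I from above: the stimulus and the response are independent by construction, so the true mutual information is 0 bits, yet the naive estimate comes out positive when trials are few.

Nsti = 4; Ntr = 10;                 %few trials per stimulus
X = 2*pi*rand(1, Ntr, Nsti) - pi;   %phases drawn independently of the stimulus
I_naive = perso_I(X)                %typically clearly above 0, purely through bias
%increasing Ntr (e.g. to 10000) drives the naive estimate back towards 0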
Correction technique applied to a confusion matrix

%Computing the MI values with bias correction techniques from the
%consonant-confusion matrix provided in the article:
%"Perceptual Confusions among Consonants, Revisited - Cross-Spectral
%Integration of Phonetic-Feature Information and Consonant Recognition",
%from (Christiansen & Greenberg, 2012)

%for the 1500Hz slit we have for consonants:
C_cons = [7 11 2 3 4 2 1 1 3 1 1;
          4 15 2 1 5 0 2 2 3 1 1;
          7 7 12 3 1 1 1 0 3 0 1;
          1 0 0 14 11 3 0 0 3 3 1;
          0 0 1 5 24 5 0 0 0 0 1;
          1 0 2 9 9 7 0 1 3 0 4;
          1 3 1 0 1 0 13 3 1 6 7;
          1 7 0 1 1 0 6 9 3 2 6;
          0 2 2 0 2 3 0 3 16 4 4;
          0 0 0 2 1 1 0 1 0 24 7;
          0 0 0 1 2 3 1 0 0 11 18];
%probability distribution of the stimuli:
P_stim_cons = 1/11*ones(1,11);
%conditional probabilities (each row of C_cons sums to the 36 trials per consonant):
C_cons_cond = C_cons/36;
%probability distribution of the responses:
P_resp_cons = sum(C_cons_cond)/11;
%joint probability distribution:
P_joint_cons = reshape(C_cons/396, 1, 121);
%naive computation of the "normalised" mutual information; the normaliser
%3.46 is approximately log2(11), see the check below
MI_naive = MI(P_resp_cons, P_stim_cons, P_joint_cons)/3.46;

%%%%%%%%%%% BIAS CORRECTION TECHNIQUES USING THE TOOLBOX
%creation of the response matrix for input to the toolbox:
%we artificially put the responses in the format
%Channels-by-Trials-by-Stimuli
Response = zeros(1, 36, 11);
h = 1;   %counter for filling the 2nd dimension of the array Response
for i = 1:11       %stimulus index
    for j = 1:11   %response index
        count = C_cons(i,j);
        while count > 0
            Response(1,h,i) = j;
            count = count - 1;
            if h < 36
                h = h + 1;
            else
                h = 1;   %resetting the counter for the next stimulus
            end
        end
    end
end

%toolbox call
opts.method = 'dr';
opts.bias = 'qe';
opts.nt = 36;
MI_bias_corr = information(Response, opts, 'I')/3.46;

%bar plot
bar([MI_naive, MI_bias_corr]);
title('comparison between two calculation techniques','fontsize',20);
set(gca,'XTickLabel',{'naive','quadratic extrapolation'},'fontsize',15);
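Two informal sanity checks (ours) on the numbers used in the script above:

all(sum(C_cons,2) == 36)   %returns true: each consonant was presented 36 times
log2(11)                   %returns 3.4594...: the maximal information for 11
                           %equiprobable consonants, i.e. the normaliser 3.46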