Final Report

Neural encoding of sound stimuli
A mutual information analysis of EEG signals
Date: 06/08/2012
Author: Adrian Radillo - Intern at the Institute of Digital Healthcare
Supervisor: Dr. James Harte
IDH – Project Report
Measuring the brain’s response to sound stimuli
Contents
Introduction
1 Computing mutual information from EEG signals
   1.1 Mutual information
   1.2 EEG signal recorded in response to sound stimuli
   1.3 Common meanings of the mutual information in neuroscience
   1.4 Formalism
      1.4.1 Stimuli
      1.4.2 Response
      1.4.3 Empirical data and probability estimates
      1.4.4 Calculations
2 Two examples
   2.1 Magri et al., 2009
   2.2 Cogan & Poeppel, 2011
3 Workflow
4 Interpreting mutual information for sensory neuroscience
5 Another use of Mutual Information: Consonant perception
   5.1 Variables
   5.2 Experiment
   5.3 Data analysis
   5.4 Results
   5.5 Personal remarks
Conclusion
Bibliography
Appendix A: Information Theory
Appendix B: Information Breakdown toolbox
Appendix C: Matlab Codes
Introduction
Our project’s aim is to answer the following question: how can the notion of mutual information be used to interpret EEG data in studies of the auditory system?
After defining the notion of mutual information and providing some background on EEG signals, we present the established theoretical framework that makes use of mutual information in auditory neuroscience. In particular, we present two studies (Magri, Whittingstall, Singh, Logothetis, & Panzeri, 2009; Cogan & Poeppel, 2011) and their common workflow. Then, we take a look at another use of mutual information concerning consonant perception. Finally, we discuss substantive questions about mutual information, such as its scope and limitations in the neuroscience of the auditory system.
1 Computing mutual information from EEG signals
1.1 Mutual information
We provide in this section only a condensed account of mutual information; the reader is invited to consult Appendix A for detailed information on the subject.
Let 𝑅 and 𝑆 be two discrete random variables with finite range. Then the mutual information
between these two random variables is denoted by 𝐼(𝑅, 𝑆) and represents “…the decrease of
uncertainty [of 𝑆] due to the knowledge of 𝑅, or as the information about 𝑆 which can be gained
from the value of 𝑅” (Rényi, 2007, p. 558). Moreover mutual information is a symmetric quantity.
That is to say, “…the value of 𝑅 gives the same amount of information about 𝑆 as the value of 𝑆 gives
about 𝑅.” (Rényi, 2007, p. 558)
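For reference, the standard formula behind this verbal description (developed with the report’s notation in section 1.4.4 and in Appendix A) can be written as:

I(R, S) = \sum_{r, s} P(R = r, S = s) \log_2 \frac{P(R = r, S = s)}{P(R = r)\, P(S = s)} = H(R) - H(R \mid S) = H(S) - H(S \mid R)

The symmetry mentioned above is apparent from the first expression, since swapping R and S leaves it unchanged.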
1.2 EEG signal recorded in response to sound stimuli
An EEG signal usually consists of time-varying voltages recorded at several electrodes (or channels) in parallel. Several discrete and finite quantities can be extracted from each of these channels; common variables are the amplitude, the frequency and the phase of the signal. As far as frequency is concerned, it is common to distinguish the following frequency bands: the delta band 0.5-4 Hz, the theta band 4-7.5 Hz, the alpha band 8-13 Hz and the beta band 14-26 Hz (Sanei & Chambers, 2007, pp. 10–11). In what follows, we pay special attention to the phase in the delta and theta bands, as it has proved to “play an important role in auditory perception” (Cogan & Poeppel, 2011, p. 554).
1.3 Common meanings of the mutual information in neuroscience
For a finite and discrete valued stimulus S, and a corresponding finite and discrete valued neural
response R, we present common meanings in the literature for the quantity I(S,R), which is the
mutual information of the stimulus and the response:
• “Responses that are informative about the identity of the stimulus should exhibit larger variability for trials involving different stimuli than for trials that use the same stimulus repetitively. Mutual information is an entropy-based measure related to this idea.” (Dayan & Abbott, 2001, p. 126)
• “The mutual information measures how tightly neural responses correspond to stimuli and gives an upper bound on the number of stimulus patterns that can be discriminated by observing the neural responses.” (Schneidman, Bialek, & Berry II, 2003, p. 11540)
• “Mutual information quantifies how much of the information capacity provided by stimulus-evoked differences in neural activity is robust to the presence of trial-by-trial response variability. Alternatively, it quantifies the reduction of uncertainty about the stimulus that can be gained from observation of a single trial of the neural response.” (Magri et al., 2009, p. 3)
We will discuss these interpretations later on in section 4. For the moment, we will describe the
framework and formalism that is used in the aforementioned studies.
1.4 Formalism
In what follows, we present, with our own notation, the formalism used in (Cogan & Poeppel,
2011; Dayan & Abbott, 2001; Magri et al., 2009; Schneidman et al., 2003). This formalism will be
applied here to a hypothetical experiment where a single subject is presented the same sequence of
stimuli over a fixed number of trials, while all the other parameters are controlled. The response is
recorded, for each trial, on a fixed time window which is locked to the stimulus presentation.
1.4.1 Stimuli
The stimuli are the same at each trial and take on 𝑛 distinct values over 𝑛 timesteps. This is a
strong assumption made by (Cogan & Poeppel, 2011; Magri et al., 2009) that allows them (and us in
this section) to interchangeably talk about timesteps, samples or stimuli. We therefore consider the
uniformly distributed discrete random variable 𝑆 which has finite range {𝑠1 , 𝑠2 , … , 𝑠𝑛 } (𝑛 ∈ β„•). For a
given trial we have¹:

q_t := P(S = s_t) = \frac{1}{n}, \qquad (t = 1, 2, \ldots, n)

1.4.2 Response
If there are π‘‘π‘Ÿ trials, we consider the response for the π‘˜’th trial at the 𝑑’th timestep (𝑑 =
1,2, … , 𝑛) to be an EEG signal recorded over π‘β„Ž channels (or electrodes) in parallel. We write:
π‘˜
π‘…π‘˜ (𝑑) = (π‘Ÿ1π‘˜ (𝑑), π‘Ÿ2π‘˜ (𝑑), … , π‘Ÿπ‘β„Ž
(𝑑));
(π‘˜ = 1,2, … , π‘‘π‘Ÿ)
Since the response is assumed to be discrete, each response value at a given trial, for a given
electrode and a given sample, takes on one of π‘š distinct values. In mathematical notation we write:
∃ℜ ≔ {𝛼1 , 𝛼2 , … , π›Όπ‘š } s. t. , ∀π‘˜ ∈ {1,2, … , π‘‘π‘Ÿ}, ∀𝑖 ∈ {1,2, … , π‘β„Ž}, ∀𝑑 ∈ {1,2, … , 𝑛}, π‘Ÿπ‘–π‘˜ (𝑑) ∈ ℜ
¹ The symbol ≔ represents a definition.
The responses are repeatedly recorded over π‘‘π‘Ÿ trials in order to estimate the probabilities of the
ideal response 𝑅. We will explain below how the probability distribution of this π‘β„Ž-dimensional
random variable can be estimated from the empirical responses π‘…π‘˜ (𝑑).
1.4.3 Empirical data and probability estimates
The empirical response recorded at the trial π‘˜ and at the timestep 𝑑 is in fact a response
following the presentation of the stimulus 𝑠𝑑 . This makes the empirical data ideally suited to estimate
conditional probabilities. More precisely, for any arbitrary vector α ∈ ℜ^ch within the response’s range, we define N_tr^α(t) to be the number of times that the value α is observed at time t among all the trials. The estimated² conditional probability of the neural response R taking on the value α given the stimulus s_t is:

P(R = \alpha \mid s_t) := p_{\alpha | s_t} := \frac{N_{tr}^{\alpha}(t)}{tr}
The probability of the response taking on the value 𝜢 is then deduced by basic probability properties:
\forall \alpha \in \mathfrak{R}^{ch}: \quad P(R = \alpha) := p_{\alpha} := \sum_{t=1}^{n} p_{\alpha | s_t} \, q_t
Assuming that the range of the response 𝑅 has 𝑧 distinct elements, the last formula allows us to
compute the probability distribution of the response 𝑅, 𝒫 ≔ {𝑝1 , 𝑝2 , … , 𝑝𝑧 }.
We also want to define the probability distributions 𝒫_t := {p_{1|s_t}, p_{2|s_t}, …, p_{z|s_t}}, for t = 1, 2, …, n, which will be used in the next section.

1.4.4 Calculations
It is now straightforward to compute the entropy of the response, the entropy of the response
for a fixed stimulus, and the mutual information3 between the stimulus and the response. The
formulas are respectively:
H(R) = I(\mathcal{P}) = -\sum_{k=1}^{z} p_k \log_2 p_k

H(R \mid S) = \sum_{t=1}^{n} q_t \, I(\mathcal{P}_t) = -\frac{1}{n} \sum_{t=1}^{n} \sum_{k=1}^{z} p_{k|s_t} \log_2 p_{k|s_t}

I(R, S) = H(R) - H(R \mid S)
² See Appendix B for more details about the bias involved in this estimation, and the known correction techniques.
³ See (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997, note 20) for computing error margins for the mutual information estimation.
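To make these formulas concrete, here is a minimal Matlab sketch of the plug-in (“naïve”) estimates above. It is illustrative code of our own (the function name mi_plugin and the input format are assumptions, and none of the bias-correction techniques of Appendix B are applied); it would be saved, e.g., as mi_plugin.m.

    % Plug-in estimates of H(R), H(R|S) and I(R,S) from discretised responses.
    % Rbin is a tr-by-n matrix: Rbin(k,t) is the binned response (an integer
    % between 1 and m) recorded on trial k at timestep t, i.e. for stimulus s_t.
    function I = mi_plugin(Rbin)
        [tr, n] = size(Rbin);
        m = max(Rbin(:));                          % number of response values

        % Conditional distributions p_{alpha|s_t}, one column per stimulus
        Pcond = zeros(m, n);
        for t = 1:n
            Pcond(:, t) = accumarray(Rbin(:, t), 1, [m 1]) / tr;
        end

        % Marginal response distribution (uniform stimulus, q_t = 1/n)
        Pr = mean(Pcond, 2);

        HR = -sum(Pr(Pr > 0) .* log2(Pr(Pr > 0)));   % H(R)
        plogp = Pcond .* log2(Pcond);
        plogp(Pcond == 0) = 0;                       % convention 0*log2(0) = 0
        HRS = -mean(sum(plogp, 1));                  % H(R|S)
        I = HR - HRS;                                % I(R,S)
    end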
2 Two examples
2.1 Magri et al., 2009
Considering “…EEGs recorded from a male volunteer with a 64-channel electrode cap while the
subject was fixating the screen during the repeated presentation of a 5 s-long naturalistic color movie
presented binocularly”4 (Magri et al., 2009, p. 17), the authors investigated “…which frequency
bands, which signal features (phase or amplitude), and which electrode locations better encoded the
visual features present in movies with naturalistic dynamics” (Magri et al., 2009, p. 17).
With the notation introduced in section 1.4, the authors defined the variables as follows:
• n = 625; tr = 31; ch⁵ = 1; m = 4
• ℜ_phase = { [−π, −π/2), [−π/2, 0), [0, π/2), [π/2, π] }, where the unit is radians.
• ℜ_power = { [0, 18), [18, 44), [44, 84), [84, 3097] }, where the unit is μV².
Given this, they computed 𝐼(𝑅, 𝑆) for four different frequency bands of the signal (delta, theta,
alpha and beta bands), for the 64 electrodes, and for the two signal features (phase and power).
They were looking for the frequency bands, the channels and the signal feature which presented
the highest mutual information values. They “…found that only low frequency ranges (delta) were
informative, and that phase was far more informative than power at all electrodes” (Magri et al.,
2009, pp. 18–20). Moreover, the most informative channel seems to be the electrode PO8.
2.2 Cogan & Poeppel, 2011
We will only present here an outline of the study from (Cogan & Poeppel, 2011). The authors
were interested in determining:
– whether the Delta, Theta_low and Theta_high subbands of a MEG⁶ signal of a subject listening to speech sentences carry more information about the stimulus (mutual information) than higher frequency bands;
– whether the speech information processed in these three subbands is independent or not (in which case it would be redundant).
We present below only the part of the study dealing with the first question. The authors presented
three different recorded English sentences, 32 times each, to 9 subjects undergoing a MEG recording
through 157 channels7. The first 11 seconds of each presentation were analysed.
With the notation introduced in section 1.4, the authors defined the variables as follows:
• n = 2,750; tr = 32; ch⁸ = 1; m = 4
• ℜ = { [−π, −π/2), [−π/2, 0), [0, π/2), [π/2, π] }, where the unit is radians.

⁴ See (Magri, Whittingstall, Singh, Logothetis, & Panzeri, 2009) for a full description of the experiment.
⁵ Notice that the authors perform a mutual information analysis on each channel individually.
⁶ Although this study was not performed with EEG data, its methodology is still relevant for it.
⁷ See (Cogan & Poeppel, 2011) for full details on the experimental setting. See also section 3 for the schematic workflow that the authors followed to process their data.
⁸ Notice that the authors perform a mutual information analysis on each channel individually.
Given this, they computed 𝐼(𝑅, 𝑆) for 24 different frequency bands of the signal (ranging from 0
Hz to 50 Hz), for the 157 electrodes, and for the 3 sentences. A general workflow similar to the one
they applied to the MEG signal is described in section 3.
Part of the results from this study are:
– The mutual information is higher, across all subjects, trials and sentences, for the three subbands Delta, Theta_low and Theta_high than for higher frequency bands.
– The highest values of mutual information in the three frequency subbands appeared to be on electrodes corresponding to “auditory regions in superior temporal cortex” (Cogan & Poeppel, 2011, p. 558).
3 Workflow
Beyond the theoretical results of the previous studies, we analysed their workflow for processing the EEG/MEG signals. Since the two workflows were similar in many respects, we present in Figure 1 a standard workflow whose steps are common to both studies. The reader is invited to consult Appendix C for a detailed Matlab script that implements this workflow. We now comment on the steps of this workflow.
0 – An EEG signal π‘₯(𝑑) is recorded at a fixed sample frequency on a time window βˆ†π‘‘ following the
presentation of a stimulus9, over π‘β„Ž channels, π‘‘π‘Ÿ trials and π‘˜ subjects. The workflow of Figure 1 is
applied repeatedly along all channels, trials and subjects.
1 – The signal π‘₯(𝑑) is filtered with Finite Impulse Response (FIR) filters into the frequency bands that
we want to investigate: π‘₯π‘π‘Žπ‘›π‘‘1 (𝑑), π‘₯π‘π‘Žπ‘›π‘‘2 (𝑑), etc. It is important to note that the filtering must not
affect the value of the instantaneous phase of the signal, since the final computations will involve
those values.
2 – If necessary, the signal can be down-sampled. This has the main effect of reducing the size of the
data, and therefore increasing the computational speed.
3 – It is also likely that only a specific time window, smaller than βˆ†π‘‘, is of interest for the analysis. The
unnecessary data is therefore cut at this stage.
4 – The instantaneous phase (between −π and π) is extracted from the analytic signal x̃(t):

θ_band_i(t) = arg¹⁰( x̃_band_i(t) )
5 – Using the ibTB toolbox (see Appendix B), the phases of each frequency band are binned into 4 equally spaced intervals from −π to π.
⁹ Recall that in both studies (Cogan & Poeppel, 2011; Magri et al., 2009) each time step is (assumed to be) associated to a different stimulus.
¹⁰ Where arg means the argument of the complex analytic signal.
Figure 1 - Standard workflow: 0 – EEG signal from 1 subject, 1 channel & 1 trial: x(t); 1 – Bandpass zero-phase-shift FIR filters; 2 – Down-sampling for computational speed; 3 – Selecting the part to analyse; 4 – Phase extraction; 5 – Binning (Information Breakdown Toolbox); 6 – Mutual information computation with bias correction techniques (Information Breakdown Toolbox); 7 – Mutual information values for each frequency band.
6/7 – The mutual information values (and other desired information-theoretic quantities) are
computed using bias correction techniques with the ibTB toolbox (see Appendix B for more details).
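As an illustration of steps 1–5 for a single channel and a single trial, here is a minimal Matlab sketch. It is our own illustrative code with assumed parameter values, not the Appendix C script; the ibTB calls of steps 6–7 are omitted, and the Signal Processing Toolbox is required for fir1, filtfilt, downsample and hilbert.

    fs   = 1000;                          % sampling frequency in Hz (assumption)
    band = [4 7.5];                       % frequency band of interest, e.g. theta
    x    = randn(1, 10*fs);               % placeholder for the recorded EEG x(t)

    % 1 - Zero-phase-shift band-pass FIR filtering (filtfilt cancels the phase lag)
    b     = fir1(500, band/(fs/2));       % order-500 band-pass FIR filter
    xBand = filtfilt(b, 1, x);

    % 2 - Down-sampling to reduce the data size
    ds    = 4;
    xBand = downsample(xBand, ds);
    fsNew = fs/ds;

    % 3 - Keeping only the time window of interest (here 1 s to 6 s)
    xBand = xBand(1*fsNew : 6*fsNew);

    % 4 - Instantaneous phase from the analytic signal
    phase = angle(hilbert(xBand));        % values in (-pi, pi]

    % 5 - Binning the phase into 4 equispaced bins on [-pi, pi]
    edges = linspace(-pi, pi, 5);
    [~, phaseBin] = histc(phase, edges);  % bin index (1 to 4) for each sample
    phaseBin(phaseBin == 5) = 4;          % a phase exactly equal to pi -> last bin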
4 Interpreting mutual information for sensory neuroscience
In this section, we aim to ask some naïve questions about the meaning that can safely be attached to the mutual information as described above, without claiming to provide definitive answers.
The first topic that we want to question is the fact that the stimulus is assumed to be different at
each timestep. Citing (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997), (Kayser,
Montemurro, Logothetis, & Panzeri, 2009) write11 : “This formalism makes no assumptions about
which features of the stimulus elicit the response and hence is particularly suited to the analysis of
complex naturalistic stimuli.” However, we notice at least two potential issues with this way of
proceeding.
• Firstly, it relates the information contained in the stimulus to the sample frequency at which the response is recorded. The higher the sample frequency, the greater the amount of information contained in the stimulus. Theoretically, as the sample frequency tends to infinity, so does the amount of information contained in the stimulus. This does not make much sense in our opinion, since for a given presentation of a stimulus in an experiment, the amount of information it contains should be a fixed number. Notice also that resampling the response’s signal in the workflow presented above has an effect on the stimulus’ information. Note finally that the information theory of continuous random variables (Cover & Thomas, 1991; Dayan & Abbott, 2001) might provide some solutions to the aforementioned theoretical problem of the information tending to infinity.
• Secondly, the stimulus feature(s) that actually convey the information encoded in the neural response might take the same value several times within the time window over which the stimulus is presented. If this is the case, the stimulus’ probability distribution over this time window is no longer uniform. Failing to specify the stimulus further can therefore lead to a miscalculation of its information.
Our second topic of interest in this section is the discretization of the neural response. In particular, we wonder about the impact of the number of bins into which the (phase) response is binned. Using the phase data provided by (Magri et al., 2009), we computed the mutual information value¹² of the most informative electrode (PO8) for 100 different numbers of bins¹³. The results are the following:
¹¹ On page 4 of their “supplemental data” file available at: http://www.cell.com/neuron/supplemental/S0896-6273(09)00075-0
¹² Using the ibTB toolbox and the QE bias correction method. The Matlab script is available in Appendix C.
¹³ This data consists of phase values between −π and π over 625 timesteps. We always binned the response in the interval [−π, π], changing only the number of equispaced bins, from 2 to 103.
Figure 2 – Influence of the number of bins on the mutual information value
It clearly appears that the mutual information is an increasing function of the number of bins. This leads directly to the question: what is the correct number of bins? Although we did not investigate this question further, answering it seems a necessity for further research.
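A minimal Matlab sketch of this bin-count sweep is given below. It is illustrative only: phaseData is an assumed tr-by-n matrix of phases, and mi_plugin is the uncorrected plug-in estimator sketched in section 1.4.4, whereas the report’s actual values were obtained with the ibTB toolbox and QE bias correction.

    nBinsList = 2:103;
    Ivals     = zeros(size(nBinsList));
    for idx = 1:numel(nBinsList)
        nBins = nBinsList(idx);
        edges = linspace(-pi, pi, nBins + 1);
        [~, binned] = histc(phaseData, edges);   % element-wise bin indices
        binned(binned == nBins + 1) = nBins;     % a phase exactly equal to pi
        Ivals(idx) = mi_plugin(binned);          % plug-in I(R,S) in bits
    end
    plot(nBinsList, Ivals);
    xlabel('Number of bins'); ylabel('I(R,S) [bits]');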
5 Another use of Mutual Information: Consonant perception
The notion of mutual information has also been used in other contexts relating to hearing
research. In particular, (Christiansen & Greenberg, 2012; Miller & Nicely, 1955) have applied it to
behavioral research on consonant perception. We will describe below14 a simplified version of the
methodology that these studies have in common, using our own notations to make them
comparable. We will further summarize their main results.
5.1 Variables
Both studies are investigating the perception of a finite number of distinct consonants – 16
English consonants in (Miller & Nicely, 1955) and 11 Danish ones in (Christiansen & Greenberg,
2012). We denote this set of 𝑛 consonants by 𝐢 ≔ {𝑐1 , 𝑐2 , … , 𝑐𝑛 }.
In addition to this, the authors define several partitions of 𝐢 according to different phonetic
properties that the consonants hold. These properties were: Voicing, Nasality, Affrication, Duration
and Place of Articulation in (Miller & Nicely, 1955) and Voicing, Manner and Place in (Christiansen &
Greenberg, 2012)15. We denote the π‘š partitions by 𝐷1 , 𝐷2 , … , π·π‘š . For instance, if 𝐷1 is the partition
corresponding to the Voicing property of the consonants, and if {𝑐1 , 𝑐5 , 𝑐8 , 𝑐9 , 𝑐𝑛 } are voiced and the
rest are voiceless, then we define 𝐷11 ≔ {𝑐1 , 𝑐5 , 𝑐8 , 𝑐9 , 𝑐𝑛 } and 𝐷12 ≔ 𝐢\𝐷11 so that 𝐷1 = {𝐷11 , 𝐷12 }.
¹⁴ The reader can consult Appendix A if some details from the following section remain unclear.
¹⁵ See the cited articles for a detailed explanation of these properties.
The perception of the consonants (and of their features) is investigated under different
conditions. In particular, the authors vary the frequency band of the auditory stimulus16.
The physical stimuli are always the consonants. However, the theoretical stimuli considered can
either be the consonants or their features. The probabilities of the stimuli are fixed by the
experimenters. For instance, the probability of any consonant appearing across all subjects, for a given condition, is 1/n. From this, the probabilities of any element of the partitions D_1, …, D_m can be deduced¹⁷.
5.2 Experiment
In the experiment, the consonants are spoken18 a fixed number of times to the subjects under
each condition. At each presentation, the subjects have to tick one of 𝑛 boxes to tell which
consonant they heard. For each condition, these answers can be summarized, across subjects, in
“confusion matrices” which can be described as follows:
– They are square matrices.
– Each row corresponds to a different stimulus. The stimuli can either be the n consonants or the elements of any of the m partitions; therefore, m + 1 confusion matrices are generated for each condition.
– Each column corresponds to the perceived element that the subjects reported.
– The entry (i, j) of the matrix is the number of times that the element j was reported when the stimulus i was presented, across all subjects. Therefore, the main diagonal entries represent the frequencies of correct perceptions.
5.3 Data analysis
Summing the entries of the main diagonal of a confusion matrix and dividing the result by the
sum of all the entries of the matrix gives the “recognition score” between 0 and 1 for the feature and
the condition to which the matrix corresponds. Both studies claim that an information theoretic
analysis of the results involving mutual information is finer than an analysis of the recognition scores
because the former takes into account the “error patterns” made by the subjects, as we will explain
in 5.4.
Consider the stimuli to be represented by the random variable 𝑆 with range {𝑠1 , 𝑠2 , … , 𝑠𝑛 } and
the response β„› with range {π‘Ÿ1 , π‘Ÿ2 , … , π‘Ÿπ‘› }. We said previously how the probability distribution 𝒬 ≔
{π‘ž1 , π‘ž2 , … , π‘žπ‘› } of the stimulus is known. The joint distribution of the response and the stimulus:
𝒲 ≔ {𝑀11 , 𝑀12 , … , 𝑀𝑖𝑗 , … , 𝑀𝑛𝑛 } – where 𝑀𝑖𝑗 = 𝑃(𝑆 = 𝑠𝑖 , β„› = π‘Ÿπ‘— ) – can be found by dividing the
confusion matrix for a given condition by the sum of all its entries19. We deduce from this the
probability distribution of the response: 𝒫 = {𝑝1 , 𝑝2 , … , 𝑝𝑛 } ; where 𝑝𝑗 = ∑𝑛𝑖=1 𝑀𝑖𝑗 .
¹⁶ (Miller & Nicely, 1955) also vary the signal-to-noise ratio of the stimulus.
¹⁷ If D1 = {D11, D12} and (|D11|, |D12|) = (α, n − α), then P(D11) = α/n and P(D12) = 1 − α/n.
¹⁸ In lists of nonsense CV syllables in (Miller & Nicely, 1955), and in CV syllables embedded in carrier sentences in (Christiansen & Greenberg, 2012).
¹⁹ Notice that this probability distribution and the following one are only estimates, since they are computed from a finite number of samples.
The mutual information comes finally as: 𝐼(β„›, 𝑆) = 𝐻(𝒫) + 𝐻(𝒬) − 𝐻(𝒲)
In order to compare the different mutual information values that they computed, the authors systematically normalized them by the entropy of the stimulus. Hence, they considered the ratio Î := I(ℛ, S) / H(𝒬). We assume that the authors divided by H(𝒬) (and not by min(H(𝒫), H(𝒬)), as would mathematically be expected²⁰) because they interpreted I(ℛ, S) to be the amount of information transmitted during the perception task. Therefore, Î becomes the fraction of information about the stimulus that is transmitted during the task.
So, to sum up, for each condition (frequency band or signal-to-noise ratio), m + 1 normalized mutual information values are computed (one per confusion matrix), corresponding to the perception of the consonants and to the decoding of the m phonetic features.
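A minimal Matlab sketch of this computation from a raw confusion matrix is given below (an illustrative helper with an assumed name, without the bias corrections discussed in section 5.5).

    % Normalized mutual information from a confusion matrix M, where M(i,j)
    % counts how often response j was reported when stimulus i was presented.
    function Ihat = confusion_mi(M)
        W = M / sum(M(:));                              % joint distribution w_ij
        Q = sum(W, 2);                                  % stimulus distribution
        P = sum(W, 1);                                  % response distribution
        H = @(p) -sum(p(p > 0) .* log2(p(p > 0)));      % plug-in entropy
        I = H(P) + H(Q) - H(W(:));                      % I(R,S) = H(P)+H(Q)-H(W)
        Ihat = I / H(Q);                                % normalization by H(Q)
    end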
5.4 Results
Considering the confusion matrix for the stimuli C presented under a given condition, we provide some clues about the meaning of Î. It lies between 0 and 1. When Î = α (0 ≤ α ≤ 1), the representative subject’s responses contain a fraction α of the stimuli’s information. The reader must however be aware that this is not equivalent to saying that the representative subject has correctly perceived 100α % of the stimuli (i.e. that the recognition score is 100α %). The reason for this is that the calculation of Î involves the probabilities of all the responses, correct and incorrect. Therefore, Î captures the systematicity of the errors in addition to the correct answers. For instance, imagine that all the subjects in the experiment systematically confuse the “t” with the “d”, but provide correct answers for all the other consonants. Then, the recognition score will not be 100%, but the normalized information value will be equal to 1. This idea generalises to less systematic errors: if, when hearing a “t”, the subjects are more likely to confuse it with a “d” than with a “k”, all other characteristics of the response being equal, then the normalized mutual information value multiplied by 100 will be higher than the recognition score. This is what leads the authors of both articles to say that Î is sensitive to error patterns. However, knowledge of Î on its own does not allow a qualitative analysis of these patterns; a direct analysis of the confusion-matrix entries is necessary for this. Beyond these theoretical remarks, the results from the two cited studies are presented below.
(Miller & Nicely, 1955) consider each feature D_i as a communication channel. A high (normalized) mutual information value for the channel D_i across different conditions of noise or frequency distortion indicates that this communication channel is robust to noise or frequency distortion; in this case, the authors also say that the feature D_i is “discriminable”. A low mutual information value indicates that the feature is vulnerable to noise or frequency distortion, and the authors then say that the feature is “non-discriminable”.
(Christiansen & Greenberg, 2012)’s conditions consisted of seven frequency bands (f1; f2; f3; f1+f2; f1+f3; f2+f3; f1+f2+f3). Considering that the stimuli were the same in those seven conditions, the authors compared the mutual information values of the last four conditions with those of the first three (multi-band vs single-band). They were investigating whether Î_{Σf_i} = Σ Î_{f_i}. This property is defined by the authors as the linearity of the cross-spectral integration of the phonetic information. They mainly found that the integration of the information corresponding to the feature Place is highly non-linear.

²⁰ Indeed, since 0 ≤ I(R, S) ≤ min(H(R), H(S)), it is possible that H(R) < H(S), in which case the ratio Î becomes bounded above by the ratio H(R)/H(S) < 1.
5.5 Personal remarks
In computing their value Î, none of the above studies applied bias correction techniques. (Miller & Nicely, 1955, p. 348) state explicitly: “Like most maximum likelihood estimates, this estimate will be biased to overestimate [I(ℛ, S)] for small samples; in the present case, however, the sample is large enough that the bias can safely be ignored.” The authors provide no further justification for their statement, and although we have not investigated the matter in detail, it would be prudent to seek stronger justifications.
The ibTB toolbox allows the user to apply these bias correction techniques to the calculation of
𝐼̂. A Matlab script implementing this calculation21 for the consonant-confusion matrix provided in
(Christiansen & Greenberg, 2012) – Table II (a) – has been written (see Appendix C). We found that the “naïve” Î had a value of 0.2898, whereas the bias-corrected one was 0.2195.
Another criticism that we want to make concerning (Christiansen & Greenberg, 2012) is the following. When comparing the Î values from different conditions (frequency-band combinations), they do not report the statistical significance of the differences they observe. They use expressions as vague as “close to”, “not quite linear”, “increases only slightly”, “only slightly less compressive”, etc. (Christiansen & Greenberg, 2012, p. 154). We consider that a thorough investigation of the significance of these differences is required, and that it therefore constitutes a path for further studies.
Conclusion
We have seen how mutual information is starting to be used to interpret EEG/MEG signals in sensory neuroscience, with a theoretical framework stemming from spike-train analysis (de Ruyter van Steveninck et al., 1997). We presented this framework with our own notation in order to capture all the mathematical details that might not appear in the cited articles. Concerning EEG/MEG responses, the low-frequency phase response is deemed to carry more information about sensory stimuli (visual and auditory, depending on the experiment) than higher-frequency responses. We implemented in Matlab the standard workflow that leads to those results. However, we noticed several potential weaknesses in the theoretical interpretation of the mutual information values, which encourage further investigation of this field of research. Finally, we acknowledged that information theory, and in particular mutual information analysis, has also been used in the cognitive/behavioral study of speech perception. This diversified applicability of the notion of mutual information suggests that further research on the link between sensory stimulation, neurophysiological recordings and behavioral response (perception) is attainable.
²¹ Note however that since we have not studied in detail the effect of the bias correction technique that we used (called quadratic extrapolation), our calculations might be flawed. Further verification is therefore required.
Bibliography
Christiansen, T. U., & Greenberg, S. (2012). Perceptual Confusions Among Consonants, Revisited: Cross-Spectral Integration of Phonetic-Feature Information and Consonant Recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 147–161.
Cogan, G. B., & Poeppel, D. (2011). A mutual information analysis of neural coding of speech by low-frequency MEG phase information. Journal of Neurophysiology, 106(2), 554–563.
Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. New York: Wiley-Interscience.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, Mass.: MIT Press.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997).
Reproducibility and Variability in Neural Spike Trains. Science, 275, 1805–1808.
Kayser, C., Montemurro, M. A., Logothetis, N. K., & Panzeri, S. (2009). Spike-Phase Coding Boosts and
Stabilizes Information Carried by Spatial and Temporal Spike Patterns. Neuron, 61(4), 597–
608. doi:10.1016/j.neuron.2009.01.008
Magri, C., Whittingstall, K., Singh, V., Logothetis, N. K., & Panzeri, S. (2009). A toolbox for the fast
information analysis of multiple-site LFP, EEG and spike train recordings. BMC Neuroscience,
10(1), 81.
Miller, G. A., & Nicely, P. E. (1955). An Analysis of Perceptual Confusions Among Some English
Consonants. The Journal of the Acoustical Society of America, 27(2), 338–352.
Nemenman, I., Bialek, W., & de Ruyter van Steveninck, R. (2004). Entropy and information in neural
spike trains: Progress on the sampling problem. Physical Review E, 69(5).
Panzeri, S., Senatore, R., Montemurro, M. A., & Petersen, R. S. (2007). Correcting for the Sampling
Bias Problem in Spike Train Information Measures. Journal of Neurophysiology, 98(3), 1064–
1072.
Pola, G., Thiele, A., Hoffmann, K.-P., & Panzeri, S. (2003). An exact method to quantify the
information transmitted by different mechanisms of correlational coding. Network: Computation in Neural Systems, 14, 35–60.
Rényi, A. (2007). Probability theory. Mineola, N.Y.: Dover Publications.
Sanei, S., & Chambers, J. (2007). EEG signal processing. Chichester, England; Hoboken, NJ: John Wiley
& Sons.
Schneidman, E., Bialek, W., & Berry II, M. J. (2003). Synergy, Redundancy, and Independence in
Population Codes. The Journal of Neuroscience, 23(37), 11539–11553.
Strong, S., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and Information in
Neural Spike Trains. Physical Review Letters, 80(1), 197–200.
Appendix A: Information Theory
Contents
Introduction
1 Discrete Random Variable with finite range
2 Information
   2.1 Elucidation of key terms
   2.2 Mathematical definition of information
      2.2.1 Abstract examples
      2.2.2 Practical examples
3 Study of two random variables
   3.1 Joint distribution of two random variables
      3.1.1 Experiment 1
      3.1.2 Experiment 2
   3.2 Additivity of information
   3.3 Joint entropy
   3.4 Conditional probability and conditional entropy
   3.5 Definition of Mutual Information
      3.5.1 Definition involving the joint distribution
      3.5.2 Definition involving the conditional entropy
   3.6 Examples
Introduction
The aim of this appendix is to explain in great detail the notions of “information” and of “mutual
information”. We will try to give the most basic explanations of every concept. Moreover, this
document includes a variety of examples designed to help the reader to grasp the new notions that
are introduced.
We first define the concept of discrete random variable with finite range and illustrate the
related notions of probability distribution, mean and variance. Then we introduce the concept of
Information, as it is considered in Information Theory, and discuss its meaning in comparison with that of the variance. Finally, we consider the notion of Mutual Information between two random
variables, discuss its possible interpretations and compare it with the correlation coefficient.
1 Discrete Random Variable with finite range
A discrete random variable with finite range has three features. Firstly the variable must have a
name. Here, and in the rest of the document, we will denote an arbitrary random variable by πœ‰.
Secondly, the variable has a Range, that is to say, it has a finite number of (real) values that it can
take on. For instance, if the range of πœ‰, that we will denote by Range(πœ‰), is the set of values
(π‘₯1 , π‘₯2 , … , π‘₯𝑛 ), then only equalities of the type πœ‰ = π‘₯𝑖 (𝑖 = 1,2, … , 𝑛) can hold. Thirdly, a random
variable has a probability distribution, or with a simpler name, a distribution. A distribution, usually
denoted by 𝒫, is a set of probabilities associated with the range of the random variable. Thus, saying
that πœ‰ has distribution 𝒫 = (𝑝1 , 𝑝2 , … , 𝑝𝑛 ) means that πœ‰ takes on the value π‘₯𝑖 with probability
𝑝𝑖 (𝑖 = 1,2, … , 𝑛). Notice that in order for 𝒫 = (𝑝1 , 𝑝2 , … , 𝑝𝑛 ) to be a probability distribution, it
must fulfill the following conditions:
0 \le p_i \le 1 \;\; \forall i \in \{1, 2, \ldots, n\}, \qquad \sum_{i=1}^{n} p_i = 1
A random variable can be considered as a Random Experiment π’œ having a finite number of
possible different outcomes (𝐴1 , 𝐴2 , … , 𝐴𝑛 ) and which shows, each time it is run, one of them with a
fixed probability, i.e. 𝑃(𝐴𝑖 ) = 𝑝𝑖 (𝑖 = 1,2, … , 𝑛). When 𝑃(𝐴𝑖 ) = 1, 𝐴𝑖 is called a “sure event” and
when 𝑃(𝐴𝑖 ) = 0, 𝐴𝑖 is called an impossible event.
We will now provide several examples of random variables associated with random experiments. The
notation that we will use is the same as the one introduced previously.
• When tossing a fair coin, the outcome can be heads or tails, with probability .5 for each. Hence the following distribution:
Figure 3 - Probability distribution of a coin toss

Here, we can consider that the event “Heads” is A_1, “Tails” is A_2, and p_1 = p_2 = 1/2.
• When throwing a fair die, there are 6 possible outcomes with probability 1/6 each. Hence the following distribution:

Figure 4 - Probability distribution of a fair die

Here, we can consider that the event “1” is A_1, “2” is A_2, …, “6” is A_6, and p_1 = p_2 = … = p_6 = 1/6.
• In an English game of scrabble, the 26 letters of the alphabet and the blanks are present in the shuffling bag in the following quantities:

Letter:    A  B  C  D  E   F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z  blank
Quantity:  9  2  2  4  12  2  3  2  9  1  1  4  2  6  8  2  1  6  4  6  4  2  2  1  2  1  2
Since there are 100 pieces in total, dividing each of these numbers by a hundred gives the
probability for each letter of being picked at the first draw. Thus, we deduce the following probability
distribution:
Figure 5 - Probability distribution of the scrabble letters
As in the previous examples, each letter constitutes an event with respect to the random
experiment of “drawing the first piece”.
• One can also create arbitrary random variables, such as the following: we define the range as (−12, −7, 2, 6, 11) and the probability distribution as (.5, .1, .2, .1, .1). Thus:
Figure 6 - Probability distribution of an arbitrary random variable
2 Information
2.1 Elucidation of key terms
The use of the word ‘information’ is misleading. In laymen’s terms, information is confused with
the content of a message. An individual having no knowledge of information theory would happily
agree that the words “tree” and “star” do not carry the same information. As a matter of fact, from
the point of view of information theory, these two words can be considered to carry the same
amount of information because they both have four letters. The reader will have noticed how much
the addition of the word “amount” changes the meaning of the sentence. In information theory, and
in this document, we will from now on always use the word information to mean amount of information. This both saves space and follows the convention in this field.
Another semantic difficulty that appears in information theory is related to the fact that the
notion of information is also called entropy and uncertainty. We will therefore consider those three
words as perfectly interchangeable in what follows. In reality, those three words reflect three
different ways of considering the same mathematical22 quantity.
2.2 Mathematical definition of information
At our level, information is a quantity that is associated to a discrete random variable with finite
range and that depends only on its distribution. Let πœ‰ be a random variable with distribution
(𝑝1 , 𝑝2 , … , 𝑝𝑛 ). The information of πœ‰ is defined by the formula23:
H(\xi) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i}
This can be thought of as a measure of “how unpredictable” the variable is. Notice that for random
variables having finite range, the value of the entropy is bounded above. It is indeed maximal when
p_1 = p_2 = … = p_n = 1/n, i.e. when every outcome has the same probability of occurring. One can
intuitively recognise this case as the one in which the outcome of a single trial of the experiment is
the most “unpredictable”. In this case, the value of the entropy is 𝐻(πœ‰) = log 2 𝑛. On the other hand,
the definition implies that the entropy is always positive, and it is minimal (𝐻(πœ‰) = 0) when the
outcome on a single trial is totally predictable, i.e. when one of the 𝑝𝑖 ′s is 1 and all the others are 0.
So in a nutshell, the entropy of a random variable with a range of n values is always contained
between 0 and log 2 𝑛. The following curve represents the maximal entropies that a discrete random
variable can possess as a function of the number of values that the variable can take on.
Figure 7 - Maximal entropy of a system as a function of its number of possible states
²² In fact, this quantity can even be considered as a physical one, since the term entropy comes from Physics, where it was used by Boltzmann long before information theory existed.
²³ We may assume that none of the p_i’s is equal to zero. log2 is the base-2 logarithm.
Information is traditionally expressed in “bits”. One “bit” (a contraction of “binary digit”) is the amount of information that is necessary to distinguish between any two distinct objects: one bit of information is required to tell apart 0 and 1, A and J, or “John” and “my house”. Because it depends only on the number of possible states and their respective weights, information is a dimensionless quantity. We will come back to this point in 2.2.2.
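As a small illustration of the definition and of its bounds, the entropy of a distribution can be computed in Matlab with a one-line helper (an assumed name, used again in the examples of section 2.2.2):

    % Plug-in entropy of a probability vector p; the p(p > 0) indexing applies
    % the convention 0*log2(0) = 0.
    H = @(p) -sum(p(p > 0) .* log2(p(p > 0)));

    H(ones(1, 8) / 8)       % uniform over 8 states: log2(8) = 3 bits (maximal)
    H([1 0 0 0 0 0 0 0])    % fully predictable outcome: 0 bits (minimal)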
2.2.1 Abstract examples
We now show nine arbitrary distributions24 with their respective entropies:
Figure 8 - nine examples of distributions with their entropy
The first, second and third rows of graphs above concern respectively variables with ranges of
two, five and twelve values. We would like to make a number of comments on these graphs:
– As we have said, for a given range, the maximal entropy is attained when all the probabilities are equal. Thus, the first graphs of rows 1 and 2 above fulfill this condition. It would be tempting to think that the more outcomes a random experiment has, the more entropy it shows. However, this is a wrong intuition, as we explain immediately below.
– The first graph of the third row constitutes a counter-example to the previous intuition. A system with 12 different possible states can show a lower entropy than a system with five possible states. When this is the case, we can interpret that the former is “more predictable” than the latter. The shapes of the distributions in the first graphs of rows 2 and 3 give a visual representation of this idea. Whereas each outcome has the same probability of occurring in the system with 5 states, the system with twelve states shows a very high probability of adopting state 3 and a very low probability of being in any other state.
– The reader might have noticed that the second and third graphs of the second row have the same entropy. This is because they have the same distribution up to a permutation of their values. As we already mentioned, this emphasises the fact that the entropy of a random variable is only a property of its distribution and does not depend on its particular range.

²⁴ Since the specific values in the range do not affect the entropy, we present all the distributions with standardised x-coordinates, to facilitate the visualisation of the distribution.

2.2.2 Practical examples
Entropy is a dimensionless quantity. This means that it is possible to compare the entropies of systems which are totally heterogeneous to one another, provided that their states follow a probability distribution.
It should be clear to the reader that a fair coin toss has an entropy of 1 bit and that a fair die
throw has an entropy of log 2 (6) = 2.585 bits. We might interpret this difference of entropy by
saying that “there is more uncertainty in the result of throwing a die than in the result of a coin toss”.
This idea fits quite well to our intuition.
If we come back to our example of the game of scrabble, the experiment of drawing the first
letter at the beginning of the game possesses an entropy of 4.3702 bits. If we compute this quantity
for a game of Finnish scrabble in which the distribution of the letters is different, we obtain a value of
4.2404 bits. So there is more uncertainty in drawing the first letter of an English scrabble than of a
Finnish one.
A simplified version of the EuroMillions lottery game is the following. Five “main numbers” are drawn at random from a box containing fifty balls, numbered from 1 to 50. Since the order in which they are drawn does not matter, the number of possible draws is \binom{50}{5} = 2,118,760. Assuming that each final draw has the same probability of occurrence, the entropy of such an experiment is log2(2,118,760) = 21.0148 bits.
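The numerical values quoted in this subsection can be reproduced with the entropy helper sketched in section 2.2 (an illustrative check only):

    H = @(p) -sum(p(p > 0) .* log2(p(p > 0)));   % as in section 2.2

    H([.5 .5])                                   % coin toss: 1 bit
    H(ones(1, 6) / 6)                            % die throw: 2.585 bits
    scrabble = [9 2 2 4 12 2 3 2 9 1 1 4 2 6 8 2 1 6 4 6 4 2 2 1 2 1 2] / 100;
    H(scrabble)                                  % first English scrabble draw: 4.3702 bits
    log2(nchoosek(50, 5))                        % EuroMillions draw: 21.0148 bits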
Before we go on to define a new concept, let us summarise this first section of the document. So
far, we have seen three different quantities that can be extracted from a discrete random variable
with finite range: the mean, the variance and the information. The mean depends heavily on the
values that the variable can take on since it represents its average value. The variance does not
depend directly on the values of the range but rather on their scattering around the mean. It is
therefore particularly representative of the spread between the most heavily weighted values. Lastly, the information does not depend at all on the values in the range, although its value is bounded by the number of elements in the range.

Finally, it is worth answering the following question:
What can be said about a random variable when we are given its information value?
1- If the information is equal to zero, then there is one value that the random variable takes on
with probability 1. But we have no means of telling which value it is.
2- If there is a positive integer 𝑛 such that log 2 𝑛 is less than or equal to the information, then the range
of the random variable contains at least 𝑛 distinct values which have corresponding probabilities
greater than 0.
3- There is no way of finding an upper bound for the number of elements in the range of the
random variable. It is therefore possible for two random variables to have the same information
value but ranges with different cardinalities25. Moreover, if the information value is very low so
that the variable can be considered as quite “predictable”, it is not possible to tell what values
the random variable is highly probable to take on.
So let us insist once again on the fact that knowing the amount of information of a random variable does not tell us anything about the values in its range, nor about their respective probabilities. The second and third graphs of the second row in section 2.2.1 help to make this point clear.
3 Study of two random variables
3.1 Joint distribution of two random variables
Let πœ‰ and πœ‚ be random variables with respective ranges (π‘₯1 , π‘₯2 , … , π‘₯π‘š ) and (𝑦1 , 𝑦2 , … , 𝑦𝑛 ) and
respective distributions (𝑝1 , 𝑝2 , … , π‘π‘š ) and (π‘ž1 , π‘ž2 , … , π‘žπ‘› ). The joint distribution of (πœ‰, πœ‚) is the set of
probabilities (π‘Ÿ11 , π‘Ÿ12 , … , π‘Ÿπ‘šπ‘› ) where π‘Ÿπ‘—π‘˜ (𝑗 = 1,2, … , π‘š ; π‘˜ = 1,2, … , 𝑛) is the probability that “πœ‰ = π‘₯𝑗
AND πœ‚ = π‘¦π‘˜ ” . Considering πœ‰ and πœ‚ as two random experiments which are run in parallel, and
considering their respective possible outcomes to be (π‘₯1 , π‘₯2 , … , π‘₯π‘š ) and (𝑦1 , 𝑦2 , … , 𝑦𝑛 ), π‘Ÿπ‘—π‘˜ becomes
the probability that the pair (π‘₯𝑗 , π‘¦π‘˜ ) is obtained as joint outcome. We give below several examples of
this notion.
3.1.1 Experiment 1
3.1.1.1 One-die throw
A fair die is thrown. We associate the random variable πœ‰ to the events (“even number”,”odd
number”) and the random variable πœ‚ to (“1”, “2”, “3”, “4”, “5”, “6”). It is clear that their respective
distributions are (1/2,1/2) and (1/6,1/6,1/6,1/6,1/6,1/6). Since πœ‰ and πœ‚ refer to the same die
throw, it is impossible to have the pairs of events (“1”,“even”), (“3”,“even”), (“5”,“even”),
(“2”,“odd”), (“4”,“odd”), (“6”,“odd”). And since each of the 6 remaining possible pairs has the same
probability of occurrence, this probability is 1/6. We show a graphic representation of this joint
distribution at the end of section 3.1.1.2.
3.1.1.2 Two-dice throw
Two fair dice that can be distinguished are thrown together. Assume that they show their results
independently of one another. We associate the random variable πœ‰ to the events (“Die 1 shows an
even number”,”Die 1 shows an odd number”) and the random variable πœ‚ to (“Die 2 shows 1”, “Die 2
shows 2”, “Die 2 shows 3”, “Die 2 shows 4”, “Die 2 shows 5”, “Die 2 shows 6”). In this setting, all 12
pairs of values can occur with equal probability 1/12.
Here are the graphical representations of the joint distributions for the two cases of experiment 1:
²⁵ The cardinality of the range is the number of distinct elements that it contains.
Figure 9 - Joint distributions for experiment 1
3.1.2 Experiment 2
A taxi driver picks up one person at random among 10 people waiting at a taxi station. Denoting
by 𝐾𝑖 (𝑖 = 1,2, … 10) the event “the person 𝑖 is picked up”, we have 𝑃(𝐾𝑖) = 1/10. The taxi driver
can only take people to four areas in town: a poor area, a touristic area, a business area and a
shopping area. Denoting by 𝐴𝑗 (𝑗 = 1,2,3,4) the event “a random person wants to go to the area j”,
we assume that the following probabilities are known:
𝑃(𝐴1) = 0.1, 𝑃(𝐴2) = 0.4, 𝑃(𝐴3) = 0.3, 𝑃(𝐴4) = 0.2
In what we will call the Independent Case, the ten people constitute a perfectly random sample
of the population of the town. In this case, the joint probability 𝑃(𝐴𝑗, 𝐾𝑖) (𝑗 = 1,2,3,4 ; 𝑖 =
1,2, … 10) that the person 𝑖 wants to go to the area 𝑗 is 𝑃(𝐴𝑗)𝑃(𝐾𝑖)26.
On the other hand, if among the ten people, six of them are tourists and definitely want to go to
the touristic area, the joint distribution will change27.
We present below the joint distributions in these two cases.
²⁶ As we will explain further in the text, this corresponds to the mathematical definition of the two random variables being independent.
²⁷ We intentionally did not include the details of the calculations because this would take us too far from our main purpose. However, notice that in this last case, we implicitly changed the distribution of the variable referring to the 4 places, which becomes: P(A1) = 0.04, P(A2) = 0.76, P(A3) = 0.12, P(A4) = 0.08. This has the effect of reducing the entropy of this variable. See section 4.5 for a visual representation.
Figure 10 - Joint distributions for experiment 2
3.1.2.1 Notion of independence
Experiments 1 and 2 above were designed to give the reader an intuitive idea of what it
means for two random variables to be independent versus dependent. In experiment 2, we gave
explicit names to both subcases. In experiment one, the independent case is the “two-dice throw”
case and the dependent case is the “one-die throw”. The reader will have noticed that the
dependence of two random variables is related to their joint distribution.
The formal definition of independence for two discrete and finite random variables is the
following:
The random variables πœ‰ and πœ‚ with respective ranges (π‘₯1 , π‘₯2 , … , π‘₯π‘š ) and (𝑦1 , 𝑦2 , … , 𝑦𝑛 ), respective
distributions (𝑝1 , 𝑝2 , … , π‘π‘š ) and (π‘ž1 , π‘ž2 , … , π‘žπ‘› ), and joint distribution (π‘Ÿ11 , π‘Ÿ12 , … , π‘Ÿπ‘šπ‘› ) (with π‘Ÿπ‘—π‘˜ =
𝑃(πœ‰ = π‘₯𝑗 , πœ‚ = π‘¦π‘˜ ) (𝑗 = 1,2, … , π‘š ; π‘˜ = 1,2, … , 𝑛) ) are said to be independent if for all 𝑗’s and for all
π‘˜’s we have:
π‘Ÿπ‘—π‘˜ = 𝑃(πœ‰ = π‘₯𝑗 )𝑃(πœ‚ = π‘¦π‘˜ ) = 𝑝𝑗 π‘žπ‘˜
(𝑗 = 1,2, … , π‘š ; π‘˜ = 1,2, … , 𝑛)
3.2 Additivity of information
The mathematical definition of information itself endows it with an important property which is
called additivity. This property means that if two random variables πœ‰ and πœ‚ are independent, the
information of the random variable created from all the possible pairs of values (π‘₯, 𝑦) (π‘₯ ∈
π‘…π‘Žπ‘›π‘”π‘’(πœ‰), 𝑦 ∈ π‘…π‘Žπ‘›π‘”π‘’(πœ‚)) is equal to the sum of the information of πœ‰ and of the information of πœ‚.
Thus:
𝐻((πœ‰, πœ‚)) = 𝐻(πœ‰) + 𝐻(πœ‚)
3.3 Joint entropy
The quantity H((ξ, η)) is sometimes called the "joint entropy of ξ and η"28. If we use our previous notations (3.1.2.1), H((ξ, η)) corresponds to the information of (ξ, η) with distribution (r_11, r_12, …, r_mn). Therefore, everything that has been said about the concept of information in section 3 applies here. H((ξ, η)) is a measure of the "unpredictability" of the pairs of simultaneous values (x_j, y_k) (j = 1,2,…,m ; k = 1,2,…,n): the more "unpredictable" the pairs are, the higher the joint entropy. As intuition suggests, the joint entropy is maximal when the two variables are independent (3.2), and is minimal (0) when ξ and η are both totally predictable. This can be formally proven (Rényi, 2007, pp. 557–558) and is summarised in the following inequality, which always holds:
0 ≤ H((ξ, η)) ≤ H(ξ) + H(η)
We might interpret it as follows: "the uncertainty of a pair of values (x_j, y_k) of (ξ, η) cannot exceed the sum of the uncertainties of ξ and η".
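The one-die case of experiment 1 gives a quick numerical illustration of this inequality, since the parity ξ is then fully determined by the face η (minimal sketch):
H = @(p) -sum(p(p>0).*log2(p(p>0)));   %entropy (in bits) of a probability vector
r = zeros(2,6);                         %joint distribution of (parity, face) for a single die
r(1,2:2:6) = 1/6;                       %"even" together with faces 2, 4, 6
r(2,1:2:5) = 1/6;                       %"odd"  together with faces 1, 3, 5
p = sum(r,2)'; q = sum(r,1);            %marginal distributions of xi and eta
fprintf('H((xi,eta)) = %.3f <= H(xi)+H(eta) = %.3f\n', H(r(:)), H(p)+H(q));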
3.4 Conditional probability and conditional entropy
Consider again two random variables ξ and η with the same ranges as above. We now look at the probability that ξ takes on a particular value, given that η takes on a specific value too. This kind of probability is called a conditional probability. Here is the notation and definition that we will use to denote the "probability that ξ takes on the value x_j given that η takes on the value y_k"29:
p_j|k ≔ P(ξ = x_j | η = y_k) ≔ P(ξ = x_j, η = y_k) / P(η = y_k)
It can be proven30 that 𝒫_k ≔ (p_1|k, p_2|k, …, p_m|k) (k = 1,2,…,n) is a probability distribution in its own right. It therefore has an information value associated with it, H(𝒫_k), which can be considered as the entropy of ξ when η is fixed to the value y_k. The set of all the H(𝒫_k), as k goes from 1 to n, forms the range of a new random variable with probability distribution (q_1, q_2, …, q_n). This leads us to the notion of conditional entropy:
H(ξ|η) ≔ ∑_{k=1}^{n} q_k H(𝒫_k)
This quantity represents the average entropy of the variable ξ once η has taken on a specific value. Notice that the following is true31:
H((ξ, η)) = H(η) + H(ξ|η) = H(ξ) + H(η|ξ)
28 See (Cover & John Wiley & Sons, 1991, p. 15).
29 This formula applies only in cases where P(η = y_k) ≠ 0. The notion of conditional probability does not make sense otherwise.
30 See (Rényi, 2007, p. 56).
31 See (Cover & John Wiley & Sons, 1991, p. 16) for a formal proof.
3.5 Definition of Mutual Information
The Mutual Information of two random variables depends on the respective distributions of the
two variables and on their joint distribution. It has been defined and interpreted in several ways:
- The relative information given by one variable about the other (Rényi, 2007, p. 558)
- The relative entropy between the joint distribution (r_11, r_12, …, r_mn) and the product distribution (p_1q_1, p_2q_1, …, p_jq_k, …, p_mq_n)32 (Cover & John Wiley & Sons, 1991, p. 18)
- The reduction in the uncertainty of one variable due to the knowledge of the other one (Cover & John Wiley & Sons, 1991, p. 20).
3.5.1 Definition involving the joint distribution
One formal definition of mutual information is the following: let ξ and η be defined as in 3.1.2.1. Then the mutual information of ξ and η is:
H(ξ, η) ≔ H(ξ) + H(η) − H((ξ, η))
If we recall that H(ξ) + H(η) is the joint entropy of ξ and η in a "hypothetical case" where they are independent, then the mutual information of ξ and η is the difference between this "hypothetical joint entropy" and the "real" joint entropy. At least two perspectives can be drawn from this:
- Since the joint entropy is strongly related to the independence of the two variables, the mutual information becomes a measure of their relative dependence. More specifically, if the joint entropy is close to its maximum possible value (i.e. the value it takes when ξ and η are independent), then the mutual information is low. On the contrary, if the joint entropy is far from its maximum value, then the variables ξ and η are strongly related and the mutual information is high. This is the reason why the mutual information "can be considered as a measure of the stochastic dependence between the random variables ξ and η" (Rényi, 2007, p. 559).
- If we consider that the joint entropy is a measure of the "unpredictability" of the pairs (x_j, y_k), and that the maximum value of this "unpredictability" is H(ξ) + H(η), then the difference between these two quantities becomes the relative "loss of uncertainty" of the pairs (x_j, y_k) between the case where the variables are independent and the "real" case. Hence, this difference can also be seen as the "predictability"33 of the pairs (x_j, y_k).
Although we will not prove it here34, the mutual information cannot exceed the smaller of the two entropies: H(ξ, η) ≤ min[H(ξ), H(η)]. Notice also that H(ξ, η) = H(η, ξ); that is to say, ξ carries as much information about η as η does about ξ.
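A minimal sketch applying this definition to the two cases of experiment 1; as expected, the mutual information is 0 bits in the independent case and 1 bit, i.e. min[H(ξ), H(η)], in the dependent case:
H  = @(p) -sum(p(p>0).*log2(p(p>0)));            %entropy (in bits) of a probability vector
MI = @(r) H(sum(r,2)) + H(sum(r,1)) - H(r(:));   %mutual information from a joint distribution r
r_indep = (ones(2,1)/2)*(ones(1,6)/6);           %two-dice throw: xi and eta independent
r_dep   = zeros(2,6);                            %one-die throw: parity determined by the face
r_dep(1,2:2:6) = 1/6;  r_dep(2,1:2:5) = 1/6;
fprintf('MI, independent case: %.4f bits\n', MI(r_indep));   %~0 (numerically)
fprintf('MI, dependent case:   %.4f bits\n', MI(r_dep));     %1 bit = min[H(xi),H(eta)]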
3.5.2 Definition involving the conditional entropy
Another equivalent, and perhaps more frequently used, definition of mutual information is:
H(ξ, η) ≔ H(ξ) − H(ξ|η)
32 Where we still use the same notations as in the preceding sections.
33 Here, we implicitly consider that a loss of uncertainty is equivalent to a gain of predictability.
34 Again, see (Rényi, 2007) for the formal proof.
Given what has been said above, this quantity can be interpreted here in at least two ways:
- It is the entropy of ξ minus the average entropy of ξ when η is fixed.
- It is the variability of ξ which is linked to the variability of η.
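The equivalence between this definition and the one of 3.5.1 can be checked numerically; the minimal sketch below again uses a small, purely hypothetical joint distribution:
H = @(p) -sum(p(p>0).*log2(p(p>0)));     %entropy (in bits) of a probability vector
r = [0.3 0.1; 0.2 0.1; 0.1 0.2];         %hypothetical joint distribution (rows: xi, columns: eta)
p = sum(r,2); q = sum(r,1);              %marginal distributions of xi and eta
H_cond = 0;
for k = 1:numel(q)
    H_cond = H_cond + q(k)*H(r(:,k)/q(k));     %q_k * H(P_k)
end
fprintf('H(xi) - H(xi|eta)          = %.4f bits\n', H(p)-H_cond);
fprintf('H(xi)+H(eta) - H((xi,eta)) = %.4f bits\n', H(p)+H(q)-H(r(:)));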
3.6 Examples
• The following are the values of H(ξ), H(η) and H(ξ, η) for the two experiments of section 3.1.
Figure 11 - mutual information and entropies for experiments 1 and 2
The two graphs on the left are the “dependent cases”, the two on the right represent the
“independent cases”. As expected, the mutual information is equal to zero in the latter.
• We now provide a last example of the notion of mutual information. Let S be a signal that can take on 4 different values S_1, S_2, S_3, S_4, each with probability 1/4. Let R be a response that can also take on 4 distinct values R_1, R_2, R_3, R_4. We will consider R to be a "tracker" of the signal S. We will therefore consider 5 different cases:
o Case 1: R fully tracks S. That is to say, there is an injective function
πœ‘: π‘…π‘Žπ‘›π‘”π‘’(𝑆) → π‘…π‘Žπ‘›π‘”π‘’(𝑅) such that 𝑅 = πœ‘(𝑆).
o Case 2: R tracks only three of the four outcomes of S and provides a random
outcome for the remaining value of S.
o Case 3: R tracks only two of S’s outcomes and provides a random outcome for the
remaining values of S.
o Case 4: R tracks only one of S’s outcomes and provides a random outcome for the
remaining values of S.
o Case 5: R tracks none of S’s outcomes and provides a random outcome for all the
values of S.
After generating 100,000 joint occurrences of R and S in Matlab, we obtain the following estimates of the mutual information for the 5 cases, together with the respective entropies of the signal and of the response:
Figure 12 - Entropies and mutual information for the tracker example
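The exact simulation script is not reproduced in this report; a simplified sketch of how the five cases could be generated and their mutual information estimated (here with a plain plug-in estimator rather than the toolbox, and with the identity mapping playing the role of the injective tracker φ) might look as follows:
%Hypothetical sketch of the tracker simulation: in case c, R copies S for the
%first (5-c) values of S and is drawn uniformly at random otherwise.
Ntrials = 100000;
H = @(p) -sum(p(p>0).*log2(p(p>0)));              %entropy (in bits) of a probability vector
for c = 1:5
    tracked = 5 - c;                              %number of outcomes of S tracked by R (4,3,2,1,0)
    S = randi(4, Ntrials, 1);                     %equiprobable signal S1..S4
    R = randi(4, Ntrials, 1);                     %random response by default
    idx = S <= tracked;                           %occurrences whose signal value is tracked...
    R(idx) = S(idx);                              %...are copied faithfully
    joint = accumarray([S R], 1, [4 4])/Ntrials;  %empirical joint distribution of (S,R)
    MIhat = H(sum(joint,2)) + H(sum(joint,1)) - H(joint(:));   %plug-in (naive) estimate
    fprintf('Case %d: estimated MI = %.3f bits\n', c, MIhat);
end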
Appendix B: Information Breakdown toolbox
(Magri et al., 2009) developed the Information Breakdown Toolbox (ibTB) for the software Matlab (The MathWorks, Natick, MA). "This toolbox implements a novel computationally-optimized algorithm for estimating many of the main information theoretic quantities and bias correction techniques used in neuroscience applications." (Magri et al., 2009, p. 1) The central quantity that this toolbox enables the user to compute is the mutual information between a stimulus and its neural response.
Use of the toolbox
The toolbox is a folder containing several Matlab m-files. The main function for computing mutual information is information.m. With our notation from 1.5, this function takes as argument {R_k(t) | k = 1,2,…,tr ; t = 1,2,…,n} and returns the output I(R,S). In Matlab notation, the input of this function is a matrix Rmatlab of dimensions ch x tr x n, and the output is a scalar. The function information.m also takes optional arguments which specify the computational techniques and the bias correction methods that the user wants to use. The user can also request additional outputs corresponding to the information breakdown quantities.
Finally, the entries of the input matrix Rmatlab must be "binned" into a finite number of non-negative integer values; the function binr.m does this.
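A minimal call sequence, mirroring the options used in the scripts of Appendix C, might look as follows (the synthetic response Rmatlab and its sizes are purely illustrative):
Rmatlab = 2*pi*rand(1,40,20) - pi;  %illustrative response: 1 channel, 40 trials, 20 stimuli, phase-like values
opts.nt     = size(Rmatlab,2);      %number of trials per stimulus
opts.method = 'dr';                 %direct method
opts.bias   = 'naive';              %no bias correction
Rbinned = binr(Rmatlab, opts.nt, 4, 'eqspace', [-pi pi]);  %bin into 4 non-negative integer values
I = information(Rbinned, opts, 'I');                       %mutual information I(R,S)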
Information breakdown
The following studies (Magri et al., 2009; Pola, Thiele, Hoffmann, & Panzeri, 2003; Schneidman et al., 2003) mention a way in which the mutual information can be decomposed:
I = I_lin + I_sig-sim + I_cor-ind + I_cor-dep
The toolbox can compute all the terms of this sum. Each one of them has a neuroscientific
meaning that we will briefly describe here35.
- I_lin is the sum of the mutual information values computed from each channel individually. It therefore corresponds to the mutual information of a hypothetical response whose channels would all be independent of each other36. Its formal definition is: I_lin ≔ ∑_{i=1}^{ch} I(r_i, S) (see the sketch after this list).
- I_sig-sim is "the amount of redundancy [in the neural response] specifically due to signal correlations" (Magri et al., 2009, p. 5). This is why this quantity is negative or 0. Signal correlations are either correlations between several stimuli, or correlations due to the fact that several channels in the response are processing the same part of the stimulus (Schneidman et al., 2003, p. 11542).
- I_cor-ind "reflects the contribution of stimulus-independent correlations" between several channels of the neural response (Magri et al., 2009, p. 5).
- I_cor-dep has a technical meaning that goes beyond the scope of the present document.
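As an illustration of the first term, I_lin can be reproduced "by hand" by summing single-channel calls to information.m, reusing the call pattern of Toolbox_I in Appendix C (the synthetic data below are purely illustrative; the remaining breakdown terms are obtained through the toolbox's additional output options and are not shown here):
X = 2*pi*rand(3,40,20) - pi;                     %illustrative 3-channel, phase-like response
[Nch, Ntr, ignore] = size(X);
opts.nt = Ntr;  opts.method = 'dr';  opts.bias = 'naive';
I_lin = 0;
for ch = 1:Nch
    Rch   = binr(X(ch,:,:), Ntr, 4, 'eqspace', [-pi pi]);
    I_lin = I_lin + information(Rch, opts, 'I'); %I(r_i,S) summed over the channels
end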
35 The reader is invited to consult the given references for more detail.
36 Cf. 3.1 and 3.2 of appendix A on the notions of additivity of information and independence.
Bias correction techniques
Finally, as (Panzeri, Senatore, Montemurro, & Petersen, 2007) have pointed out, estimating the conditional probabilities with the formula37 p_α|s_t ≔ N^α_tr(t)/tr is subject to a systematic error, since the true probabilities are only reached in the limit of an infinite number of trials: p_α|s_t = lim_{tr→∞} N^α_tr(t)/tr. This so-called "limited sampling bias" can provoke an error "as big as the information value we wish to estimate"
(Magri et al., 2009, p. 9). The toolbox enables the user to apply several bias correction techniques to
reduce the impact of this bias. The reader is invited to consult (Magri et al., 2009) for further details.
See also (Nemenman, Bialek, & de Ruyter van Steveninck, 2004; Strong, Koberle, de Ruyter van
Steveninck, & Bialek, 1998) for further discussions on the sampling bias problem.
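In practice the correction technique is selected through the opts.bias field, as in the scripts of Appendix C ('naive' for no correction, 'qe' for the quadratic extrapolation). A minimal comparison on an arbitrary, already-binned response might look like this (data and sizes are purely illustrative):
R = randi(4,1,20,10) - 1;              %illustrative binned response: 1 channel, 20 trials, 10 stimuli
opts.nt = 20;  opts.method = 'dr';
opts.bias = 'naive';  I_naive = information(R, opts, 'I');
opts.bias = 'qe';     I_qe    = information(R, opts, 'I');
fprintf('naive: %.3f bits, quadratic extrapolation: %.3f bits\n', I_naive, I_qe);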
37 Cf. 1.5.3
Appendix C: Matlab Codes
We present below the Matlab scripts used in our project. Each script is contained in an individual box.
Naive computation of information theoretic quantities
function logbase2 = log_2(x)
%log_2(x) computes the base-2 logarithm of x
logbase2=log(x)/log(2);
end
function H = entropy_perso(P)
%takes as argument a (1-by-n) probability mass function P and outputs the
%entropy of this distribution
P=P(find(P)); %Removing all the zeros from P
H=-sum(P.*log_2(P));
end
function countvec = count_vectors(Data)
%Argument: an M-by-N-by-O matrix
%Returns: an (M+1)-by-K matrix where each column contains, in its M first
%         rows, one of the M-by-1 vectors contained in Data and, in its
%         (M+1)-th entry, the number of times that this vector occurs in
%         Data, across N and O.
[M N O]=size(Data);
pairs=reshape(Data,M,N*O);
pairs_mod=pairs;
index=[];
l=size(pairs_mod,2); %number of column vectors still to examine
i=1;
while i<=l
count = 1;
A=pairs_mod(:,i);
j=i+1;
while j<l
if A==pairs_mod(:,j);
count=count+1;
pairs_mod=[pairs_mod(:,1:j-1) pairs_mod(:,j+1:l)];
l=l-1;
else
j=j+1;
end
end
if j==l
if A==pairs_mod(:,j)
count=count+1;
pairs_mod=pairs_mod(:,1:j-1);
l=l-1;
end
end
index=[index count];
i=i+1;
end
countvec=[pairs_mod;index];
end
function C = Cond_entropy_perso(Data,Q)
%Argument: Data is a M-by-N-by-O response matrix, where M is the number of
%channels of the response, N is the number of trials and O is
%the number of stimuli
%Q is the 1-by-O probability distribution of the stimuli.
%Returns: the conditional entropy H(R|S)
[M, N, O]=size(Data);
F=zeros(1,O); %each F(i) corresponds to p(s)*H(R|s)
for k=1:O
%loop on stimuli
G=count_vectors(Data(:,:,k)); %counts the distinct response vectors
%across all trials in the stimulus k
P=G(M+1,:)/sum(G(M+1,:)); %builds the distribution p(R|s)
F(k)=Q(k)*entropy_perso(P); %preprocesses the final term in the
%expectation
end
C=sum(F); %expectation
end
function goal = binn_imitate(x,I)
%x is d-by-tr-by-s, where:
%d is the number of dimensions of the response
%tr is the number of trials,
%s is the number of time points (or stimuli)
%I is n-by-2, where:
%n is the number of bins and each row contains the lower and upper bounds of a bin
%the function returns the matrix "goal" which is of size "size(x)"
%"goal" contains, at each entry (w,y,z), the index k (0<=k<=n-1) of the bin
%into which the value x(w,y,z) falls.
%this function is designed to imitate the function binr.m from the ibTB
%toolbox
[d tr s]=size(x); n=size(I,1); %size of x and I
goal=zeros(d,tr,s); %size of the output
for m=1:d
%loop on dimensions
for j=1:tr %loop on trials
for i=1:s %loop on time points
%logical n-by-1 array with a 1 only when x(j,i) is in the bin
logic=and(x(m,j,i)>=I(:,1),x(m,j,i)<I(:,2));
%case where x(i) is equal to the upper bound of the last interval
if sum(logic)==0
logic(n)=1;
end
goal(m,j,i)=find(logic)-1;
end
end
end
end
function I = perso_I(X)
%perso_I computes the Mutual Information of the response array X, which has
%dimensions Channels-by-Trials-by-Stimuli. The function returns the
%1-by-Channels vector I which contains the mutual information of each
%channel about the stimulus. The stimuli are assumed to be equiprobable and
%the responses are assumed to be phase values in [-pi, pi] (see the
%hard-coded bins below).
%
%This function computes the MI values without bias correction. It is
%designed to imitate the function information.m from the ibTB toolbox.
[Nch, Ntr, Nsti] = size(X);
I = zeros(1,Nch);
for ch=1:Nch
Rch = binn_imitate(X(ch,:,:), [-pi -pi/2;-pi/2 0;0 pi/2;pi/2 pi]);
Rbis=count_vectors(Rch);
Joint=Rbis(2,:)/(Ntr*Nsti); %marginal distribution of the binned response, pooled over trials and stimuli
I(ch)=entropy_perso(Joint)-Cond_entropy_perso(Rch,1/Nsti*ones(1,Nsti));
end
end
function output = MI(P,Q,J)
%Q is the 1-by-n probability distribution of S
%P is the 1-by-m probability distribution of R
%J is the 1-by-mn joint distribution of (R,S)
%Computing the information contained in the Stimulus and in the Response,
%or equivalently, their respective entropies, and the joint entropy.
H_S=entropy_perso(Q); %entropy of the stimulus distribution (entropy_perso uses the base-2 logarithm)
H_R=entropy_perso(P);
H_J=entropy_perso(J);
%Computing the Mutual Information
output=H_R+H_S-H_J;
end
Toolbox call
function I = Toolbox_I(X)
%Toolbox_I computes the Mutual Information of the response array X.
%   Given a Channels-by-Trials-by-Stimuli array X, returns the
%   1-by-Channels vector I which contains the mutual information of each
%   channel about the stimulus. This function assumes that each
%   stimulus is equiprobable.
%
%   This function uses the ibTB toolbox's function information.m and
%   computes the MI values without bias correction.
[Nch, Ntr, ignore] = size(X);
opts.nt     = Ntr;
opts.method = 'dr';     %direct method
opts.bias   = 'naive';  %no bias correction
I = zeros(1,Nch); %empty variable for mutual information
for ch=1:Nch
    %bin each channel into 4 equally-spaced bins on [-pi pi] (phase values)
    Rch = binr(X(ch,:,:), Ntr, 4, 'eqspace', [-pi pi]);
    I(ch) = information(Rch, opts, 'I');
end
end
Workflow
function ph = phase(x)
%extracts the phase ph(t) from a signal x(t) using the hilbert transform
ph=angle(hilbert(x));
end
%%%%%%%% This script is designed to follow a workflow similar to the one
%%%%%%%% presented in the article:
%%%%%%%% "A mutual information analysis of neural coding of speech by
%%%%%%%% low-frequency MEG phase information", from Cogan & Poeppel
%%%%%%%% It makes use of the information breakdown toolbox ibTB
%===============================VARIABLES==================================
%creating/importing an input waveform X of dimensions M-by-N-by-O
%corresponding to Channels-by-Trials-by-Stimulus
%-----------------------------------
%%%the user needs to import or create X here%%%
%-----------------------------------
[Nch Ntr Nsti]=size(X); %Note that in the Cogan & Poeppel article, the number of
%stimuli equals the number of time points
%creating the frequency variables
fs=1000;
%given sampling frequency
f_Nyq=fs/2;
%Nyquist frequency
d=4;
%down-sampling factor (must divide fs)
fs_new=fs/d;
%new sample frequency (for further resampling)
Nbands=3;
%Number of frequency bands that we will analyse
%creating the time variables
t_init=(0:Nsti-1)/fs; %initial time vector
t_rs=(0:(Nsti/d-1))/fs_new;%new time vector for further
%resampling
k=11;
%absolute time limit in sec of the response that we are
%interested in
Nsti_fin=k*fs_new;
%size of the down-sampled response (for 1 channel,
%1 trial)
t_fin=t_rs(1:Nsti_fin); %final time vector
%creating empty arrays for the filtered signals and for the
%Mutual Information values
X_filt=zeros(Nch,Ntr,Nsti,Nbands); %filtered signals
X_resized=zeros(Nsti_fin,Ntr);
MutInf=zeros(Nch,Nbands); %there is an MI value for each channel and
%each frequency band
%toolbox options
opts.nt=Ntr;
opts.method='dr'; %direct method
opts.bias='qe'; %quadratic extrapolation correction technique
%============================BAND PASS FILTER==============================
%we create in this section a bandpass filter with Nbands bandpasses
%[1-3;3-5;5-7] in Hz
%setting the absolute frequencies values of the bandpass (in Hz)
fmin_abs=[1 3 5]; fmax_abs=[3 5 7];
%conversion into normalised frequency bounds
fmin_norm=fmin_abs/f_Nyq; fmax_norm=fmax_abs/f_Nyq;
%we choose an order of 814 as done in the source article:
b=[fir1(814,[fmin_norm(1) fmax_norm(1)]);
fir1(814,[fmin_norm(2) fmax_norm(2)]);
fir1(814,[fmin_norm(3) fmax_norm(3)])];
%=============================SIGNAL PROCESSING============================
for band=1:Nbands
for ch=1:Nch
for trial=1:Ntr
%FILTERING (zero-phase shift with filtfilt); squeeze() hands filtfilt a plain column vector
X_filt(ch,trial,:,band)=filtfilt(b(band,:),1,squeeze(X(ch,trial,:)));
%RESAMPLING/SELECTING TIME-WINDOW
ts=timeseries(X_filt(ch,trial,:,band),t_init);%initial timeseries
ts_rs=resample(ts,t_rs);
%down-sampled time series
y=squeeze(ts_rs.data)';
%down-sampled data
X_resized(:,trial)=y(1:Nsti_fin); %dimension [stimulus x trials]
end
%===================MUTUAL INFORMATION=============================
%PHASE EXTRACTION/BINNING
%phi has dimension [channel x trials x time points], i.e. 1 x Ntr x Nsti_fin:
phi=reshape(phase(X_resized)',1,Ntr,Nsti_fin);
X_binned= binr(phi, Ntr, 4, 'eqspace', [-pi pi]);%binned response
%Computing the mutual information, and setting negative values to 0
MutInf(ch,band)=max(information(X_binned,opts,'I'),0);
end
end
Influence of the number of bins on the phase mutual information value
% We analyse the impact of changing the number of bins for Magri et al. 's
% phase data.
% Loading DATA matrix of size <no. points x no.trials x no. channels>
% Locate the folder containing this script (EEG_phase_example.m) in order to load the data:
scriptDir = fileparts(which('EEG_phase_example'));
temp = load(fullfile(scriptDir, 'data', 'EEG_data_phase'));
data = temp.data;
clear temp;
[time, Nt, Nch] = size(data);
% Setting information theoretic analysis options:
opts.nt     = Nt;
opts.method = 'dr';
opts.bias   = 'qe';
K=100; %number of different cases we want to investigate
HR = zeros(Nch, K);
HRS = zeros(Nch, K);
%We put the response matrix in the form Channels x trials x stimulus
R=permute(data,[3 2 1]);
for ch=1:Nch
Rch = R(ch,:,:);
Rch_new=zeros(K,Nt,time); %preallocate for the K binned versions of this channel
for k=1:K
Rch_new(k,:,:)= binr(Rch, Nt, k+1, 'eqspace', [-pi pi]);
[HR(ch,k),HRS(ch,k)] = entropy(Rch_new(k,:,:), opts, 'HR', 'HRS');
end
end
I=zeros(Nch,K);%mutual information variable
for k=1:K
I(:,k)=HR(:,k)-HRS(:,k);
end
plot(2:K+1,I(60,:),'.');%plotting only the most informative channel which
%is the channel 60.
title('mutual information of one channel as a function of the number of phase bins','fontsize',20);
xlabel('number of bins','fontsize',20);
ylabel('mutual information in bits','fontsize',20);
Correction technique applied to a confusion-matrix
%Computing the MI values with bias correction techniques from the
%consonants-confusion matrix provided in the article:
%"Perceptual Confusions among Consonants, Revisited - Cross-Spectral
%Integration of Phonetic-Feature Information and Consonant Recognition",
%from (Christiansen & Greenberg, 2012)
%for the 1500Hz slit we have for consonants:
C_cons=[7 11 2 3 4 2 1 1 3 1 1
4 15 2 1 5 0 2 2 3 1 1
7 7 12 3 1 1 1 0 3 0 1
1 0 0 14 11 3 0 0 3 3 1
0 0 1 5 24 5 0 0 0 0 1
1 0 2 9 9 7 0 1 3 0 4
1 3 1 0 1 0 13 3 1 6 7
1 7 0 1 1 0 6 9 3 2 6
0 2 2 0 2 3 0 3 16 4 4
0 0 0 2 1 1 0 1 0 24 7
0 0 0 1 2 3 1 0 0 11 18];
%probability distribution of the stimuli:
P_stim_cons=1/11*ones(1,11);
%conditional probabilities:
C_cons_cond=C_cons/36;
%probability distribution of the responses:
P_resp_cons=sum(C_cons_cond)/11;
%joint probability distribution
P_joint_cons=reshape(C_cons/396,1,121);
%Naive computation of the "normalized" mutual information
%(normalized by log2(11) ≈ 3.46 bits, the entropy of the 11 equiprobable stimuli)
MI_naive=MI(P_resp_cons,P_stim_cons,P_joint_cons)/3.46;
%%%%%%%%%%% BIAS CORRECTION TECHNIQUES USING TOOLBOX
%creation of the response matrix for input of the toolbox:
%We artificially put the response matrix in the format:
%channel-by-trials-by-stimulus
Response=zeros(1,36,11);
h=1; %counter for filling the 2nd (trial) dimension of the array Response
%h wraps back to 1 after 36 fills, i.e. exactly once per stimulus, since each row of C_cons sums to 36
for i=1:11 %stimulus index
for j=1:11 %response index
count=C_cons(i,j);
while count>0
Response(1,h,i)=j;
count= count -1;
if h<36
h=h+1;
else
h=1; %resetting
end
end
end
end
%toolbox call
opts.method='dr';
opts.bias='qe';
opts.nt=36;
MI_bias_corr=information(Response,opts,'I')/3.46;
%bar plot
bar([MI_naive, MI_bias_corr]);
title('comparison between two calculation techniques','fontsize',20);
set(gca,'XTickLabel',{'naive','quadratic extrapolation'},'fontsize',15);