Applications of Hidden Markov Models
to
Single Molecule and Ensemble Data Analysis
by
Lorin S. Milescu
03/27/03
A thesis submitted to the
Faculty of the Graduate School of State
University of New York at Buffalo
in partial fulfillment of the requirements for the
degree of
Doctor of Philosophy
Department of Physiology and Biophysics
Copyright by
Lorin S. Milescu
2003
Acknowledgements
Table of Contents
1. Introduction .................................................................................................................................. 8
2. Background ................................................................................................................................ 15
2.1 Dynamic state space models ................................................................................................ 16
2.1.1 Hidden Markov model ................................................................................................. 18
2.1.2 Linear Gaussian model (LGM) .................................................................................... 29
2.2 Bayesian inference............................................................................................................... 31
2.2.1 Likelihood calculation .................................................................................................. 33
2.2.2 State inference .............................................................................................................. 36
2.2.3 Parameter estimation .................................................................................................... 36
2.3 Basic algorithms .................................................................................................................. 38
2.3.1. Likelihood calculation ................................................................................................. 39
2.3.1.1 Forward-Backward procedure for HMM .............................................................. 39
2.3.2. Most likely sequence of states ..................................................................................... 42
2.3.2.1 Viterbi algorithm for HMM .................................................................................. 42
2.3.2.2 Forward-Viterbi algorithm for aggregated HMM ................................................ 44
2.3.2.3 Kalman filter-smoother for LGM ......................................................................... 45
2.3.3 Parameter estimation .................................................................................................... 47
2.3.3.1 Expectation-Maximization .................................................................................... 49
2.3.3.2 Baum-Welch algorithm for HMM ........................................................................ 51
2.3.3.3 Segmental k-means algorithm for HMM .............................................................. 53
2.3.3.4 Expectation-maximization algorithm for linear Gaussian models ........................ 55
3. A maximum likelihood method for kinetic parameter estimation from single channel data ..... 56
3.1 Algorithm ............................................................................................................................ 60
3.1.1 Parameterization and constraints .................................................................................. 60
3.1.2 Likelihood function ...................................................................................................... 63
3.1.3 Computation and derivatives of the likelihood function .............................................. 66
3.1.4 Speeding up the computation ....................................................................................... 75
3.2 Results ................................................................................................................................. 77
3.3 Discussion............................................................................................................................ 78
4. Signal detection and baseline correction of single channel records ........................................... 79
4.1 Hidden Markov models with stochastic parameters or with mixed states ........................... 80
4.2 Baseline linear Gaussian model ........................................................................................... 81
4.3 Algorithms for baseline correction ...................................................................................... 88
4.3.1 Viterbi-Kalman algorithm ............................................................................................ 92
4.3.2. Forward-Backward-Kalman algorithm ....................................................................... 99
4.3.2.1 Initialization ........................................................................................................ 102
4.3.2.2 Expectation ......................................................................................................... 105
4.3.2.3 Maximization ...................................................................................................... 105
4.4 Results ............................................................................................................................... 106
4.4.1 Idealization of simulated data without baseline artifacts ........................................... 111
4.4.2 Idealization and baseline correction of simulated data with stochastic baseline
fluctuations .......................................................................................................................... 119
4.4.3 Idealization and baseline correction of simulated data contaminated with sinewave
artifacts ................................................................................................................................ 137
4.4.4 Idealization and baseline correction of experimental data ......................................... 158
4.5 Discussion.......................................................................................................................... 163
5. Infinite state-space hidden Markov models. Application to single molecule fluorescence data
from kinesin ................................................................................................................................. 166
5.1 Infinite state-space hidden Markov model ........................................................................ 170
5.2 Kinesin stepping model ..................................................................................................... 175
5.3 Algorithms for staircase data ............................................................................................. 175
5.3.1 Position estimation – data idealization ....................................................................... 176
5.3.1.1 Modified Forward-Backward-Kalman procedure for staircase data ................... 178
5.3.1.2 Re-estimation ...................................................................................................... 182
5.3.2 Kinetic parameters estimation .................................................................................... 185
5.3.2.1 Parameterization and constraints ........................................................................ 186
5.3.2.2 Likelihood function............................................................................................. 187
5.3.2.3 Computation and derivatives of the likelihood function ..................................... 189
5.3.2.4 Speeding up the computation .............................................................................. 190
5.4 Results ............................................................................................................................... 191
5.4.1 Kinetics estimation of simulated data ........................................................................ 192
5.4.1.1 Bias of the estimate ............................................................................................. 193
5.4.1.2 Precision of the estimate ..................................................................................... 195
5.4.1.3 Effect of jump truncation .................................................................................... 195
5.4.1.4 Effect of A matrix truncation order .................................................................... 197
5.4.1.5 Effect of backward rate ....................................................................................... 198
5.4.1.6 Log-likelihood surface ........................................................................................ 199
5.4.1.7 Reversible model ................................................................................................ 200
5.4.2 Data idealization......................................................................................................... 202
5.4.2.1 Idealization of simulated data ............................................................................. 203
5.4.2.2 Idealization of experimental data ........................................................................ 226
5.5 Discussion.......................................................................................................................... 245
6. Ensemble hidden Markov models. Applications to kinetic analysis of whole-cell recordings 247
6.1 Ensemble hidden Markov model ....................................................................................... 248
6.1.1 Conductance properties of an ensemble of ion channels............................................ 249
6.1.2 State fluctuations of an ensemble of ion channels ..................................................... 251
6.2 Algorithm for ensemble hidden Markov model ................................................................ 252
6.2.1 Likelihood function .................................................................................................... 253
6.2.2 Parameterization and constraints ................................................................................ 254
6.2.3 Computation and derivatives of the likelihood function ............................................ 255
6.3 Results ............................................................................................................................... 257
6.3.1 Bias of kinetics estimation ......................................................................................... 257
6.3.2 Simultaneous estimation of kinetics and channel count ............................................. 259
6.3.3 Effect of conductance parameters .............................................................................. 260
6.3.4 Effect of channel count .............................................................................................. 260
6.3.5 Conditioning-test experiments and estimation of starting probabilities ..................... 261
6.3.6 Estimating channel count ........................................................................................... 263
6.4 Discussion.......................................................................................................................... 278
References .................................................................................................................................... 279
1. Introduction
Recording data from individual molecules started in the late 1970s, with the advent of the
single channel patch clamp technique (Sakmann and Neher, 1992, 1995). Ionic fluxes through
individual ion channel molecules in a cellular membrane are recorded as currents in the pA range.
The development of the technique was accompanied by the development of stochastic models
(Colquhoun and Hawkes, 1977, 1981; Ball and Rice, 1992 and references therein; Colquhoun and
Hawkes, 1995 and references therein) and statistical methods of analysis (Horn and Lange, 1983;
Colquhoun and Sigworth, 1995b and references therein). Starting in the late 1980s, aggregated
hidden Markov models were applied to single channel data analysis (Chung et al, 1990, 1991). In
the following years, Markov model-based algorithms became the state-of-the-art analysis
tools of single ion channel research (Albertsen and Hansen, 1994; Ball et al, 1999; Colquhoun et
al, 1996; De Gunst et al, 2000; Jackson, 1997; Klein et al, 1997; Michalek and Timmer, 1999;
Qin et al, 1996, 1997, 2000a, b; Venkataramaran et al, 1998a, b, 2000; Wagner et al, 1999).
A Markov model (Elliott et al, 1995) describes a dynamic process characterized by a
finite number of discrete states. Only instantaneous transitions between states are possible, with
the probability of transition proportional to a rate constant. When the state of the process cannot
be observed or measured directly, but through another noisy variable, the Markov model is called
hidden (Rabiner and Juang, 1986; Juang and Rabiner, 1991). States with identical observation
properties are called aggregated. The lifetime of a state has an exponential probability
distribution, while the lifetime of a group of aggregated states has a multi-exponential distribution
(Ball and Rice, 1989; Fredkin and Rice, 1986; Colquhoun and Hawkes, 1987, 1995; Hawkes et
al, 1990).
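The exponential lifetime property can be made concrete with a minimal simulation sketch (not from the original text; the two-state model and its rates are hypothetical):

```python
import random

# Hypothetical two-state channel, closed <-> open, with exit rates in 1/s.
k_open, k_close = 100.0, 50.0   # closed->open and open->closed rate constants

random.seed(0)
# Dwell times in a state are exponentially distributed with the exit rate;
# here the open state has the single exit rate k_close.
open_dwells = [random.expovariate(k_close) for _ in range(100_000)]

mean_dwell = sum(open_dwells) / len(open_dwells)
print(f"mean open dwell: {mean_dwell * 1e3:.2f} ms "
      f"(theory: {1e3 / k_close:.2f} ms)")
```

The mean lifetime of a state equals the reciprocal of the sum of its exit rates; for a group of aggregated states, the simulated dwell-time histogram would instead show a mixture of exponentials.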
As a representation of the physical world, the Markov process is continuous in time but,
of necessity, observed in the discrete time domain. Typically, an analog-to-digital converter
samples a (noisy) observation of the Markov state at regular time intervals. What occurs between
two sampling points is unknown but can be statistically inferred. Due to the exponential nature of
the lifetime distribution, unobserved state transitions will occur between two sampling points.
The reconstitution of a continuous sequence of dwells from a discrete Markov signal is thus
plagued by the “missed events” problem (Ball and Sansom, 1988; Ball et al, 1993; Hawkes et al,
1990; Roux and Sauve, 1985).
A discrete time first-order Markov model is memory-less: the state at any time t is a
stochastic function of only the previous state. Due to the finite sampling interval, any state at any
time t can be followed by any other state at time t+1, where, for convenience, time is normalized
by the sampling interval Δt. Therefore, in the discrete time domain, transitions between any two
states are allowed within the sampling interval, while in continuous time only some transitions
are permitted. The continuous-time state transitions are quantified by the generator matrix,
denoted Q (Colquhoun and Hawkes, 1995), while the discrete-time state transitions are
quantified by the transition probability matrix, denoted A (Colquhoun and Sigworth, 1995a).
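The two matrices are related by the matrix exponential, A = exp(QΔt). The sketch below illustrates this for a hypothetical two-state model; the truncated Taylor series is an illustration only, and real analysis code would use a library routine such as scipy.linalg.expm:

```python
def expm_series(Q, dt, terms=40):
    """A = exp(Q*dt) via a truncated Taylor series; adequate for the
    small, well-scaled Q used in this illustration."""
    n = len(Q)
    M = [[Q[i][j] * dt for j in range(n)] for i in range(n)]
    A = [[float(i == j) for j in range(n)] for i in range(n)]   # identity
    term = [row[:] for row in A]
    for k in range(1, terms):
        # term <- term @ M / k accumulates the k-th Taylor term of exp(M)
        term = [[sum(term[i][m] * M[m][j] for m in range(n)) / k
                 for j in range(n)] for i in range(n)]
        A = [[A[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return A

# Hypothetical two-state generator matrix Q (rates in 1/s), 1 ms sampling.
k12, k21 = 100.0, 50.0
Q = [[-k12, k12], [k21, -k21]]
A = expm_series(Q, dt=1e-3)
print(A)   # each row sums to 1, as a transition probability matrix must
```

Note that even when Q forbids a direct transition (a zero off-diagonal rate), the corresponding entry of A is generally nonzero, because multiple transitions can occur within one sampling interval.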
It is easy to see how hidden, aggregated Markov models are an accurate description of
ion channel molecules. The continuum of conformational states the molecule may assume can be
reduced to a finite and usually small number of long-lived states. These states live long enough to
be detected with the available instrumentation, which is inherently limited in terms of bandwidth.
Historically, ion channels were modeled with two states, closed and open, but as instrumental
resolution was improved, more complex Markov models became necessary. For example,
modeling the acetylcholine receptor kinetics must account for closed and open states with zero,
one or two ligand molecules bound, and for desensitized states. All the closed states are
aggregated into a class of closed conductances and all the open states are aggregated into a class
of open conductances. The presence of visibly conducting substates required additional classes.
Many structure/function properties of the ion channel molecule can be inferred from the
single channel current by statistical analysis. Such analysis is generally conducted with the
aims of estimating conductance and kinetics. The conductance parameters describe the
characteristics of the current in each conductance class. Typically, the current is assumed to have
a Gaussian distribution with non-correlated noise, and a stationary mean and standard deviation.
The probability of an instantaneous transition between two kinetic states is quantified by a
macroscopic rate constant. The rate constant is a function of two intrinsic factors and of other
physical quantities (such as ligand concentration, transmembrane voltage, pressure, etc.) that act
as independent variables. The estimation of the intrinsic factors is the goal of kinetic analysis.
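One common functional form for such a rate constant (the constants below are hypothetical, chosen only to illustrate the separation between intrinsic factors and experimental variables):

```python
import math

def rate(k0, k1, ligand=1.0, voltage=0.0):
    """One common parameterization: k = k0 * [L] * exp(k1 * V), where the
    intrinsic factors k0 and k1 are estimated by kinetic analysis, and the
    ligand concentration [L] and voltage V are the independent variables."""
    return k0 * ligand * math.exp(k1 * voltage)

# Hypothetical ligand-binding rate: k0 = 1e6 /M/s, voltage-insensitive
# (k1 = 0), evaluated at 10 uM ligand concentration.
print(rate(1e6, 0.0, ligand=10e-6))   # roughly 10 events per second
```

The same intrinsic factors then predict the rate under any other experimental condition, which is what allows fitting across data sets recorded at different concentrations or voltages.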
In recent years, investigation of other single molecules has become possible. One example is
the class of motor proteins, including kinesin and myosin (Vale and Milligan, 2000). A
motor protein converts chemical energy into mechanical energy, by hydrolyzing ATP. The
mechanical energy is used for moving (usually in discrete steps) on a specific substrate, and
carrying an attached molecular load. Using fluorescence experiments, the position of individual
motor proteins can now be measured with accuracy, in the nanometer range (Selvin, 2002). In the
sense of allosteric behavior, a motor protein is no different from an ion channel, and can be
described with a hidden Markov model. Therefore, it seems natural to apply the same or
modified versions of the Markov model-based algorithms originally developed for ion channels
to the analysis of other single molecule experiments. With molecular motors, the recorded quantity is
either the position of the molecule or the force required to maintain its position. If no external
force is applied, each time the motor takes a step, the observed position changes. Thus, a
recording of molecular motor steps has a staircase appearance. In addition to conformation, the
state of the molecule must be augmented with the position on the substrate. The position takes
discrete values but the number of possible values can be large, sometimes essentially infinite if
the substrate is long. In principle, this requires a Markov model with a large number of states, but
since the chemistry of each step is the same, the model is reducible. Markov algorithms
developed for ion channel analysis can be modified to be applied to staircase data.
Often, the membrane patch in a single channel experiment contains several ion channels.
The superposition of an ensemble of ion channels can still be modeled with a Markov model
(Horn and Lange, 1983; Albertsen and Hansen, 1994; Qin et al, 1996; Klein et al, 1997;
Venkataramaran, 1998a). The superposition model has (many) more states, termed meta-states,
but the same number of conductance and kinetics parameters. While the number of free
parameters is the same, the degree of uncertainty in the data grows with the number of channels,
resulting in estimates of lower precision. Furthermore, the computation takes longer, depending
exponentially on the number of meta-states. In whole-cell experiments the number of channels
contributing to the measured signal is very large, on the order of thousands, tens of thousands, or
millions. Markov modeling by means of a superposition model is certainly possible, but
unrealistic. Thus, the traditional method of extracting kinetic parameters from whole-cell
recordings is the fitting of exponentials to mean currents.
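The growth of the meta-state count can be sketched with the standard multiset count for indistinguishable channels (the 3-state channel below is hypothetical):

```python
from math import comb

def meta_states(n_channels, n_states):
    """Number of distinguishable meta-states for n identical, independent
    channels, each with n_states Markov states: the multiset count
    C(n + s - 1, s - 1), i.e. the ways to distribute n channels among s states."""
    return comb(n_channels + n_states - 1, n_states - 1)

# A hypothetical 3-state channel:
for n in (1, 2, 5, 10, 100):
    print(n, "channels ->", meta_states(n, 3), "meta-states")
```

Even for a modest 3-state channel, 100 channels already give thousands of meta-states, which is why a literal superposition model becomes unrealistic at whole-cell channel counts.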
The present work extends the application of Markov models from single ion channels to
other single molecules and to ensembles of molecules. My perspective is essentially practical, and
all theoretical developments have been translated into algorithms and implemented in software.
Experimental data were not available for some approaches, and thus part of this work utilizes
computer simulated data. In the rest of this introductory chapter, I present my original
contributions to the field.
I have modified an existing method for maximum likelihood estimation of rate constants
from single channel data (Qin et al, 2000a). I call the new algorithm MIP (Maximum Idealized Point
likelihood). The modification extends the applicability of the method to non-stationary data. For
non-stationary (non-equilibrium) data, the starting probabilities are calculated from the current
estimate of the rate constants as the equilibrium probabilities under the conditioning stimulus. The
computation of the likelihood function is modified to accommodate as many sets of rate constants
as there are experimental conditions affecting the data. The derivatives of the likelihood function
are calculated analytically, as in the original version (Qin et al, 2000a). Other features of the
original method are preserved, such as the possibility of imposing linear constraints on the
estimation of rate constants. While the original algorithm works with raw data, MIP can also
analyze idealized data. By pre-calculation of certain quantities, the method is accelerated and
approaches the speed performance of the maximum interval likelihood method (Qin et al, 1996,
1997).
Data idealization is the procedure by which the underlying Markov signal is extracted
from the background noise. Along with the traditional half-amplitude threshold method, several
heuristic or hidden Markov algorithms exist (Chung et al, 1990, 1991; Schultze and Draber,
1994). I have combined the Forward-Backward procedure (Baum et al, 1970) and the Viterbi
algorithm (Viterbi, 1967; Forney, 1973) into a new method for idealization of data generated by
aggregated Markov models. The new procedure, termed Forward-Viterbi, finds the most likely
sequence of conductance classes. In comparison, Viterbi finds the most likely sequence of states.
Single channel records are often contaminated with baseline artifacts. Drift of the zero-level current due to cellular or instrumental causes, interference from the power line resulting in
60 Hz and harmonic contamination, or uncharacterized random fluctuations of the baseline level
may corrupt the experimental data. Data cannot be analyzed if the baseline uncertainty is too
large. While slow and constant drift can be corrected manually or with software (Sachs et al,
1982; Baumgartner et al, 1998; Chung et al, 1991), random fluctuations cannot. To improve
baseline tracking, I have modeled baseline fluctuations with a linear Gaussian model, also known
as the Kalman filter (Kalman, 1960). I have designed and implemented two maximum likelihood
methods that perform idealization with baseline correction for single channel data. They are based
on coupling the Kalman filter with either the Viterbi algorithm or the Forward-Backward
procedure. The resulting methods are optimal for genuinely stochastic baseline fluctuations but
also work well with deterministic drift or sinusoidal interference.
I have adapted hidden Markov model algorithms to staircase-style data. I replaced the
infinite state hidden Markov model with a truncated model reflecting the chemistry of each step.
The degree of truncation depends on the data. The discrete levels present are detected with a
modification of the baseline correction algorithm. The new idealization method deals with
wideband noise and baseline drift, and steps that need not have identical amplitude. A modified
version of the MIP algorithm estimates rate constants from the idealized data. Use of a truncated,
finite state model for kinetics estimation is complemented by a modification in the likelihood
function, by which the vector of state occupation probabilities can be reset after each data point.
All the features of the original algorithm are preserved, such as imposition of linear constraints on
rate constants, fitting across different experimental conditions, etc. To my knowledge, this is the
first time an infinite state-space Markov model with truncation has been applied to the study
of single molecules.
I have extended application of hidden Markov models to the kinetic analysis of ensemble
data, such as obtained in whole-cell recordings. Instead of fitting exponentials to extract time
constants, I develop a method for direct estimation of the rate constants themselves. More importantly, I use
a likelihood function that takes into account the sources of variance in data: fluctuations in the
number of channels occupying each state and the excess state noise. While the state noise is a
convolution of Gaussians, and therefore itself Gaussian, the channel fluctuations are described by
a multinomial probability distribution. Since an exact convolution of the two is not practically
feasible, I approximate the multinomial with a Gaussian, and obtain the convolution simply as
another Gaussian. The fluctuation (the variance) of the data around the conditional mean is a function of the
number of channels in the patch. The benefit of including the variance in the likelihood function
is that it makes possible a simultaneous estimation of rate constants and of the number of active
channels in a patch (or cell). This, to my knowledge, is a novel model-based procedure. The only
prerequisite for analysis remains, as expected, knowledge of the individual mean channel current
and its excess state noise, although detailed knowledge of the latter proves to be not critical. As a
matter of principle, to permit rate constants to be estimated, the data must be non-stationary. The
ensemble algorithm deals with non-stationarity in the same way as MIP. Importantly, since the
ensemble data has fewer details than single channel data, the number of free parameters in the
model must be reduced by constraints, and this has been implemented as with MIP and other
single-channel algorithms.
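A minimal sketch of the variance bookkeeping described above, for a hypothetical two-state channel (so the multinomial reduces to a binomial); the uniform per-channel excess-noise term is a simplification of my own, not a detail from the text:

```python
import math

def ensemble_moments(n_channels, p_open, i_unit, sigma_state=0.0):
    """Gaussian approximation of the ensemble current: the binomial count
    of open channels is replaced by a Gaussian with matched mean and
    variance, and a per-channel excess state noise (s.d. sigma_state,
    a simplifying assumption here) adds to the variance."""
    mean = n_channels * p_open * i_unit
    var = (n_channels * p_open * (1.0 - p_open) * i_unit ** 2
           + n_channels * sigma_state ** 2)
    return mean, var

# 10,000 hypothetical channels of 1 pA unit current, open probability 0.3:
mean, var = ensemble_moments(10_000, 0.3, 1.0e-12)
print(mean, math.sqrt(var))   # mean current (A) and its standard deviation
```

Because the variance term scales with the channel count, a likelihood built on these moments carries information about the number of active channels, which is what makes the simultaneous estimation possible.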
The following chapters are organized as follows: Chapter 2 gives a background on
dynamic state space models, Bayesian inference and standard algorithms; Chapter 3 describes the
MIP algorithm and introduces some concepts used later; Chapter 4 presents the procedure for
idealization and baseline correction of single channel data; Chapter 5 describes the new methods
for single molecule staircase data analysis; Chapter 6 presents the hidden Markov model
algorithm for kinetic modeling from ensemble data. The reader should take note that concepts
introduced in earlier chapters are not repeated. Therefore, chapters are best read in order.
Throughout this text, I use bold notation for vectors, matrices, sequences, sets and
symbols denoting other complex objects, such as parameters, optimized variables, etc. Matrices
and vectors are generally written in uppercase. Scalars and elements of vectors and matrices are
written in bold italic in text and plain italic in mathematical expressions, and generally in
lowercase.
2. Background
In this section I discuss the mathematical and algorithmic prerequisites. I start by
introducing dynamic state space models, of which the hidden Markov model (HMM) is an
important instance (Rabiner and Juang, 1986). I use the hidden Markov modeling framework to
derive new methods for direct estimation of rate constants from single channel data (Chapter 3),
single molecule fluorescence data (Chapter 5) and ensemble ion channel data (Chapter 6), and for
signal detection and baseline correction of single channel and single molecule recordings
(Chapters 4 and 5). I present the linear Gaussian model (LGM), also known as the Kalman filter
(Kalman, 1960) and apply it to single channel and single molecule data analysis. I combine the
linear Gaussian with the hidden Markov model into new algorithms for signal detection,
parameter re-estimation and baseline correction of single channel and single molecule records.
Next, I review briefly some concepts in Bayesian statistics (Bayes, 1763) pertinent to
dynamic state space models. The computation of the likelihood of a data set generated by a
dynamic state space model is presented in the light of Bayesian estimation (West and Harrison,
1997). I review two of the most important problems of dynamic state space models: state
inference or signal detection, and maximum likelihood and maximum a posteriori parameter
estimation. Finally, I present the dynamic programming algorithms used to solve these problems,
namely the Forward-Backward procedure (Baum et al, 1970), the Viterbi algorithm (Viterbi,
1967; Forney, 1973), the Kalman filter-smoother (Kalman, 1960; Rauch, 1963), and the
Expectation-Maximization algorithm (Dempster et al, 1977) and two of its instances, Baum-Welch (Baum et al, 1970) and Segmental k-means (Rabiner et al, 1986). I give a short description
of a new state inference algorithm for aggregated Markov models, which I call Forward-Viterbi.
2.1 Dynamic state space models
A dynamic state space model describes the time progression of a state variable X. The
state may not be directly observed, in which case it is termed hidden. Instead, a measurement
variable Y, which is a function of X, is observed. The state is hidden if a one-to-one mapping
between the state and observation does not exist. I will discuss the concept further in the next
section on Markov models. It suffices to say here that, when Y is a probabilistic function of X,
different states may give the same observation, with identical or different probability
distributions.
Models with both hidden and observed variables fall into the incomplete or missing data
category (Little and Rubin, 1987). X is the hidden or missing data and Y is the incomplete data.
Together, X and Y form the complete data. Parameter estimation from incomplete data could be
very difficult or even impossible but it is usually straightforward when the hidden state is known
or statistically inferred. X and Y take continuous or discrete values. In some cases, the variables
are mixed, with continuous and discrete values. The state and the measurement variables are
represented in continuous or discrete time. I focus the discussion on discrete time models.
A discrete time dynamic state space model is described by two equations. The process
equation gives the discrete time evolution of the state Xt at time t as a deterministic function F of
the previous states and previous measurements. A process random variable εt, characterized by a
probability density/mass function, adds stochastic innovation to the deterministic function F:

\[ X_t = F(X_0, X_1, \ldots, X_{t-1}, Y_0, Y_1, \ldots, Y_{t-1}) + \varepsilon_{t-1} \tag{2.1.1} \]
The measurement equation gives the measurement Yt at time t as a deterministic function
G of the current and previous states and of previous measurements. A measurement random
variable ηt, characterized by a probability density/mass function, adds noise to the measurement:

\[ Y_t = G(X_0, X_1, \ldots, X_t, Y_0, Y_1, \ldots, Y_{t-1}) + \eta_t \tag{2.1.2} \]
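As a minimal numerical instance of these two equations, consider a scalar, first-order case in which F and G depend only on the most recent state (a linear Gaussian example; the coefficients and noise levels below are hypothetical):

```python
import random

# A scalar, first-order instance of the process and measurement equations:
# X_t = a*X_{t-1} + eps_t, Y_t = c*X_t + eta_t, with Gaussian noises.
a, c = 0.99, 1.0              # process and measurement coefficients
sigma_e, sigma_n = 0.1, 0.5   # process and measurement noise s.d.

random.seed(1)
x = 0.0
states, observations = [], []
for t in range(200):
    x = a * x + random.gauss(0.0, sigma_e)    # process equation
    y = c * x + random.gauss(0.0, sigma_n)    # measurement equation
    states.append(x)
    observations.append(y)

print(len(observations), "observations generated")
```

Only the `observations` list would be available to an analysis algorithm; recovering `states` from it is exactly the state inference problem discussed below.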
Dynamic state space models are powerful representations of physical phenomena, and
enjoy a widespread application in both engineering and science. This is especially true for the
hidden Markov model (HMM) and the linear Gaussian model (LGM).
A useful representation of the conditional structure of dynamic state space models is
through graphical models (Jordan, 1998; Lauritzen, 1996; Whittaker, 1990). Models are
represented by acyclic graphs (Figure 2.1.1): nodes indicate state or measurement variables and
arrows indicate conditional dependencies. The variables could be discrete or continuous, and the
conditional dependencies could be probabilistic or deterministic.
[Figure 2.1.1: two chain graphs, x(t-1) → x(t) → x(t+1) with emissions y(t-1), y(t), y(t+1), one labeled HMM and one labeled LGM.]
Figure [2.1.1]. Graphical representation of two types of dynamic state space models: hidden Markov
model (HMM) and linear Gaussian model (LGM). Circles indicate continuous and squares indicate
discrete states or measurements. Clear symbols represent hidden and shaded symbols represent
observed variables. The observation at time t, yt, is a probabilistic function of the state xt at time t.
The basic difference between the two models is the nature of the states: discrete for HMM and
continuous for the LGM.
2.1.1 Hidden Markov model
A discrete state, discrete time, first order hidden Markov model describes a stochastic,
memory-less process. The state variable X takes any of NS discrete values, say the integers from 1
to NS. NS is the total number of states of the model. The state X is hidden and the process is
observed through an observation variable Y. Y is a continuous or discrete probabilistic function
of X. I consider here the case of a scalar Y. Although the process could be continuous in nature, it
is observed (sampled) at discrete times only. Usually, an analog-to-digital converter records Y,
with a sampling time Δt. The discrete process is represented as a temporal chain of states. The
state at time t+Δt, X_{t+1}, depends solely on the state at time t, X_t.
X is conventionally represented as a vector of dimension NS. When applied to ion
channels (or other single molecules), each conformational state of the channel is mapped into one
of the NS discrete values of X. Y is the measured current, generally assumed to have a Gaussian
probability distribution, with mean and variance dependent on the state X.
The state vector can store the absolute knowledge about the state. In this case, each
element of X is zero, except the element corresponding to the known state, which is one. X can
also represent the belief, or relative knowledge, about each state. Then element i of X is
equal to the probability that the model is in state i. In both situations, the sum of all elements is 1.
The discrete time HMM is parameterized by three parameters (Rabiner, 1989): the initial
state probabilities, the state transition probabilities and the emission probabilities. I denote by Θ
the complete set of parameters:

$$\Theta = \{\boldsymbol{\pi}, \mathbf{A}, \mathbf{B}\} \quad [2.1.1.1]$$
π is the initial state probability vector, of dimension N_S. Each element of π, π_i, is the a
priori probability that the Markov chain starts in state i. With the obvious constraints that each
element is between zero and one and that the sum of all elements equals 1, π is a column vector¹:

$$\pi_i = P(x_0 = i) \quad [2.1.1.2]$$
$$0 \le \pi_i \le 1 \quad [2.1.1.3]$$
$$\sum_{i=1}^{N_S} \pi_i = 1 \quad [2.1.1.4]$$
There might be a priori reasons to assign certain values to each of the initial state probabilities.
For example, in the case of ion channel pulse-protocol experiments, one typically expects the
channel to start in a particular state. Thus, one assigns probability one to that state and zero to
others. In contrast, for equilibrium experiments, the only reasonable assumption is that the initial
probabilities are equal to the equilibrium occupancy probabilities.
A is the transition probability matrix, of dimension N_S × N_S. Each element of A, a_ij, is the
conditional probability that the model is in state j at time t+Δt, if it was in state i at time t. Each
element takes values between zero and one and the elements of each row sum to one:

$$a_{ij} = P(x_{t+1} = j \mid x_t = i) \quad [2.1.1.5]$$
$$0 \le a_{ij} \le 1 \quad [2.1.1.6]$$
$$\sum_{j=1}^{N_S} a_{ij} = 1 \quad [2.1.1.7]$$
¹ A probability vector is termed stochastic. Likewise, a matrix with each row a probability vector
is termed row-stochastic.
A first order HMM has no memory of previous states. Thus, the state at time t+1 depends
solely on the state at time t (and not on previous states). The memory-less property is expressed
as:
Pxt 1 | x0 , x1 ,..., xt   Pxt 1 | xt 
[2.1.1.8]
Note that a higher order Markov model (of order k) does have memory of previous (k) states but
can still be reduced to a first order model by expanding the state space.
B is the vector of emission probability distributions, of dimension NS. Each element of B,
bi, represents the probability density/mass function of observing data y, conditional on state x=i.
For ion channels, the common assumption is that the emission probabilities are Gaussian and that
the noise is non-correlated (white). Non-correlated noise implies that the conditional probability
density of observing data at time t depends on the current state only and not on previous
measurements (nor on previous states):
p yt | x0 , x1 ,..., xt , y0 , y1 ,..., yt 1   p yt | xt 


p yt | xt  i   N i , i2 | yt
[2.1.1.9]
[2.1.1.10]
Noise correlation can be eliminated by expanding the Markov states (Qin et al, 2000b).
The hidden character of a Markov model is conferred first of all by the indirect
observation of state X through the measurement Y. This indirect observation is a probabilistic,
many-to-one mapping. Distinct states may probabilistically give the same observation.
Furthermore, several states may share the same observation probability density, for example
Gaussian distributions with identical mean and variance. These states are termed aggregated. An
aggregated Markov model is hidden even in absence of noise, because distinct states in the same
aggregation class give measurements with identical characteristics. Thus aggregation adds further
uncertainty to the problem of state estimation. For ion channels, the NS conformational states are
partitioned in NC conductance classes, based on recorded current mean and variance. The states
in the same conductance class cannot generally be distinguished, although states with the same
mean amplitude can be distinguished if their variances differ.
Although the state transition probabilities are governed by the A matrix, the primary
quantities of physical interest are the rate constants. These are conventionally expressed as a
N_S × N_S rate matrix Q (also termed the generator matrix). Q has the properties that each
off-diagonal element, q_ij, is the macroscopic rate constant of the direct transition between state i and
state j, and that each diagonal element, q_ii, is the negative sum of the off-diagonal elements of row
i, so that the sum of each row is zero. When there is no direct transition between i and j, q_ij equals
zero:

$$0 \le q_{ij} \quad (i \ne j) \quad [2.1.1.11]$$
$$q_{ii} = -\sum_{j \ne i} q_{ij} \quad [2.1.1.12]$$
I illustrate the significance of Q and A and their relationship for a two state Markov model,
applied to a two state ion channel protein, as follows:
$$C \; \underset{q_{21}}{\overset{q_{12}}{\rightleftharpoons}} \; O \quad [2.1.1.13]$$
The closed and the open conformations of the ion channel are denoted C and O, respectively. The
channel opening and closing rates are q12 and q21, respectively. I use a basic result in probability
theory (see, for example, Cox and Miller, 1965): if the probability of a discrete event A depends
on a second discrete event B, then P(A) is obtained by marginalization with respect to B:
P A   P A, B 
[2.1.1.14]
B
P A   P A | B   PB 
[2.1.1.15]
B
Suppose B and A are the state probabilities at time t and t+Δt, respectively, each with two possible
values: closed and open. Then A is clearly a function of B. Thus the probabilities associated with
each possible state at time t+Δt are:
PCt t   PCt t | Ct   PCt   PCt t | Ot   POt 
[2.1.1.16]
POt t   POt t | Ct   PCt   POt t | Ot   POt 
[2.1.1.17]
Since at any time, the channel is either in C or in O, the following relations are obvious:
PCt   POt   1
[2.1.1.18]
PCt t   POt t   1
[2.1.1.19]
PCt  t | Ct   1  POt  t | Ct 
[2.1.1.20]
POt  t | Ot   1  PCt  t | Ot 
[2.1.1.21]
22
The expected number of transitions from state C into state O, and from state O into state
C, in a unit of time, is equal to q12 and q21, respectively. Hence, the expected number of
transitions within a time interval Δt is equal to q12·Δt and q21·Δt, respectively. q12 and q21 have
dimension of frequency (s⁻¹), while q12·Δt and q21·Δt are dimensionless. Δt can be chosen small
enough as to make q12·Δt and q21·Δt smaller than one. In the limiting case, when Δt→0,
q12·Δt and q21·Δt are, respectively, the conditional probabilities that the channel makes a transition
from C to O within Δt, given that it was in C at time t, and likewise from O to C, given that it
was in O:
POt t | Ct   q12  t
[2.1.1.22]
PCt t | Ot   q21  t
[2.1.1.23]
After substitution in [2.1.1.16] and [2.1.1.17], the state probabilities at t+t are given by:
PCt  t   PCt   q12  t  PCt   q21  t  POt 
[2.1.1.24]
POt  t   POt   q12  t  PCt   q21  t  POt 
[2.1.1.25]
The terms in [2.1.1.24] and [2.1.1.25] can be rearranged and divided by Δt:

$$\frac{P(C_{t+\Delta t}) - P(C_t)}{\Delta t} = -q_{12} \cdot P(C_t) + q_{21} \cdot P(O_t) \quad [2.1.1.26]$$
$$\frac{P(O_{t+\Delta t}) - P(O_t)}{\Delta t} = q_{12} \cdot P(C_t) - q_{21} \cdot P(O_t) \quad [2.1.1.27]$$
As t0, the following system of differential equations results:
23
dPC
 q12  PC  q21  PO
dt
[2.1.1.28]
dPO
 q12  PC  q21  PO
dt
[2.1.1.29]
The system can be written in a more compact, matrix form, using the Q matrix as defined
previously:

$$\frac{d}{dt}\begin{bmatrix} P_C & P_O \end{bmatrix} = \begin{bmatrix} P_C & P_O \end{bmatrix} \cdot \begin{bmatrix} -q_{12} & q_{12} \\ q_{21} & -q_{21} \end{bmatrix} \quad [2.1.1.30]$$

$$\frac{d\mathbf{P}^T}{dt} = \mathbf{P}^T \cdot \mathbf{Q} \quad [2.1.1.31]$$

The superscript T denotes vector transposition (the state vector is in column format). The above
expression is known as the Kolmogorov differential equation, with the Chapman-Kolmogorov
solution:

$$\mathbf{P}_t^T = \mathbf{P}_0^T \cdot \exp(\mathbf{Q} \cdot t) \quad [2.1.1.32]$$
P0 is the initial probabilities vector. The derivation can be extended easily to models with any
number of states. Note that the equations governing the time evolution of a single molecule are
identical to the equations for a large number of molecules, as known from chemical kinetics.
For a discrete time hidden Markov model, the sampling time Δt is substituted for t in
[2.1.1.32], giving the transition probability matrix A as follows:

$$\mathbf{A} = \exp(\mathbf{Q} \cdot \Delta t) \quad [2.1.1.33]$$
The state at any time t+1 can be predicted from the state at time t. More generally, if the state at
any time t0 is known, then the state at any time t>t0 can be predicted:

$$\mathbf{X}_{t+1}^T = \mathbf{X}_t^T \cdot \mathbf{A} \quad [2.1.1.34]$$
$$\mathbf{X}_{t+k}^T = \mathbf{X}_t^T \cdot \mathbf{A}^k \quad [2.1.1.35]$$
$$\mathbf{X}_t^T = \mathbf{X}_{t_0}^T \cdot \exp(\mathbf{Q} \cdot (t - t_0)) \quad [2.1.1.36]$$
The exponential of a matrix M is formally defined with the following power series:

$$\exp(\mathbf{M}) = \sum_{i=0}^{\infty} \frac{\mathbf{M}^i}{i!} \quad [2.1.1.37]$$
If the eigenvalues of the Q matrix are all distinct (most ion channel models have distinct
eigenvalues, one zero and the others negative), the matrix exponential A can be efficiently
obtained from Q by spectral decomposition (Moler and Van Loan, 1978):

$$\mathbf{A} = \sum_{i=1}^{N_S} \mathbf{A}_i \cdot \exp(\lambda_i \cdot \Delta t) \quad [2.1.1.38]$$

The A_i's are the spectral matrices, obtained from eigenvectors, and the λ_i's are the eigenvalues of Q.
Note that, in general, no element of A is zero. Even when a direct transition between
states i and j is not allowed (q_ij = 0), there is a finite probability that within Δt such a transition
occurs (a_ij > 0), using alternate pathways. As Δt approaches zero, the diagonal elements of A
approach 1, whereas the off-diagonal elements approach zero (the exponential of a zero matrix is
the identity matrix):

$$\lim_{\Delta t \to 0} a_{ii} = 1 \quad [2.1.1.39]$$
$$\lim_{\Delta t \to 0} a_{ij} = 0 \quad [2.1.1.40]$$
In contrast, when t approaches infinity, the elements of A approach the equilibrium occupancy
probabilities:
lim aij  Pj
[2.1.1.41]
t 
Process and measurement equations for a hidden Markov model can be written as:

process equation: $$\mathbf{X}_t \sim \mathrm{Multinomial}(\mathbf{X}_{t-1}^T \cdot \mathbf{A}) \quad [2.1.1.42]$$

measurement equation: $$Y_t \sim \mathrm{Gaussian}(\mathbf{X}_t^T \cdot \boldsymbol{\mu}, \; \mathbf{X}_t^T \cdot \mathbf{V}) \quad [2.1.1.43]$$
For a model with two states, the probability distribution of X_t is binomial, with mean equal to
X_{t-1}^T·A. When flipping a coin, the binomial probability of an outcome of heads or tails is constant
and does not depend on the previous flip. In comparison, the probability of state occurrence for a
Markov model does depend on the previous state. For a model with more than two states, the
state probability is described by a multivariate binomial, i.e. the multinomial distribution. μ and V
are vectors of dimension N_S. Each element of μ and V, μ_i and V_i, is equal to the mean and the
variance of the Gaussian emission probability distribution for state i, respectively.
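Equations [2.1.1.42] and [2.1.1.43] translate directly into a simulation loop. A minimal sketch follows; the function name and the example parameters are illustrative, not taken from the thesis data:

```python
import numpy as np

def simulate_hmm(pi, A, mu, sigma, T, rng):
    """Draw a state path from the process equation [2.1.1.42] and
    Gaussian observations from the measurement equation [2.1.1.43]."""
    x = np.empty(T, dtype=int)
    x[0] = rng.choice(len(pi), p=pi)
    for t in range(1, T):
        # multinomial draw with parameter given by row x[t-1] of A
        x[t] = rng.choice(A.shape[0], p=A[x[t - 1]])
    y = rng.normal(mu[x], sigma[x])  # state-dependent mean and standard deviation
    return x, y
```

For a two-state channel one would pass, for example, pi = (1, 0) to start in the closed state, a row-stochastic A obtained from Q, and per-class current means mu = (0, 1).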
The reader is reminded here of the significance of the multinomial distribution for discrete
state models. The multinomial is parameterized by a probability vector Φ. Each element of Φ, Φ_i, is
equal to the probability of occurrence for discrete outcome i, out of N_S possible outcomes. For
example, the probabilities that a two state ion channel, at equilibrium, is closed (C) or open (O)
are equal to:

$$P(C_{eq} \mid \mathbf{P}_{eq}) = P_{eq,C} \quad [2.1.1.44]$$
$$P(O_{eq} \mid \mathbf{P}_{eq}) = P_{eq,O} = 1 - P_{eq,C} \quad [2.1.1.45]$$
The equilibrium state occupancy vector, P_eq, plays the role of Φ. The multinomial distribution can
be applied to ensemble models too. For example, for a patch with N_C two state, independent
channels at equilibrium, the probability that n_C channels are closed and n_O channels are open is
equal to:

$$P(n_{C,eq}, n_{O,eq} \mid \mathbf{P}_{eq}) = \frac{N_C!}{n_C! \, n_O!} \, P_{eq,C}^{\,n_C} \, P_{eq,O}^{\,n_O} \quad [2.1.1.46]$$
For the general case, when N_S discrete outcomes are possible, the multinomial
distribution gives the occurrence probability of a set of counts (n_1, n_2, …, n_{N_S}), in N_C independent
trials, as follows:

$$P(n_1, n_2, \ldots, n_{N_S} \mid \boldsymbol{\Phi}) = \frac{N_C!}{\prod_{i=1}^{N_S} n_i!} \prod_{i=1}^{N_S} \Phi_i^{\,n_i} \quad [2.1.1.47]$$

The fraction in the right-hand side of the expression above is known as the multinomial
coefficient.
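Expression [2.1.1.47] can be checked numerically; a minimal sketch (the function name is illustrative):

```python
from math import factorial, prod

def multinomial_pmf(counts, phi):
    """Probability of a set of counts in sum(counts) independent trials
    (eq. 2.1.1.47): the multinomial coefficient times prod(phi_i ** n_i)."""
    coef = factorial(sum(counts)) // prod(factorial(n) for n in counts)
    return coef * prod(p ** n for n, p in zip(counts, phi))
```

For the patch example [2.1.1.46] with N_C = 3 independent two-state channels and P_eq = (0.5, 0.5), the probability of 2 closed and 1 open channels is 3 · 0.5³ = 0.375.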
In the examples above, P_eq can be replaced with a conditional state probabilities vector.
Thus, given the state probabilities vector at time t, P_t, the state at t+1 is described by a
multinomial with parameter P_t^T·A. Thus, the conditional probability that the channel is in state i at
time t+1, given P_t, is equal to:

$$P(x_{t+1} = i \mid \mathbf{P}_t) = \left(\mathbf{P}_t^T \cdot \mathbf{A}\right)_i \quad [2.1.1.48]$$
The univariate Gaussian probability distribution is parameterized by the mean μ and the
standard deviation σ (the square root of the variance), as follows:

$$N(\mu, \sigma^2 \mid x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right) \quad [2.1.1.49]$$
The multivariate Gaussian probability distribution is parameterized by the mean vector μ and the
covariance matrix V, as follows:

$$N(\boldsymbol{\mu}, \mathbf{V} \mid \mathbf{x}) = \frac{1}{\sqrt{\left|2\pi\mathbf{V}\right|}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \cdot \mathbf{V}^{-1} \cdot (\mathbf{x}-\boldsymbol{\mu})\right) \quad [2.1.1.50]$$

The diagonal elements of V, V_ii, represent the variance of each dimension of the Gaussian. The
off-diagonal elements of V, V_ij, represent the covariance between dimensions i and j of the
Gaussian.
2.1.2 Linear Gaussian model (LGM)
The state X of a discrete time linear Gaussian model (LGM) is a vector of dimension d_x,
indirectly observed through the noisy observation vector Y, of dimension d_y. X and Y take
continuous values. A linear Gaussian model is completely specified by the set of parameters Θ
and by the process and measurement equations (Roweis and Ghahramani, 1999):

$$\Theta = \{\mathbf{x}_0, \mathbf{V}_0, \mathbf{A}, \mathbf{C}, \mathbf{Q}, \mathbf{R}\} \quad [2.1.2.1]$$

process equation: $$\mathbf{X}_t = \mathbf{A} \cdot \mathbf{X}_{t-1} + \boldsymbol{\varepsilon}_{t-1} \quad [2.1.2.2]$$

measurement equation: $$\mathbf{Y}_t = \mathbf{C} \cdot \mathbf{X}_t + \boldsymbol{\eta}_t \quad [2.1.2.3]$$
x_0 and V_0 represent the a priori knowledge about the initial state of the process. The
initial state X_0 is generally not known exactly and thus it is specified as a Gaussian probability
density function, with mean vector x_0 and covariance matrix V_0:

$$\mathbf{X}_0 \sim N(\mathbf{x}_0, \mathbf{V}_0) \quad [2.1.2.4]$$

The more precisely the initial state X_0 is known, the lower the entries in the covariance matrix V_0,
and vice versa. A is a d_x × d_x state transition matrix, which is not necessarily row stochastic. C is
the observation matrix, of dimension d_y × d_x. ε is the process random Gaussian variable, with zero
mean and covariance matrix Q, of dimension d_x × d_x. η is the measurement random Gaussian
variable, with zero mean and covariance matrix R, of dimension d_y × d_y.

The LGM is a memory-less process, in the sense that the state at time t+1 is a function of
the current state only (similar to the HMM):

$$p(\mathbf{X}_{t+1} \mid \mathbf{X}_0, \mathbf{X}_1, \ldots, \mathbf{X}_t, \mathbf{Y}_0, \mathbf{Y}_1, \ldots, \mathbf{Y}_{t-1}) = p(\mathbf{X}_{t+1} \mid \mathbf{X}_t) \quad [2.1.2.5]$$
From the process equation [2.1.2.2] it is obvious that the conditional probability density of X_t
given X_{t-1} is a multivariate Gaussian:

$$p(\mathbf{X}_t \mid \mathbf{X}_{t-1}) = N(\mathbf{A} \cdot \mathbf{X}_{t-1}, \mathbf{Q}) \quad [2.1.2.6]$$

The conditional probability density of X_t given a previous state specified with a probability
distribution of mean x_{t-1} and variance V_{t-1} is another multivariate Gaussian:

$$p(\mathbf{X}_t \mid \mathbf{x}_{t-1}, \mathbf{V}_{t-1}) = N\left(\mathbf{A} \cdot \mathbf{x}_{t-1}, \; \mathbf{Q} + \mathbf{A} \cdot \mathbf{V}_{t-1} \cdot \mathbf{A}^T\right) \quad [2.1.2.7]$$
The observation noise is non-correlated (white), thus the measurement at time t, Y_t, depends only
on the state at time t:

$$p(\mathbf{Y}_t \mid \mathbf{X}_0, \mathbf{X}_1, \ldots, \mathbf{X}_t, \mathbf{Y}_0, \mathbf{Y}_1, \ldots, \mathbf{Y}_{t-1}) = p(\mathbf{Y}_t \mid \mathbf{X}_t) \quad [2.1.2.8]$$

The conditional probability density of observing Y_t given X_t is a multivariate Gaussian:

$$p(\mathbf{Y}_t \mid \mathbf{X}_t) = N(\mathbf{C} \cdot \mathbf{X}_t, \mathbf{R}) \quad [2.1.2.9]$$

The conditional probability density of observing Y_t given a state specified as a probability
distribution with mean x_t and variance V_t is another multivariate Gaussian:

$$p(\mathbf{Y}_t \mid \mathbf{x}_t, \mathbf{V}_t) = N\left(\mathbf{C} \cdot \mathbf{x}_t, \; \mathbf{R} + \mathbf{C} \cdot \mathbf{V}_t \cdot \mathbf{C}^T\right) \quad [2.1.2.10]$$
2.2 Bayesian inference
Essentially, Bayesian theory provides a way to refine a prior belief or
knowledge about a quantity (for example, a priori parameter estimates) by taking into account
additional information (for example, experimental data). Most algorithms for dynamic state space
models rely on Bayesian formalism (Baum et al, 1970; Berger, 1985; West and Harrison, 1997;
Ghahramani, 2001). The fundamental result of Bayesian inference is written as (Bayes, 1783;
Box and Tiao, 1973):
$$\mathrm{Posterior} = \frac{\mathrm{Likelihood} \times \mathrm{Prior}}{\mathrm{Evidence}} \quad [2.2.1]$$
The Bayesian theorem can be applied to parameter estimation (Gelman et al, 1995;
Jensen, 1996; West and Harrison, 1997):
pParameters | Data  
pData | Parameters  pParameters
pData 
[2.2.2]
The prior, p(Parameters), represents the a priori knowledge about Parameters, before
observing data. The likelihood, p(Data|Parameters), represents the conditional probability of
having observed the data, given the parameters. The posterior, p(Parameters|Data), represents
the a posteriori knowledge about parameters, after the data was observed. The evidence, p(Data),
is the probability of having observed the Data, regardless of parameters. The evidence is
computed from the conditional probability of observing data given parameters, integrated over all
possible values of the parameters:

$$p(\mathrm{Data}) = \int p(\mathrm{Data} \mid \mathrm{Parameters}) \cdot p(\mathrm{Parameters}) \, d\mathrm{Parameters} \quad [2.2.3]$$
It should be noted that the posterior, prior, likelihood and evidence are, in general, probability
distributions.
Expression [2.2.2] can be understood in terms of joint and conditional probabilities. If A
and B are two probabilistic events, then the following relations hold (Cox and Miller, 1965):

$$A = f(B): \quad P(A, B) = P(A \mid B) \cdot P(B) \quad [2.2.4]$$
$$P(A) = \sum_B P(A, B) \quad [2.2.5]$$
$$P(A, B) = P(B, A) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A) \quad [2.2.6]$$
$$A = f(B): \quad P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \quad [2.2.7]$$
$$A \ne f(B): \quad P(A, B) = P(A) \cdot P(B) \quad [2.2.8]$$

P(A,B) is the joint probability of A and B, and P(A|B) is the conditional probability of A, given B.
The expression of the posterior [2.2.2] can be derived along the same lines, by replacing A with
Parameters and B with Data, in expression [2.2.7]. The evidence [2.2.3] is obtained from
[2.2.5], with the same replacements. If the parameter space is continuous, the sum is replaced
with an integral.
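For a discrete parameter space, [2.2.2] and [2.2.3] reduce to a normalized product of likelihood and prior. A toy numeric sketch, with likelihood values assumed purely for illustration:

```python
import numpy as np

# Two candidate parameter values with a flat prior
prior = np.array([0.5, 0.5])
# p(Data | theta_k): assumed numbers, standing in for an actual likelihood calculation
likelihood = np.array([0.8, 0.2])

evidence = np.sum(likelihood * prior)       # eq. 2.2.3, summed over the parameter space
posterior = likelihood * prior / evidence   # eq. 2.2.2
```

The posterior here is (0.8, 0.2): with a flat prior, the posterior simply renormalizes the likelihood, which is why maximum a posteriori and maximum likelihood estimation coincide in that case (section 2.2.3).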
2.2.1 Likelihood calculation
I apply now Bayesian inference to hidden Markov and linear Gaussian models. I use a
parametric model M (either HMM or LGM), a parameter estimate θ, a sequence of observations
Y = {y0,…,yt,…,yT} and a sequence of hidden states X = {x0,…,xt,…,xT}. In view of parameter
estimation, I write Bayes' theorem as:

$$p(\theta \mid \mathbf{Y}, M) = \frac{p(\mathbf{Y} \mid \theta, M) \cdot p(\theta \mid M)}{p(\mathbf{Y} \mid M)} \quad [2.2.1.1]$$
The terms in the right-hand side of equation [2.2.1.1] are not single numbers but
probability distributions. Therefore, the posterior, p(θ|Y,M), must be interpreted as a probability
distribution too. For clarity of reading, I eliminate the model M from further expressions.
Nevertheless, M should be considered an implicit factor.
Both the likelihood, p(Y|θ), and the evidence, p(Y), are functions of the hidden state X.
Since X is not known, due to measurement uncertainty (noise) and possibly aggregation, p(Y|θ) is
termed the incomplete data likelihood:

$$l_{IC} = p(\mathbf{Y} \mid \theta) \quad [2.2.1.2]$$

Computation of the incomplete data likelihood requires marginalization with respect to
X, by which all possible sequences X and their intrinsic probability of occurrence are accounted
for. Then, for a continuous process, the incomplete data likelihood is:

$$l_{IC} = \int_X p(\mathbf{Y}, \mathbf{X} \mid \theta) \, d\mathbf{X} \quad [2.2.1.3]$$
lIC   pY | X ,θ  pX | θdX
[2.2.1.4]
X
For hidden Markov models, where the state takes discrete values, the integral over the
state-space X is replaced with a sum, as follows:
lIC   pY | X ,θ pX | θ
[2.2.1.5]
X
The complete data likelihood is defined as the conditional joint probability of observing
data Y and a state sequence X, given parameters θ:

$$l_{C,X} = p(\mathbf{Y}, \mathbf{X} \mid \theta) \quad [2.2.1.6]$$
$$l_{C,X} = p(\mathbf{Y} \mid \mathbf{X}, \theta) \cdot p(\mathbf{X} \mid \theta) \quad [2.2.1.7]$$
For a memory-less hidden Markov or linear Gaussian model, with non-correlated noise,
the two terms in the right-hand side of equation [2.2.1.7] are expanded into:

$$p(\mathbf{Y} \mid \mathbf{X}, \theta) = \prod_t p(y_t \mid y_0, \ldots, y_{t-1}, y_{t+1}, \ldots, y_T, x_0, \ldots, x_T, \theta) \quad [2.2.1.8]$$
$$p(\mathbf{X} \mid \theta) = \prod_t p(x_t \mid y_0, \ldots, y_T, x_0, \ldots, x_{t-1}, x_{t+1}, \ldots, x_T, \theta) \quad [2.2.1.9]$$

Due to the memory-less character and to non-correlated noise, the following simplifications
apply:

$$p(y_t \mid y_0, \ldots, y_{t-1}, y_{t+1}, \ldots, y_T, x_0, \ldots, x_T, \theta) = p(y_t \mid x_t, \theta) \quad [2.2.1.10]$$
$$t > 0: \quad p(x_t \mid y_0, \ldots, y_T, x_0, \ldots, x_{t-1}, x_{t+1}, \ldots, x_T, \theta) = p(x_t \mid x_{t-1}, \theta) \quad [2.2.1.11]$$
$$t = 0: \quad p(x_t \mid y_0, \ldots, y_T, x_0, \ldots, x_{t-1}, x_{t+1}, \ldots, x_T, \theta) = p(x_0 \mid \theta) \quad [2.2.1.12]$$
The complete data likelihood takes the form:

$$l_{C,X} = p(y_0 \mid x_0, \theta) \cdot p(x_0 \mid \theta) \cdot \prod_{t=1}^{T} p(y_t \mid x_t, \theta) \cdot p(x_t \mid x_{t-1}, \theta) \quad [2.2.1.13]$$

Expression [2.2.1.13] has specific forms for HMM and for LGM:

HMM: $$l_{C,X} = b_{x_0}(y_0) \cdot \pi_{x_0} \cdot \prod_{t=1}^{T} b_{x_t}(y_t) \cdot a_{x_{t-1} x_t} \quad [2.2.1.14]$$

LGM: $$l_{C,X} = N(\mathbf{C} \cdot \mathbf{x}_0, \mathbf{R} \mid \mathbf{y}_0) \cdot N(\mathbf{x}_0, \mathbf{V}_0 \mid \mathbf{x}_0) \cdot \prod_{t=1}^{T} N(\mathbf{C} \cdot \mathbf{x}_t, \mathbf{R} \mid \mathbf{y}_t) \cdot N(\mathbf{A} \cdot \mathbf{x}_{t-1}, \mathbf{Q} \mid \mathbf{x}_t) \quad [2.2.1.15]$$
The incomplete data likelihood for hidden Markov models is calculated differently and
will be presented in section 2.3.1.1.
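For a known state path, the HMM form [2.2.1.14] is a direct product of initial, transition, and emission terms, most safely evaluated in log space. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def hmm_complete_loglik(x, y, pi, A, mu, sigma):
    """log l_C,X per eq. 2.2.1.14:
    log pi_{x0} + log b_{x0}(y0) + sum over t of
    [log a_{x(t-1),x(t)} + log b_{x(t)}(y_t)]."""
    def log_b(i, obs):
        # log Gaussian emission density (eq. 2.1.1.10)
        return (-0.5 * np.log(2 * np.pi * sigma[i] ** 2)
                - 0.5 * ((obs - mu[i]) / sigma[i]) ** 2)
    ll = np.log(pi[x[0]]) + log_b(x[0], y[0])
    for t in range(1, len(x)):
        ll += np.log(A[x[t - 1], x[t]]) + log_b(x[t], y[t])
    return ll
```

Working in logs turns the long product into a sum, which anticipates the numerical underflow issue addressed for the Forward-Backward procedure in Chapter 3.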
In order to compute the evidence, p(Y) must be marginalized with respect to both state X
and parameters θ. This is accomplished by integrating the incomplete data likelihood with respect
to θ. Expression [2.2.1.1] changes to:

$$p(\theta \mid \mathbf{Y}) = \frac{\left(\int_X p(\mathbf{Y} \mid \mathbf{X}, \theta) \cdot p(\mathbf{X} \mid \theta) \, d\mathbf{X}\right) \cdot p(\theta)}{\int_\theta \left(\int_X p(\mathbf{Y} \mid \mathbf{X}, \theta) \cdot p(\mathbf{X} \mid \theta) \, d\mathbf{X}\right) \cdot p(\theta) \, d\theta} \quad [2.2.1.16]$$
2.2.2 State inference
State inference is the procedure by which probabilities (for discrete states) or probability
densities (for continuous states) are assigned to all possible states. The most likely sequence of
states, X^ML, is of foremost importance. X^ML is defined as the argument which maximizes the
complete data likelihood for a given parameter estimate θ:

$$\mathbf{X}^{ML} = \arg\max_X \, p(\mathbf{Y} \mid \mathbf{X}, \theta) \cdot p(\mathbf{X} \mid \theta) \quad [2.2.2.1]$$
The most likely sequence of states has slightly different meanings for hidden Markov and
for linear Gaussian models. The X^ML of an HMM is simply a sequence of numbers, i.e. the state
index at each time. LGMs have continuous states and thus, by necessity, X^ML is a sequence of
probability densities. The increase in complexity is, however, minimal, since the probability
densities are Gaussian and can be specified, regardless of sequence length, with only two
parameters, namely the mean, x_t, and the variance, V_t. This is a direct consequence of the
properties of random Gaussian variables (Cox and Miller, 1965): (i) a linear transformation of a
Gaussian variable is another Gaussian, and (ii) the sum of two Gaussian variables is another
Gaussian (i.e. the convolution of two Gaussian densities is a Gaussian density).
2.2.3 Parameter estimation
The maximum a posteriori (MAP) parameter estimate, θ^MAP, is, by definition, the
argument θ that maximizes the posterior in expression [2.2.1.1]:

$$\theta^{MAP} = \arg\max_\theta \, p(\theta \mid \mathbf{Y}) \quad [2.2.3.1]$$

Since the denominator in expression [2.2.1.1] is not a function of θ, it can be dropped from
maximization:

$$\theta^{MAP} = \arg\max_\theta \, p(\mathbf{Y} \mid \theta) \cdot p(\theta) \quad [2.2.3.2]$$
$$\theta^{MAP} = \arg\max_\theta \, l_{IC} \cdot p(\theta) \quad [2.2.3.3]$$

The expressions above are obviously functions of the unknown state sequence X. When the state
sequence is known, a MAP estimate is obtained using the complete data likelihood, l_C,X:

$$\theta^{MAP} = \arg\max_\theta \, l_{C,X} \cdot p(\theta) \quad [2.2.3.4]$$
Sometimes, there is no a priori knowledge about parameters, before data is available.
Equivalently, the parameter prior, p(θ), can be considered to be a uniform probability distribution,
which means that all possible values are equally likely. In this case, the parameter prior can be
dropped out of maximization. The maximum a posteriori becomes the maximum likelihood (ML)
estimate, θ^ML. θ^ML maximizes only the likelihood term (either the incomplete or the complete
data likelihood):

$$\theta^{ML} = \arg\max_\theta \, l_{IC} \quad [2.2.3.5]$$
$$\theta^{ML} = \arg\max_\theta \, l_{C,X} \quad [2.2.3.6]$$
Clearly, the likelihood term in expression [2.2.3.2] depends on the amount of data. When a large
amount of data is available then the likelihood term dominates the maximization and the
parameter prior becomes irrelevant. When data is scarce, however, the prior might determine a
smoother, better estimate.
In the field of single channel analysis, the ML parameter estimate is almost always
sought, due to the lack of a parameter prior. Recent work, however, shows that MAP estimation is
possible, although at great computational cost (Gauvain and Lee, 1992; Ball et al, 1999; De Gunst
et al, 2001; Rosales et al, 2001).
2.3 Basic algorithms
In this section I review some of the standard algorithms for likelihood calculation,
parameter estimation and sequence estimation. Likelihood calculation is a necessary prerequisite
for parameter estimation and optimal sequence finding.
For hidden Markov models, the Forward-Backward algorithm calculates the incomplete
data likelihood and Viterbi calculates the complete data likelihood. Implicitly, both Forward-Backward
and Viterbi find the most likely sequence of states, although differently defined. For
LGMs, the Kalman filter-smoother calculates the complete data likelihood and finds the most
likely sequence of states (probability densities).
The Expectation-Maximization (EM) algorithm finds a maximum a posteriori or maximum
likelihood parameter estimate from incomplete data. The Baum-Welch algorithm is a particular
form of EM for maximum likelihood parameter estimation in HMMs. There are other variants of
EM, such as the generalized and the stochastic EM, as well as the Segmental k-means algorithm
for HMMs. General numerical optimization methods are an alternative to EM.
In the next section I discuss only hidden Markov models with continuous emission
probabilities, as they alone are relevant for the present work.
2.3.1. Likelihood calculation
Since hidden Markov models have discrete states, both the incomplete and the complete
data likelihood can be calculated. For linear Gaussian models, with continuous states, only
calculation of the complete data likelihood is meaningful.
2.3.1.1 Forward-Backward procedure for HMM
In order to calculate the incomplete data likelihood, l_IC, the complete data likelihood
must be computed for every possible state sequence:

$$l_{IC} = \sum_X p(\mathbf{Y} \mid \mathbf{X}, \theta) \cdot p(\mathbf{X} \mid \theta) \quad [2.3.1.1.1]$$
The total number of possible sequences grows exponentially with the sequence length.
For a model with N_S states and a sequence of length T, the number of combinations is N_S^T.
Calculating the likelihood by iterating all possible sequences is generally impossible. Fortunately,
the Forward-Backward procedure (Baum et al, 1970), a dynamic programming algorithm,
calculates the likelihood with a computational cost that grows linearly with the sequence length.
Strictly speaking, only the forward part is necessary for likelihood calculation. I will also describe
the backward part since it is used in Baum-Welch (Baum et al, 1970) for parameter estimation.
For a given model M and parameter estimate θ, the Forward-Backward procedure
involves the computation of two recursive variables, termed α and β, at each time t, for each state
i:

$$\alpha_t^i = p(y_0, y_1, \ldots, y_t, x_t = i \mid \theta) \quad [2.3.1.1.2]$$
$$\beta_t^i = p(y_{t+1}, y_{t+2}, \ldots, y_T \mid x_t = i, \theta) \quad [2.3.1.1.3]$$

α_t^i is the conditional joint probability of observing the data sequence up to time t and of the
process being in state i at time t, given parameters θ. β_t^i is the conditional probability of observing
the data sequence from time t+1 to T, given the process is in state i at time t and the parameters θ.
The recursion, forward for α and backward for β, is written and initialized as:

$$\alpha_{t+1}^j = \left(\sum_{i=1}^{N_S} \alpha_t^i \cdot a_{ij}\right) \cdot b_j(y_{t+1}) \quad [2.3.1.1.4]$$
$$\beta_t^i = \sum_{j=1}^{N_S} a_{ij} \cdot b_j(y_{t+1}) \cdot \beta_{t+1}^j \quad [2.3.1.1.5]$$
$$\alpha_0^i = \pi_i \cdot b_i(y_0) \quad [2.3.1.1.6]$$
$$\beta_T^i = 1 \quad [2.3.1.1.7]$$
The reader is reminded that a_ij is an element of the transition probability matrix A; b_j(·) is
an element of the emission probability vector B; π_i is an element of the initial probability vector
π.
The incomplete data likelihood can be obtained from α, from β, or from both:

$$l_{IC} = \sum_{i=1}^{N_S} \alpha_T^i \quad [2.3.1.1.8]$$
$$l_{IC} = \sum_{i=1}^{N_S} \pi_i \cdot b_i(y_0) \cdot \beta_0^i \quad [2.3.1.1.9]$$
$$l_{IC} = \sum_{i=1}^{N_S} \alpha_t^i \cdot \beta_t^i \quad [2.3.1.1.10]$$
Other quantities calculated by the Forward-Backward procedure are:

$$\gamma_t^i = p(x_t = i \mid y_0, y_1, \ldots, y_T, \theta) \quad [2.3.1.1.11]$$
$$\xi_t^{ij} = p(x_t = i, x_{t+1} = j \mid y_0, y_1, \ldots, y_T, \theta) \quad [2.3.1.1.12]$$

γ_t^i is the conditional probability that the process is in state i at time t, given the whole observation
sequence Y and parameters θ. ξ_t^{ij} is the conditional joint probability that the process is in state i at
time t, and in state j at time t+1, given the whole observation sequence Y and the parameters θ. γ
and ξ are calculated as:

$$\gamma_t^i = \frac{\alpha_t^i \cdot \beta_t^i}{\sum_{j=1}^{N_S} \alpha_t^j \cdot \beta_t^j} \quad [2.3.1.1.13]$$
$$\xi_t^{ij} = \frac{\alpha_t^i \cdot a_{ij} \cdot b_j(y_{t+1}) \cdot \beta_{t+1}^j}{\sum_{k=1}^{N_S} \sum_{l=1}^{N_S} \alpha_t^k \cdot a_{kl} \cdot b_l(y_{t+1}) \cdot \beta_{t+1}^l} \quad [2.3.1.1.14]$$
The Forward-Backward procedure is, in fact, an optimal filter-smoother algorithm for
discrete time dynamic models with discrete states. The Forward part obtains a filtered state
estimate, while the Backward part refines it into a smoothed state estimate, given the noisy data
and the model parameters.
A way of calculating α and β in matrix form is given in Chapter 3, along with
computational details and ways to prevent numerical over- or underflow. The reason behind the
choice of initializing β will also become apparent in Chapter 3.
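The recursions [2.3.1.1.4]-[2.3.1.1.7] and the smoothed probabilities [2.3.1.1.13] can be sketched with per-step scaling, one common way to avoid the underflow mentioned above; the function name and parameterization are illustrative (the matrix formulation used in this work appears in Chapter 3):

```python
import numpy as np

def forward_backward(y, pi, A, mu, sigma):
    """Scaled Forward-Backward pass. Returns the smoothed state
    probabilities gamma (eq. 2.3.1.1.13) and the incomplete data
    log-likelihood, log l_IC."""
    T, n = len(y), len(pi)
    # Gaussian emission probabilities b_i(y_t), eq. 2.1.1.10
    b = np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    alpha = np.empty((T, n)); beta = np.empty((T, n)); c = np.empty(T)
    alpha[0] = pi * b[0]                      # eq. 2.3.1.1.6
    c[0] = alpha[0].sum(); alpha[0] /= c[0]   # rescale to prevent underflow
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]  # eq. 2.3.1.1.4
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0                            # eq. 2.3.1.1.7
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (b[t + 1] * beta[t + 1])) / c[t + 1]  # eq. 2.3.1.1.5
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True) # eq. 2.3.1.1.13
    return gamma, np.sum(np.log(c))           # log of eq. 2.3.1.1.8
```

Because each α_t is renormalized by its sum c_t, the incomplete data likelihood is recovered as the product of the scale factors, i.e. log l_IC = Σ_t log c_t.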
2.3.2. Most likely sequence of states
For a hidden Markov model, the most likely sequence of states can be obtained in two
ways: with the Forward-Backward procedure or with the Viterbi algorithm. Forward-Backward
gives a most likely filtered or smoothed state at time t, as, respectively:

$$x_t^{ML} = \arg\max_i \, \alpha_t^i \quad [2.3.2.1]$$
$$x_t^{ML} = \arg\max_i \, \gamma_t^i \quad [2.3.2.2]$$
For aggregated HMMs, states in the same aggregation class have the same emission
probabilities. It is therefore realistic to find the most likely sequence of classes. It is interesting to
observe that, for ion channels, the optimal sequence is not necessarily limited to kinetic states or
to conductance classes. For example, in data from acetylcholine receptors, one might want to find
an optimal sequence of clusters of activity, separated by desensitized intervals.
2.3.2.1 Viterbi algorithm for HMM
One way to find the most likely state sequence is to calculate the complete data
likelihood for each possible sequence and to choose the maximum:
XML  arg max  pY | X, θ   pX | θ 
[2.3.2.1.1]
X
As indicated previously, this approach is unrealistic. Fortunately, the Viterbi dynamic
programming algorithm (Viterbi, 1967; Forney, 1973) finds X^ML in a much more efficient way,
with the computational cost a linear function of the sequence length. Like the Forward part of the
Forward-Backward procedure, Viterbi involves a recursive calculation of the likelihood. The
difference is that, whereas Forward calculates α_{t+1}^j as a sum of previous α's multiplied by the
corresponding transition and emission probabilities, Viterbi replaces the sum with a maximum:

$$\alpha_{t+1}^j = \max_i \left(\alpha_t^i \cdot a_{ij}\right) \cdot b_j(y_{t+1}) \quad [2.3.2.1.2]$$
It can be shown that δ_{t+1}^j is equal to the complete data likelihood for the state sequence that
terminates in state j at t+1 (Viterbi, 1967; Forney, 1973). Furthermore, this sequence has the
highest complete data likelihood of all possible sequences terminating in state j at t+1. To store
this optimal sequence, for each t and each j, an additional variable ψ records the most likely
previous state:
 t j 1  arg max  ti  aij   b j  yt 1 
i


The recursion is initiated at time zero as:
43
[2.3.2.1.3]
 oj   j  b j  y0 
[2.3.2.1.4]
From time T, XML can be reconstructed backwards. The optimal state for the last data
point is chosen as that which terminates the sequence with the highest score. The second to the
last optimal state is retrieved from the  variable, and so on, down to the first data point:
x_T = arg max_j δ_T^j    [2.3.2.1.5]

x_t = ψ_{t+1}^{x_{t+1}}    [2.3.2.1.6]
It is important to recognize that Viterbi is, in a sense, a filter. The back-tracing step must
not be mistaken for a second, smoothing pass. The back trace makes no statistical decision.
Viterbi is equivalent to the Forward calculation in that the inference for the state at time t uses
only the data points up to and including time t. In comparison, Forward-Backward infers the state
at time t from all the data available, albeit in two passes. Naturally, the state inference of a
smoothing algorithm is more accurate, as will become obvious from the results section in
Chapter 4. A filter has its advantages, though: it is a single pass and thus approximately twice as
fast, it is simpler to code, and it is suited for online applications.
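The recursion [2.3.2.1.2]-[2.3.2.1.6] can be sketched as follows, in log space to avoid numerical underflow. The Gaussian emission model matches the rest of this work; the NumPy formulation and the example parameters are merely one illustrative implementation, not the thesis code:

```python
import numpy as np

def viterbi(y, pi, A, mu, sigma):
    """Most likely state sequence for an HMM with Gaussian emissions.

    Implements eqs. [2.3.2.1.2]-[2.3.2.1.6] in log space to avoid underflow.
    y: observations (T,); pi: initial probs (N,); A: transition probs (N, N);
    mu, sigma: per-state emission mean and standard deviation (N,).
    """
    T, N = len(y), len(pi)
    # log emission densities b_j(y_t), evaluated for every t and j
    logb = (-0.5 * ((y[:, None] - mu) / sigma) ** 2
            - np.log(sigma * np.sqrt(2 * np.pi)))
    logA = np.log(A)
    delta = np.empty((T, N))           # log delta_t^j
    psi = np.zeros((T, N), dtype=int)  # backpointers psi_t^j
    delta[0] = np.log(pi) + logb[0]                        # [2.3.2.1.4]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA              # delta_t^i * a_ij
        psi[t] = np.argmax(scores, axis=0)                 # [2.3.2.1.3]
        delta[t] = scores[psi[t], np.arange(N)] + logb[t]  # [2.3.2.1.2]
    x = np.empty(T, dtype=int)
    x[-1] = np.argmax(delta[-1])                           # [2.3.2.1.5]
    for t in range(T - 2, -1, -1):     # back trace: no statistical decision,
        x[t] = psi[t + 1, x[t + 1]]    # just bookkeeping   [2.3.2.1.6]
    return x
```

With a well-separated two-state channel, the back trace simply follows each point to the nearest conductance level, modulo the small transition penalty.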
2.3.2.2 Forward-Viterbi algorithm for aggregated HMM
Viterbi is a suboptimal algorithm for aggregated models, because states in the same
aggregation class cannot be distinguished based on observation probabilities. I very briefly
present here a new method, which I call Forward-Viterbi, for estimation of the most likely
sequence of classes. The algorithm is a simple combination of the Forward procedure with the
Viterbi algorithm. A detailed description is therefore not necessary. The Viterbi part finds a most
likely sequence of classes but does not choose a particular state within the class. Instead, the state
probabilities inside a class are advanced forward in time by the Forward part. The obtained
sequence is marginalized with respect to states but not with respect to classes. The
class sequence maximizes the class-complete, state-incomplete data likelihood.
For ion channels, the aggregation could be based on conductance classes but any other
arbitrary state partitioning may be used. The method could be applied, for example, to
identification of clusters of activity in acetylcholine receptor data.
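As an illustrative sketch of this combination (one plausible reading of the scheme, with a hypothetical Gaussian-emission model, not a reference implementation): between classes the recursion takes a maximum, as in Viterbi, while within each candidate class the state probabilities are propagated by summation, as in Forward:

```python
import numpy as np

def forward_viterbi(y, pi, A, mu, sigma, cls):
    """Sketch: most likely class sequence for an aggregated HMM.

    Between classes the recursion takes a maximum (Viterbi); within a
    candidate class the state probabilities are propagated by summation
    (Forward). cls maps each state to its aggregation class.
    """
    T = len(y)
    classes = list(np.unique(cls))
    b = (np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2)
         / (sigma * np.sqrt(2 * np.pi)))
    score, alpha, psi = {}, {}, []
    for c in classes:                        # initialization at t = 0
        v = np.where(cls == c, pi * b[0], 0.0)
        s = v.sum()
        score[c], alpha[c] = np.log(s), v / s
    for t in range(1, T):
        new_score, new_alpha, back = {}, {}, {}
        for c in classes:                    # Viterbi maximum over classes
            best, best_v, best_c = -np.inf, None, None
            for cp in classes:               # Forward sum within class c
                v = np.where(cls == c, alpha[cp] @ A, 0.0) * b[t]
                s = v.sum()
                cand = score[cp] + np.log(s)
                if cand > best:
                    best, best_v, best_c = cand, v / s, cp
            new_score[c], new_alpha[c], back[c] = best, best_v, best_c
        score, alpha = new_score, new_alpha
        psi.append(back)
    seq = [max(classes, key=lambda c: score[c])]
    for back in reversed(psi):               # back trace over classes
        seq.append(back[seq[-1]])
    return seq[::-1]
```

When every class contains exactly one state, the within-class sums are trivial and the procedure reduces to the standard Viterbi algorithm.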
2.3.2.3 Kalman filter-smoother for LGM
The complete data likelihood of a linear Gaussian model is:
l(C, X) = N(C·x_0, R)|_{y_0} · N(x_0, V_0)|_{x_0} · ∏_{t=1}^{T} N(C·x_t, R)|_{y_t} · N(A·x_{t-1}, Q)|_{x_t}    [2.3.2.3.1]
The Kalman filter-smoother (Kalman, 1960; Kalman and Bucy, 1961; Rauch, 1963;
Rauch et al, 1965) finds the sequence of states which maximizes [2.3.2.3.1]. Since the states of an
LGM take continuous values, the optimal sequence is by necessity obtained as a sequence of
Gaussian probability densities, each of mean xt and variance Vt. In other words, the state at any
time t is not known exactly but it is believed to be located at xt. The stronger the belief, the lower
the variance Vt and vice versa. Therefore, the optimal sequence maximizes the belief in the state
values, or, equivalently, minimizes the mean square error. Expression [2.3.2.3.1] can be written
as:
l(C, X) = [∏_{t=0}^{T} p(y_t | x_t)] · [p(x_0) · ∏_{t=1}^{T} p(x_t | x_{t-1})]    [2.3.2.3.2]

l(C, X) = p(Y | X) · p(X)    [2.3.2.3.3]
p(Y | X) alone in expression [2.3.2.3.3] is maximized by a state sequence where y_t = C·x_t, for any t.
In contrast, p(X) alone is maximized by a state sequence where x_t = x_0, for any t. Obviously, the
optimal sequence must balance the two terms, p(Y | X) and p(X).
The Kalman filter-smoother is similar to the Forward-Backward procedure. The filter
step calculates the probability distribution of a state given past data – the equivalent of the α's in
Forward-Backward. The smoother step calculates the probability distribution of a state given the
whole data – the equivalent of the γ's. The result is a minimum mean square error (MMSE) estimate
of the state sequence.
The filter step calculates in the forward direction the following recursive quantities
(Kalman, 1960; Kalman and Bucy, 1961):
x̃_t = A · x̂_{t-1}    (t ≥ 1)    [2.3.2.3.4]

Ṽ_t = A · V̂_{t-1} · A^T + Q    (t ≥ 1)    [2.3.2.3.5]

K_t = Ṽ_t · C^T · (C · Ṽ_t · C^T + R)^{-1}    [2.3.2.3.6]

x̂_t = x̃_t + K_t · (y_t − C · x̃_t)    [2.3.2.3.7]

V̂_t = Ṽ_t − K_t · C · Ṽ_t    [2.3.2.3.8]
K_t is the so-called Kalman gain matrix. x̂_t and V̂_t are the filtered estimates of the state's mean
and covariance.
The smoother step calculates in the backward direction the following recursive quantities
(Rauch, 1963; Rauch et al, 1965):
J_{t-1} = V̂_{t-1} · A^T · Ṽ_t^{-1}    [2.3.2.3.9]

x_{t-1} = x̂_{t-1} + J_{t-1} · (x_t − A · x̂_{t-1})    [2.3.2.3.10]

V_{t-1} = V̂_{t-1} + J_{t-1} · (V_t − Ṽ_t) · J_{t-1}^T    [2.3.2.3.11]

x_T = x̂_T    [2.3.2.3.12]

V_T = V̂_T    [2.3.2.3.13]
xt and Vt are the smoothed estimates of the state’s mean and covariance.
The application of the Kalman algorithm requires strict conformity with the
linear Gaussian model. Any violation of the LGM assumptions (e.g. non-Gaussian transition or
emission probabilities, non-linear process or measurement equations) requires either an exact
algorithm for an approximate and tractable model (for example, the extended Kalman filter
(Balakrishnan, 1984; Sorensen, 1985)) or an approximate and tractable algorithm for an exact
model (for example, Monte Carlo filters (Carter and Kohn, 1996)).
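As a minimal sketch, the filter-smoother equations [2.3.2.3.4]-[2.3.2.3.13] translate to NumPy as follows; the convention of taking the prior (x_0, V_0) as the prediction for the first point is an assumption of this sketch:

```python
import numpy as np

def kalman_smoother(y, A, C, Q, R, x0, V0):
    """Kalman filter-smoother for the LGM, eqs. [2.3.2.3.4]-[2.3.2.3.13].

    y: (T, p) observations; A, Q: (n, n); C: (p, n); R: (p, p);
    x0, V0: prior mean and covariance of the initial state.
    Returns the smoothed (MMSE) state means and covariances.
    """
    T, n = len(y), len(x0)
    xp, Vp = np.zeros((T, n)), np.zeros((T, n, n))   # predicted (x~, V~)
    xf, Vf = np.zeros((T, n)), np.zeros((T, n, n))   # filtered  (x^, V^)
    for t in range(T):
        if t == 0:
            xp[0], Vp[0] = x0, V0                    # prior on the first state
        else:
            xp[t] = A @ xf[t - 1]                    # [2.3.2.3.4]
            Vp[t] = A @ Vf[t - 1] @ A.T + Q          # [2.3.2.3.5]
        K = Vp[t] @ C.T @ np.linalg.inv(C @ Vp[t] @ C.T + R)  # [2.3.2.3.6]
        xf[t] = xp[t] + K @ (y[t] - C @ xp[t])       # [2.3.2.3.7]
        Vf[t] = Vp[t] - K @ C @ Vp[t]                # [2.3.2.3.8]
    xs, Vs = xf.copy(), Vf.copy()                    # [2.3.2.3.12]-[2.3.2.3.13]
    for t in range(T - 1, 0, -1):                    # backward smoother pass
        J = Vf[t - 1] @ A.T @ np.linalg.inv(Vp[t])   # [2.3.2.3.9]
        xs[t - 1] = xf[t - 1] + J @ (xs[t] - A @ xf[t - 1])   # [2.3.2.3.10]
        Vs[t - 1] = Vf[t - 1] + J @ (Vs[t] - Vp[t]) @ J.T     # [2.3.2.3.11]
    return xs, Vs
```

For a scalar random walk observed in noise, the smoothed means lie between the prior mean and the observations, with smaller variance than the filtered estimates.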
2.3.3 Parameter estimation
In this section I review the most common methods of parameter estimation in discrete
time dynamic stochastic models, specifically for HMMs. As I have mentioned earlier, the state
sequence X is not directly observed – it is hidden. Instead, a many-to-one, non-invertible
transformation of X is measured, i.e. the observation sequence Y. Naturally, not knowing the
state makes parameter estimation more difficult. If X were known, a maximum likelihood (ML)
or maximum a posteriori (MAP) parameter estimate of the complete data (Y and X) would be obtained
by maximizing one of the following expressions:
θ MAP  arg max  pY | X , θ   pX | θ   pθ 
[2.3.3.1]
θ ML  arg max  pY | X , θ   pX | θ 
[2.3.3.2]


For HMMs, the ML estimate, at least, has analytical solutions for the set of parameters
Θ = {π, A, B}. The analytical solvability of the MAP estimate depends on the particular prior
used. It is clear, though, that only the transition probabilities, and not the transition rates, can be
obtained this way.
The state sequence cannot be known exactly, because it is usually buried in noise and/or because
the model has aggregated states. Thus, even in the absence of noise, only the aggregation class
sequence can be known exactly; the state sequence itself must be inferred statistically.
Consequently, parameter estimates must be obtained by maximizing the incomplete data
likelihood:
θ^MAP = arg max_θ { [Σ_X p(Y | X, θ) · p(X | θ)] · p(θ) }    [2.3.3.3]

θ^ML = arg max_θ [Σ_X p(Y | X, θ) · p(X | θ)]    [2.3.3.4]
Notice the vicious circle: in order to infer the state sequence X, the parameters θ must be
known. Vice versa, in order to find the parameters θ, the state sequence X must be known. This
circular reasoning can be resolved with an iterative procedure which alternates two steps. First, an
initial set of parameters θ0 is chosen. Then a state inference step finds the most likely state
sequence Xi corresponding to the current parameters θi. A maximization step follows, which re-estimates
the parameters θi+1 from the current state sequence Xi. The iteration of inference/re-estimation
steps terminates when some convergence criterion is reached (although convergence is
not necessarily guaranteed). A simple pseudo-code description is:
initialize:     θ0, i := 0
repeat
    inference:      Xi ← θi
    re-estimation:  θi+1 ← Xi
    i := i + 1
until convergence test
In the next sections I present the most common implementations and the theoretical
ground of this iterative scheme.
2.3.3.1 Expectation-Maximization
The problem of parameter estimation from incomplete data was formalized by Dempster
et al (1977). The solution is the general likelihood maximization method known as Expectation-Maximization²
(Dempster et al, 1977; Little and Rubin, 1987; McLachlan and Krishnan, 1997).
Thus, parameter estimates can be obtained by maximizing the conditional expectation of the log-likelihood
of the complete data, given the incomplete data and a current parameter estimate θ'.
This expectation is termed Q(θ, θ'), and is defined as:

² Also known as Estimate-Maximize
Q(θ, θ') = E[ln p(X, Y | θ) | Y, θ'] = ∫_X ln p(x, Y | θ) · p(x | Y, θ') dx    [2.3.3.1.1]
The ML and MAP parameter estimates can be calculated as:
θ ML  arg max Q θ , θ' 
[2.3.3.1.2]
θ MAP  arg max Q θ , θ'   ln pθ 
[2.3.3.1.3]


EM consists of an iterative sequence of two steps: i) calculate the expectation Q(θ, θ') for
a given parameter estimate and ii) find a new parameter estimate θ which maximizes the
expectation (or its sum with the logarithm of the parameter prior). A brief description of the
algorithm is:
initialize:     θ0, i := 0
repeat
    expectation:    calculate Q(θ, θi)
    maximization:   θi+1 := arg max_θ [Q(θ, θi)]
    i := i + 1
until convergence test
EM should be applied to those estimation problems in which the likelihood function is
difficult or impossible to maximize or to differentiate. The observed data in these problems is
considered a many-to-one function of a random unobserved variable, the knowledge of which
would make finding maximum likelihood or maximum a posteriori parameter estimates
straightforward.
The algorithm is proven to converge to a local maximum of the likelihood function
(Dempster et al, 1977; Wu, 1983). The main drawback of this powerful, robust algorithm is its
slow rate near the convergence point, although several ways of improving convergence speed
have been proposed (Meilijson, 1989; Aitkin and Aitkin, 1996; Jamshidian and Jennrich, 1993;
Liu and Rubin, 1994; Lange, 1995; Liu et al, 1998). Additionally, sometimes the maximization of
Q(θ, θ') has no analytical solution or is difficult to compute. In this case, any increase in Q(θ, θ'),
instead of its maximization, still guarantees convergence to a solution, even though not as
efficiently. This procedure is known as the Generalized Expectation-Maximization algorithm
(Dempster et al 1977; Fessler and Hero 1994). When the expectation Q(θ, θ') cannot be
calculated analytically, it is replaced with a stochastic, approximate solution. This method is
known as the Stochastic Expectation-Maximization algorithm (see Marschner, 2001, and
references therein).
2.3.3.2 Baum-Welch algorithm for HMM
The Baum-Welch algorithm (Baum et al, 1970) performs maximum likelihood parameter
estimation in HMMs (maximum a posteriori variants are also known (Gauvain and Lee, 1992)).
Baum-Welch is a particular instance of EM, although it was developed before EM. The
expectation step of Baum-Welch uses the Forward-Backward procedure for state inference. The
re-estimation formulae obtain a new parameter estimate in the maximization step.
The Q function for HMMs replaces the integral over X with a sum:
Qθ , θ'    ln px , Y | θ  px | Y , θ' 
X
51
[2.3.3.2.1]
The joint probability of the data and the state sequence, p(x, Y | θ), is a product of three terms (see [2.2.1.14]):
p(x, Y | θ) = π_{x_0} · [∏_{t=1}^{T} a_{x_{t-1} x_t}] · [∏_{t=0}^{T} b_{x_t}(y_t)]    [2.3.3.2.2]
Thus Q is a sum of three independent terms:
Q(θ, θ') = Σ_x [ln π_{x_0}] · p(x | Y, θ')
         + Σ_x [Σ_{t=1}^{T} ln a_{x_{t-1} x_t}] · p(x | Y, θ')
         + Σ_x [Σ_{t=0}^{T} ln b_{x_t}(y_t)] · p(x | Y, θ')    [2.3.3.2.3]
The Baum-Welch re-estimation formulae find a new set of parameters in terms of the
variables calculated by the Forward-Backward procedure (for derivation, see Baum et al, 1970;
Rabiner et al, 1986). Each of the three terms in expression [2.3.3.2.3] is maximized separately, by
setting its derivative with respect to π, A and B, respectively, equal to zero. p(x | Y, θ') is not a
function of θ and can be dropped from the maximization. The constraints in equations [2.1.1.3],
[2.1.1.4], [2.1.1.6] and [2.1.1.7] are imposed with Lagrange multipliers. The results are:
 i   0i
[2.3.3.2.4]
52
T
aij 

ij
t

i
t
t 1
T
t 1
[2.3.3.2.5]
T
i 
y
t 0
t
 t
[2.3.3.2.6]
T

t 0
t
T
 i2 
   y
t 0
t
 i 
2
t
[2.3.3.2.7]
T

t 0
t
Baum-Welch estimates transition probabilities (the A matrix) rather than transition rates
(the Q matrix). Recently, the EM algorithm was modified to give maximum likelihood estimates
of the Q matrix directly (Michalek and Timmer 1999). Despite this limitation, Baum-Welch
remains the best choice for standard HMM parameter estimation and for state inference.
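A minimal sketch of one Baum-Welch iteration, with a scaled Forward-Backward pass for the expectation step and the re-estimation formulae [2.3.3.2.4]-[2.3.3.2.7] for the maximization step. The Gaussian emission model follows the rest of the chapter; the particular scaling convention is one common choice, not the thesis code:

```python
import numpy as np

def baum_welch_step(y, pi, A, mu, sigma):
    """One Baum-Welch iteration for a Gaussian-emission HMM.

    Returns updated (pi, A, mu, sigma) and the log-likelihood
    of the data under the *current* (input) parameters.
    """
    T, N = len(y), len(pi)
    b = (np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2)
         / (sigma * np.sqrt(2 * np.pi)))
    alpha, beta = np.zeros((T, N)), np.zeros((T, N))
    s = np.zeros(T)                        # per-point scaling factors
    alpha[0] = pi * b[0]
    s[0] = alpha[0].sum(); alpha[0] /= s[0]
    for t in range(1, T):                  # scaled forward pass
        alpha[t] = (alpha[t - 1] @ A) * b[t]
        s[t] = alpha[t].sum(); alpha[t] /= s[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):         # scaled backward pass
        beta[t] = (A * b[t + 1]) @ beta[t + 1] / s[t + 1]
    gamma = alpha * beta                   # state posteriors gamma_t^i
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi_t^{ij}: joint posterior of (x_{t} = i, x_{t+1} = j)
    xi = (alpha[:-1, :, None] * A[None] * (b[1:] * beta[1:])[:, None, :]
          / s[1:, None, None])
    new_pi = gamma[0]                                          # [2.3.3.2.4]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # [2.3.3.2.5]
    w = gamma.sum(axis=0)
    new_mu = gamma.T @ y / w                                   # [2.3.3.2.6]
    new_sigma = np.sqrt(
        (gamma * (y[:, None] - new_mu) ** 2).sum(axis=0) / w)  # [2.3.3.2.7]
    return new_pi, new_A, new_mu, new_sigma, np.log(s).sum()
```

Iterating this step illustrates the EM monotonicity property: the log-likelihood is non-decreasing from one iteration to the next.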
2.3.3.3 Segmental k-means algorithm for HMM
The segmental k-means algorithm (Rabiner et al, 1986) for HMMs finds the complete
data maximum likelihood, or maximum a posteriori parameter estimates:
θ MAP  arg max max  pY | x , θ   px | θ   pθ 
[2.3.3.3.1]
θ ML  arg max max  pY | x , θ   px | θ 
[2.3.3.3.2]


x
x
53
Not unlike EM, SKM has an expectation step which uses Viterbi to find the most likely
sequence of states given the data and a parameter estimate. A maximization step re-estimates the
parameters from the data and the sequence found by Viterbi. The re-estimation formulae are
similar to the ones used in Baum-Welch:
 i  I  x0  i 
[2.3.3.3.3]
T
aij 
 I x
t 1
 i   I  xt  j 
t 1
T
 I x
t 1
T
i 
 y  I x
t
t 0
T
 I x
t 0
T
 i2 
 I x
t 0
t
t
t 1
[2.3.3.3.4]
 i
 i
[2.3.3.3.5]
 i
 i    yt   i 
2
t
T
 I x
t 0
t
 i
[2.3.3.3.6]
I(·) is the indicator function, which returns one if the argument is true and zero if the
argument is false. Of course, the re-estimation formulae above can be implemented much more
efficiently in practice: instead of the indicator function, one can simply sum over the appropriate
terms only. Like Baum-Welch, SKM estimates the transition probabilities but not the transition
rates. Somewhat simpler than Baum-Welch, SKM should be used mainly in situations where the
complete data likelihood is strongly peaked for one sequence of states, or aggregation classes,
relative to other sequences. Such a situation may arise, for example, when the signal-to-noise ratio
is good, resulting in a predominant role for the emission probabilities. When accurate estimates
are desired or when the signal-to-noise ratio is low, Baum-Welch is the best choice.
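The maximization step of SKM reduces to counting. A sketch of the re-estimation formulae [2.3.3.3.3]-[2.3.3.3.6], summing over the appropriate terms instead of evaluating the indicator function, and assuming every state is visited at least once in the idealized sequence x:

```python
import numpy as np

def skm_reestimate(y, x, N):
    """SKM maximization step: re-estimate HMM parameters from data y and
    an idealized (Viterbi) state sequence x over N states.
    Implements eqs. [2.3.3.3.3]-[2.3.3.3.6] by direct counting."""
    pi = np.zeros(N); pi[x[0]] = 1.0                     # [2.3.3.3.3]
    counts = np.zeros((N, N))
    np.add.at(counts, (x[:-1], x[1:]), 1)                # transition counts
    A = counts / counts.sum(axis=1, keepdims=True)       # [2.3.3.3.4]
    mu = np.array([y[x == i].mean() for i in range(N)])  # [2.3.3.3.5]
    sigma = np.array([np.sqrt(((y[x == i] - mu[i]) ** 2).mean())
                      for i in range(N)])                # [2.3.3.3.6]
    return pi, A, mu, sigma
```

Each quantity is a simple sample statistic restricted to the points assigned to state i, which is why SKM is fast when the idealization is reliable.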
2.3.3.4 Expectation-maximization algorithm for linear Gaussian models
An Expectation-Maximization algorithm for linear Gaussian models exists. Since I do not
use it in this work, I refer the interested reader to (Roweis and Ghahramani, 1999, and references
therein). It suffices to say that the maximum likelihood parameters of an LGM can be estimated
very conveniently in a fashion similar to the way Baum-Welch algorithm does it for the hidden
Markov model.
3. A maximum likelihood method for kinetic parameter estimation from
single channel data
Kinetic parameters can be estimated from single channel data with Markov model-based
algorithms, using continuous time or discrete time hidden Markov models. While the physical
process, i.e. the ion channel molecule, evolves continuously in time, only a discrete temporal
representation of it is available. The experimental signal is sampled at regular intervals, digitized
by an analog-to-digital converter and eventually processed with a computer. Hence, in order to
use a continuous time Markov model, the continuous signal must be somehow reconstructed from
discrete and noisy data. In comparison, discrete time Markov models work directly with the raw
data.
Continuous time methods have two significant advantages over discrete time methods.
First, transition rates (and not transition probabilities) are estimated directly from a
sequence of kinetic state or conductance class visits (dwells). Second, continuous time
methods are very fast, because their computational complexity scales linearly with
the number of dwells, which is generally much smaller than the number of data points for a given
data set.
Despite these advantages, the use of continuous time methods is complicated by several
problems. First, the algorithms presented in Chapter 2 reconstruct the Markov signal from
noise, but only at the sampling points. Due to the exponential distribution of the state lifetime,
multiple transitions may occur, probabilistically, between two sampling points but remain
unobserved (Colquhoun and Hawkes, 1995). The reconstitution of a continuous dwell sequence
from a discrete time Markov signal is thus affected by the missed events problem. Clearly,
whatever happens between two sampling points is unknown but nevertheless can be inferred
statistically. Thus, several methods of correction for missed events exist (Roux and Sauve, 1985;
Blatz and Magleby, 1986; Ball and Sansom, 1988; Crouzy and Sigworth, 1990; Ball et al, 1993;
Colquhoun et al, 1996; Qin et al, 1996). The exact correction (Colquhoun et al, 1996) is
computationally expensive, whereas the first-order correction (Qin et al, 1996) is efficient and
appears adequate for analysis of even fast kinetics.
Second, continuous time methods require the discrete time Markov signal to be
extracted from noisy data, which is not always possible. In addition to the simple amplitude
threshold method (Sachs et al, 1982), more sophisticated procedures for signal extraction are
available, such as the Hinkley detector (Draber and Schultze, 1994) and the hidden Markov
model algorithms described in Chapter 2. Essentially, each data point must be classified into one
kinetic state, or, if the model has aggregated states (with identical measurement characteristics),
into one conductance class. Obviously, in the total absence of noise, any data point can be
assigned with probability one to the correct conductance class. More realistically, in the presence
of moderate noise, most data points can still be classified correctly, with high probability. This
implies that the incomplete data likelihood function [2.2.1.7] has a (much) higher value for the
correct sequence of conductance classes, C*, mostly due to the measurement term p(Y | X, θ):
pY | X, θ  PX | θ |XC*  pY | X , θ  PX | θ |XC*
[3.1]
pY | X, θ |XC*  pY | X, θ |XC*  0
[3.2]
Therefore, maximizing the incomplete data likelihood [2.2.1.5] can be limited to those state
sequences in the class sequence C*, since all others have a negligible contribution, regardless of the
value of the parameter θ. Moreover, only the P(X | θ) term needs to be maximized.
In contrast, when the signal-to-noise ratio is low, relations [3.1] and [3.2] no longer hold,
and the incomplete data likelihood is flat with respect to the class sequence. Hence, in order to get
unbiased kinetic estimates, all possible sequences must be considered. Therefore, continuous time
methods cannot be used, since, by definition, they work with a single conductance class sequence.
Hence, discrete time methods are the only possible choice.
One benefit of a discrete time method is that missed event corrections are not necessary,
since no assumption is made about what happens between recorded samples. The Baum-Welch
algorithm (Baum et al, 1970), which is a particular instance of the more general Expectation-Maximization
(Dempster et al, 1977), obtains maximum likelihood parameters from raw data,
using hidden Markov models. While Baum-Welch estimates transition probabilities, it can be
modified to obtain transition rates directly (Michalek and Timmer, 1999). EM-based methods are
accurate and robust but have a slow convergence rate (Dempster et al, 1977; Wu, 1983). The
most effective means of acceleration are gradient-based (Jamshidian and Jennrich, 1993; Aitkin
and Aitkin, 1996; Lange, 1995), but they lose some of the valuable EM properties, such as
simplicity and robustness.
MPL (Maximum Point Likelihood) is a fast maximum likelihood algorithm (Qin et al,
2000a), which estimates transition rates directly from raw data, using discrete time hidden
Markov models. The incomplete data likelihood is calculated with the Forward-Backward
procedure, and it is maximized numerically with a quasi-Newton method, which provides faster,
quadratic convergence. As in the MIL (Maximum Idealized Likelihood) algorithm (Qin et al,
1996), the likelihood function and its derivatives are calculated analytically, thus considerably
increasing the computation speed. Of critical importance are the analytical derivatives of the
matrix exponential with respect to transition rates (Ball and Sansom, 1989; Najfeld and Havel,
1994).
Obviously, a point-by-point algorithm is much slower than a dwell-based method.
Hence, in spite of its mathematical sophistication, MPL is computationally slower than MIL
and is therefore limited in practice to short, noisy data sets with fast kinetics. For any other data,
MIL is preferred, since it is equally accurate yet faster. Using model expansion (meta-models),
MPL can be applied to data with more than one channel, or with correlated noise and/or limited
bandwidth (Qin et al, 2000b).
The method I propose and describe in this chapter is broadly based on MPL, in the sense
that it estimates directly the rate constants of a discrete time Markov model. Unlike MPL,
however, it derives the estimates from a single conductance class sequence (idealized data). I
therefore refer to this method as MIP (Maximum Idealized Point likelihood).
MIP does not need missed event correction, since it operates with discrete Markov
models. Hence, it should perform better than MIL for fast kinetics, when the first order
approximation may not be adequate. However, it cannot compensate for signal extraction errors
caused by correlated noise or overfiltering, although the problem can be alleviated by using state
inference algorithms adapted to non-ideal data (Qin et al 2000b; Venkataramaran et al 1998a, b,
2000). In comparison, the missed event correction in MIL does accommodate, to some extent,
these idealization artifacts.
MIP gains much in speed over MPL, achieving a rate comparable to MIL's. Like MIL, it
must be preceded by signal extraction (MPL requires conductance parameter estimation, which is
computationally equivalent). Moreover, it is in principle possible to design an adaptive algorithm,
which uses raw data for sections with high noise or fast kinetics (such as bursts of activity) and
idealized data for the rest. As with MIL and MPL, linear constraints can be imposed on rate
constant estimation.
Although not conceptually new, MIP is supported by several arguments: it is an
independent way of checking the results obtained with other methods; it is relatively fast; it is
simple to implement; and it is easy to apply to non-stationary data, such as the response of the ion
channel to an experimental protocol which modifies the kinetics. Furthermore, it is easy to
modify for other single-molecule applications, as indicated in Chapter 5. In the rest of this chapter
I present the theory behind incomplete data likelihood maximization and its implementation in
MIP, and I extend the use of MIP to non-stationary data.
3.1 Algorithm
MIP estimates rate constants directly from idealized data, by maximizing the incomplete
data likelihood, computed by the dynamic programming Forward-Backward procedure, described
in Chapter 2. The numerical optimization is performed by the dfpmin algorithm (Fletcher and
Powell, 1963; Fletcher, 1981), as implemented in Press et al, 1992. Naturally, other numerical
optimization methods can be used, as long as they work with analytical derivatives of the cost
function, which is critical for a fast and stable optimization. The necessary mathematical
derivations and the computation procedure are described in the next sections.
3.1.1 Parameterization and constraints
Transitions between states are governed by macroscopic rate constants. Let q_ij be the
macroscopic rate connecting states i and j. In the general case, q_ij is a function of a pre-exponential
factor k_ij^0, a ligand concentration C_ij, an exponential factor k_ij^1 and an exponential
quantity V_ij (for example voltage, pressure, etc.) (Hille, 1992):

q_ij = k_ij^0 · C_ij · exp(k_ij^1 · V_ij)    [3.1.1.1]
For convenience, when there is no ligand binding, Cij is made equal to one. Likewise, when the
voltage dependency is of no interest, or when the voltage across the data set is constant, kij1 is
made equal to zero, in which case the exponential term is lumped into the pre-exponential.
The parameters of interest for estimation are the pre-exponential and exponential constant factors
k_ij^0 and k_ij^1. While there is no mathematical constraint on the exponential factor k_ij^1, the
pre-exponential factor k_ij^0 must be positive. To constrain the estimate of k_ij^0 to positive values, the two
constants can be re-parameterized as follows:
v_ij = ln(k_ij^0)    [3.1.1.2]

w_ij = k_ij^1    [3.1.1.3]

ln q_ij = v_ij + ln C_ij + w_ij · V_ij    [3.1.1.4]
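A small numerical check of [3.1.1.1]-[3.1.1.4], with illustrative values (assuming NumPy):

```python
import numpy as np

def rate(k0, k1, C=1.0, V=0.0):
    """Macroscopic rate q_ij = k0 * C * exp(k1 * V)  (eq. [3.1.1.1])."""
    return k0 * C * np.exp(k1 * V)

def rate_from_vw(v, w, C=1.0, V=0.0):
    """Same rate from the unconstrained variables v = ln(k0), w = k1
    (eqs. [3.1.1.2]-[3.1.1.4]); k0 = exp(v) is positive for any real v."""
    return np.exp(v + np.log(C) + w * V)
```

Optimizing over (v, w) instead of (k0, k1) thus enforces the positivity of k0 automatically, with no explicit bound in the optimizer.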
Often, the physical nature of the ion channel demands that some constraints are imposed
on the rate constants. For example, if the two binding sites of the acetylcholine receptor are
presumed identical, then the rates of the two consecutive binding steps are constrained in the
following way:
k L 
L 
C 2
CL k
CL2
0
0
[3.1.1.5]
When both binding sites are free (state C), the probability that the ligand binds to either site is
double the binding probability when only one site is free (state CL). Hence, the C→CL
transition rate is equal to twice the CL→CL2 rate. Clearly, without a constraining mechanism,
the estimates of the two rates would not be in the correct 2:1 ratio.
Constraints are extremely useful for simplifying the kinetics of complicated models and
for incorporating a priori knowledge. We always seek the minimal number of free
parameters to optimize. Whenever possible, constraints should be used, as each reduces the
number of free parameters by one. Typical constraints are: i) fix a pre-exponential or exponential factor;
ii) scale one pre-exponential or exponential factor to another; iii) enforce microscopic reversibility of
loops. These constraints can be formulated as follows:
Fix k0:       k_ij^0 = ct    ⇔    ln(k_ij^0) = ln(ct)    ⇔    v_ij = ln(ct)    [3.1.1.6]

Fix k1:       k_ij^1 = ct    ⇔    w_ij = ct    [3.1.1.7]

Scale k0's:   k_ij^0 = ct · k_kl^0    ⇔    ln(k_ij^0) = ln(k_kl^0) + ln(ct)    ⇔    v_ij − v_kl = ln(ct)    [3.1.1.8]

Scale k1's:   k_ij^1 = ct · k_kl^1    ⇔    w_ij − ct · w_kl = 0    [3.1.1.9]

Loop:         ∏ q_ij = ∏ q_ji    ⇔    Σ ln q_ij = Σ ln q_ji
              ⇔    Σ (v_ij + ln C_ij + w_ij · V_ij) = Σ (v_ji + ln C_ji + w_ji · V_ji)
              ⇔    Σ v_ij + Σ w_ij · V_ij − Σ v_ji − Σ w_ji · V_ji = Σ ln C_ji − Σ ln C_ij    [3.1.1.10]
ct denotes the constrained value. I denote by r the set (vector) of constrained variables, composed
of both vij and wij:
r  vij , wij 
[3.1.1.11]
The constraint expressions above can be written in matrix form as a set of linear equations:
cM · r = cV    [3.1.1.12]
cM is the coefficient matrix and cV is the constant vector. The singular value decomposition
(SVD) technique can be used to express the constrained variables r as a function of a set of
unconstrained, independent variables x (Qin et al, 2000a):
r  cA  x  cB
[3.1.1.13]
r
 cA
x
[3.1.1.14]
The matrix cA and the vector cB are determined from cM and cV, using SVD. x is the set of free
parameters to be optimized. The parameters of interest, i.e. the pre- and the exponential factors k0
and k1 are obtained from x.
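A sketch of this SVD step in NumPy: cB is one particular solution of [3.1.1.12] (the minimum-norm one), and the columns of cA are an orthonormal basis of the null space of cM, so that r = cA·x + cB satisfies the constraints for any free x. The example constraints below are hypothetical:

```python
import numpy as np

def constraint_to_free(cM, cV, rtol=1e-10):
    """Turn linear constraints cM . r = cV into r = cA . x + cB  [3.1.1.13].

    cB is the minimum-norm particular solution; the columns of cA span the
    null space of cM (the free directions), obtained from the SVD of cM.
    """
    U, sv, Vt = np.linalg.svd(cM)
    rank = int((sv > rtol * sv[0]).sum())
    cB = np.linalg.pinv(cM) @ cV   # particular solution of cM . r = cV
    cA = Vt[rank:].T               # null-space basis of cM
    return cA, cB

# Hypothetical example: r = (v1, v2, v3) with two constraints
cM = np.array([[1.0, 0.0, 0.0],    # v1 = 2.0       (fix k0)
               [0.0, 1.0, -1.0]])  # v2 - v3 = 0.7  (scale k0's)
cV = np.array([2.0, 0.7])
cA, cB = constraint_to_free(cM, cV)
x = np.array([1.3])                # one free parameter remains
r = cA @ x + cB                    # satisfies cM . r = cV exactly
```

With three constrained variables and two independent constraints, exactly one free parameter survives, as expected.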
3.1.2 Likelihood function
The incomplete data likelihood of a sequence of data points, as calculated by the Forward-Backward
procedure, can be written in matrix form as:

L = P_0^T · B_0 · (A_0 · B_1) · (A_1 · B_2) ··· (A_{t-1} · B_t) ··· (A_{T-1} · B_T) · 1    [3.1.2.1]
P0 is the state initial probabilities (column) vector, with properties identical to those of π,
in expressions [2.1.1.2] to [2.1.1.4]. Bt is a diagonal matrix, whose diagonal entries are the
Gaussian conditional emission probability densities, evaluated at data point t:
B_t(i, i) = p(y_t | x_t = i)    [3.1.2.2]

B_t(i, j) = 0,  i ≠ j    [3.1.2.3]

p(y_t | x_t = i) = N(μ_i, σ_i² | y_t)    [3.1.2.4]
At is the transition probability matrix at time t. The time index is necessary for non-stationary
data, when the data is a function of a time-changing stimulus or when the data is gathered under
different conditions.
As it stands in expression [3.1.2.1], computing the likelihood function is equivalent to
propagating the knowledge about state probabilities along the data sequence, in the forward
direction, without normalization. The computation starts with the a priori knowledge of the initial
state probabilities, given by the (transposed) vector P0. The prior is updated into a posterior, using
the evidence, i.e. the first data point y0. The posterior of the current point is projected ahead, as
the prior of the next point, which is again updated using the evidence. The value of the likelihood
function is obtained by summing the state probabilities at the last point, which is achieved by
multiplication with the column vector of ones.
The equivalent prediction-correction formulae are:

initialize:   P̃_0^T = P_0^T    [3.1.2.5]
              P̂_0^T = P̃_0^T · B_0    [3.1.2.6]
predict:      P̃_t^T = P̂_{t-1}^T · A_{t-1}    [3.1.2.7]
correct:      P̂_t^T = P̃_t^T · B_t    [3.1.2.8]
terminate:    L = P̂_T^T · 1    [3.1.2.9]
The predicted state probabilities are corrected by the evidence, i.e. by post-multiplication
with the B matrix. Clearly, the emission probability in [3.1.2.4] gives higher values for those
states that explain better the observation. In other words, data has evidence that supports certain
states more than others. Since B is diagonal, each element of P, Pi, is multiplied only with the
corresponding diagonal element of B, Bii. Thus, some state probabilities are increased relative to
others. Obviously, the more the prediction matches the evidence, point by point, the higher the
value of the likelihood.
Generally, post-multiplication by the B matrix does not preserve the column stochastic
character of the P vector. Therefore, in the computation of the likelihood function, the state
probabilities may decrease or increase rapidly, leading to numerical under- or overflow. This is
prevented by re-normalizing the state vector at each point and using the normalized vector for
computation of the next point. It can be shown that the likelihood is equal to the product of the
norms at each point. Naturally, this quantity can be extremely small or large too, thus the
logarithm of the likelihood should be calculated instead. The log-likelihood is equal to the sum of
the norms’ logarithms. The normalization of the corrected P is:
normalize:    P_t = P̂_t / Σ_{i=1}^{NS} P̂_t(i)    [3.1.2.10]
The calculation of the normalized state vector can be regarded as an application of
Bayesian theory. The posterior state probability distribution is calculated as:
pxt 1  i | yt 1  
p yt 1 | xt 1  i   Pxt 1  i 
p yt 1 
65
[3.1.2.11]
The prior, the likelihood, the evidence and the posterior are:

prior:        P(x_{t+1} = i) = (P_t^T · A_t)(i)    [3.1.2.12]
likelihood:   p(y_{t+1} | x_{t+1} = i) = B_{t+1}(i, i)    [3.1.2.13]
evidence:     p(y_{t+1}) = Σ_{i=1}^{NS} (P_t^T · A_t · B_{t+1})(i)    [3.1.2.14]
posterior:    p(x_{t+1} = i | y_{t+1}) = (P_t^T · A_t · B_{t+1})(i) / Σ_{i=1}^{NS} (P_t^T · A_t · B_{t+1})(i)    [3.1.2.15]
Notice that P(xt+1=i|yt+1) is a simple number, representing the probability distribution of
state x given observation y, evaluated for state i.
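The normalized recursion [3.1.2.5]-[3.1.2.10] can be sketched as follows; log L is accumulated as the sum of the log norms, and a time-indexed A_t is allowed for non-stationary data. The Gaussian emission model follows [3.1.2.4]; the NumPy formulation is one possible implementation:

```python
import numpy as np

def forward_loglik(y, P0, A, mu, sigma):
    """Log-likelihood via the normalized prediction-correction recursion
    [3.1.2.5]-[3.1.2.10]; log L is the sum of the logs of the norms.
    A may be (N, N) for stationary data, or (T-1, N, N) for non-stationary
    data (a time-indexed A_t)."""
    T = len(y)
    # Gaussian emission densities b_t(i) = N(mu_i, sigma_i^2 | y_t)  [3.1.2.4]
    b = (np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2)
         / (sigma * np.sqrt(2 * np.pi)))
    stationary = (np.ndim(A) == 2)
    P = P0 * b[0]              # initialize and correct  [3.1.2.5]-[3.1.2.6]
    norm = P.sum()
    loglik = np.log(norm)
    P = P / norm               # normalize  [3.1.2.10]
    for t in range(1, T):
        At = A if stationary else A[t - 1]
        P = (P @ At) * b[t]    # predict [3.1.2.7], correct [3.1.2.8]
        norm = P.sum()
        loglik += np.log(norm)
        P = P / norm
    return loglik
```

Because the state vector is renormalized at every point, the recursion never under- or overflows, and the log-likelihood is recovered exactly as the sum of the log norms.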
3.1.3 Computation and derivatives of the likelihood function
Estimating maximum likelihood parameters involves maximizing the likelihood function. To do so efficiently, and to obtain standard errors of the estimates, I use the Davidon-Fletcher-Powell algorithm (Fletcher and Powell, 1963), a quasi-Newton method with approximate line search. The same method is used by MIL and MPL (Qin et al, 1996, 2000a). The algorithm is known in the literature as dfpmin; I used the implementation in Press et al, 1992.
Two variables, α and β, are defined, each a column vector of dimension NS. They are initialized and calculated recursively, α in the forward and β in the backward direction, as follows:

α_0^T = P_0^T · B_0    [3.1.3.1]

α_t^T = P_0^T · B_0 · (A_0·B_1) · (A_1·B_2) · (A_2·B_3) ··· (A_{t-1}·B_t)    [3.1.3.2]

α_{t+1}^T = α_t^T · (A_t · B_{t+1})    [3.1.3.3]

β_T = 1    [3.1.3.4]

β_t = (A_t·B_{t+1}) ··· (A_{T-1}·B_T) · 1    [3.1.3.5]

β_t = (A_t·B_{t+1}) · β_{t+1}    [3.1.3.6]

The likelihood L can be computed from α and/or β, in any of the following ways:

L = α_T^T · 1    [3.1.3.7]

L = P_0^T · B_0 · β_0    [3.1.3.8]

L = α_t^T · β_t    [3.1.3.9]
As mentioned previously, the probabilities are normalized to prevent numerical under- or overflow:

α̂_0^T = P_0^T · B_0    [3.1.3.10]

s_0 = Σ_i α̂_0(i)    [3.1.3.11]

α̃_0 = α̂_0 · (1/s_0)    [3.1.3.12]

α̂_{t+1}^T = α̃_t^T · (A_t · B_{t+1})    [3.1.3.13]

s_{t+1} = Σ_i α̂_{t+1}(i)    [3.1.3.14]

α̃_{t+1} = α̂_{t+1} · (1/s_{t+1})    [3.1.3.15]

The same scaling factors, s_t, calculated for the forward variable, are used for the backward variable. β̃_T is not scaled, and the rest are scaled as:

β̃_T = β_T    [3.1.3.16]

β̂_t = (A_t · B_{t+1}) · β̃_{t+1}    [3.1.3.17]

β̃_t = β̂_t · (1/s_{t+1})    [3.1.3.18]

Using the same scaling factors for α and β has the following consequences:

α̃_t^T · β̃_t = 1    [3.1.3.19]

L = Π_{t=0}^{T} s_t    [3.1.3.20]
From expressions [3.1.3.9] and [3.1.3.19], the following relation can be inferred:

α_t^T · β_t = L · α̃_t^T · β̃_t    [3.1.3.21]

And thus:

α_t^T · (A_t·B_{t+1}) · β_{t+1} = L · α̃_t^T · (A_t·B_{t+1}) · β̃_{t+1}    [3.1.3.22]

α_t^T · [∂(A_t·B_{t+1})/∂x] · β_{t+1} = L · α̃_t^T · [∂(A_t·B_{t+1})/∂x] · β̃_{t+1}    [3.1.3.23]

The derivative of the log-likelihood function can be obtained from the derivative of the likelihood function, as follows:

∂log L/∂x = (1/L) · ∂L/∂x    [3.1.3.24]
In the following derivation, I apply the chain rule of matrix differentiation:

∂(A·B)/∂x = (∂A/∂x) · B + A · (∂B/∂x)    [3.1.3.25]

∂(A_1·A_2 ··· A_k)/∂x = Σ_{i=1}^{k} A_1 ··· A_{i-1} · (∂A_i/∂x) · A_{i+1} ··· A_k    [3.1.3.26]

The derivatives of the likelihood function and of its logarithm, with respect to a variable x, can be expressed in terms of α and β as:

∂L/∂x = [∂(P_0^T·B_0)/∂x] · β_0 + Σ_{t=0}^{T-1} α_t^T · [∂(A_t·B_{t+1})/∂x] · β_{t+1}    [3.1.3.27]

∂log L/∂x = (1/L) · [∂(P_0^T·B_0)/∂x] · β_0 + (1/L) · Σ_{t=0}^{T-1} α_t^T · [∂(A_t·B_{t+1})/∂x] · β_{t+1}    [3.1.3.28]

The un-normalized α's and β's can be replaced with their normalized counterparts, using formula [3.1.3.23]. After simplification with L, this gives:

∂log L/∂x = [∂(P_0^T·B_0)/∂x] · β̃_0 + Σ_{t=0}^{T-1} α̃_t^T · [∂(A_t·B_{t+1})/∂x] · β̃_{t+1}    [3.1.3.29]

If the B matrix is not a function of any of the optimized parameters, it can be taken out of the differentiation sign:

∂log L/∂x = (∂P_0^T/∂x) · B_0 · β̃_0 + Σ_{t=0}^{T-1} α̃_t^T · (∂A_t/∂x) · B_{t+1} · β̃_{t+1}    [3.1.3.30]
As indicated in Chapter 2, the A matrix can be calculated from the Q matrix using the spectral decomposition technique (Moler and Van Loan, 1978):

A = exp(Q·Δt) = Σ_i A_i · exp(λ_i·Δt)    [3.1.3.31]

The A_i's are the spectral matrices and the λ_i's are the eigenvalues of Q. The derivative of the A matrix with respect to a parameter x has the form (Najfeld and Havel, 1994):

∂A/∂x = ∂exp(Q·Δt)/∂x = Σ_k Σ_l A_k · (∂Q/∂x) · A_l · f_kl(Δt)    [3.1.3.32]

f_kl(Δt) = Δt · e^{λ_k·Δt}                           if λ_k = λ_l
f_kl(Δt) = (e^{λ_k·Δt} − e^{λ_l·Δt}) / (λ_k − λ_l)   if λ_k ≠ λ_l    [3.1.3.33]
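A minimal sketch of the spectral computation of A and of its derivative, assuming a hypothetical two-state Q matrix (diagonalizable, with distinct eigenvalues) and taking the rate q01 as the parameter x:

```python
import numpy as np

# Spectral form of A = exp(Q*dt) and dA/dx, eqs. [3.1.3.31]-[3.1.3.33].
q01, q10, dt = 100.0, 50.0, 1e-3
Q = np.array([[-q01,  q01],
              [ q10, -q10]])
dQ = np.array([[-1.0, 1.0],
               [ 0.0, 0.0]])              # dQ/dq01, cf. eq. [3.1.3.48]

lam, V = np.linalg.eig(Q)                 # eigenvalues, right eigenvectors
W = np.linalg.inv(V)                      # rows of V^-1 are left eigenvectors
Ak = [np.outer(V[:, k], W[k, :]) for k in range(2)]   # spectral matrices A_k

A = sum(Ak[k] * np.exp(lam[k] * dt) for k in range(2))       # eq. [3.1.3.31]

def f(k, l):
    """f_kl(dt) of eq. [3.1.3.33]."""
    if np.isclose(lam[k], lam[l]):
        return dt * np.exp(lam[k] * dt)
    return (np.exp(lam[k] * dt) - np.exp(lam[l] * dt)) / (lam[k] - lam[l])

dA = sum(Ak[k] @ dQ @ Ak[l] * f(k, l)
         for k in range(2) for l in range(2))                # eq. [3.1.3.32]
```

A finite-difference check (perturbing q01 and recomputing A) should reproduce dA to numerical precision, and the rows of A sum to one, as required of a transition matrix.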
In order to calculate the log-likelihood function and its derivatives, the following quantities are required: the normalized forward and backward variables (α̃ and β̃), the normalizing coefficients (s_t), the eigenvalues (λ_i) and the spectral matrices (A_i) of the Q matrix, the derivatives of the Q matrix with respect to the non-zero transition rates, and the vector of initial probabilities (P_0) and its derivatives with respect to the non-zero rate constants. In the following, I present the required calculations.

The initial probabilities vector P_0 can be specified in several ways. One is to let the user assign values, in which case the derivative of P_0 w.r.t. a parameter x is the zero vector:

∂P_0/∂x = 0    [3.1.3.34]
Alternatively, P_0 can be the equilibrium distribution, satisfying the conditions:

∂P_0/∂t = P_0 · Q_eq = 0    [3.1.3.35]

0 ≤ P_0(i) ≤ 1    [3.1.3.36]

Σ_i P_0(i) = 1    [3.1.3.37]

Q_eq is the matrix for which the equilibrium conditions hold. It could be the same as Q, or different. For example, one could set up a conditioning-test protocol, in which the starting probabilities for the test pulse represent the equilibrium determined by the conditioning pulse experimental conditions. It is then necessary to express P_0 as a function of Q_eq. Let R be a matrix of dimension NS x (NS+1), formed by adding a column of ones to the right-hand side of Q_eq. P_0 is then (Colquhoun and Sigworth, 1995a):

P_0 = u^T · (R · R^T)^{-1}    [3.1.3.38]
u is a column vector of ones, of dimension NS. The derivative of P_0 w.r.t. a variable x can be obtained in the following way:

(R·R^T) · (R·R^T)^{-1} = I    [3.1.3.39]

∂/∂x [(R·R^T) · (R·R^T)^{-1}] = ∂I/∂x = 0    [3.1.3.40]

[∂(R·R^T)/∂x] · (R·R^T)^{-1} + (R·R^T) · [∂(R·R^T)^{-1}/∂x] = 0    [3.1.3.41]

∂(R·R^T)^{-1}/∂x = −(R·R^T)^{-1} · [∂(R·R^T)/∂x] · (R·R^T)^{-1}    [3.1.3.42]

∂(R·R^T)/∂x = (∂R/∂x) · R^T + R · (∂R^T/∂x)    [3.1.3.43]

∂P_0/∂x = u^T · [∂(R·R^T)^{-1}/∂x]    [3.1.3.44]

∂P_0/∂x = −u^T · (R·R^T)^{-1} · [∂(R·R^T)/∂x] · (R·R^T)^{-1}    [3.1.3.45]

∂P_0/∂x = −u^T · (R·R^T)^{-1} · [(∂R/∂x)·R^T + R·(∂R^T/∂x)] · (R·R^T)^{-1}    [3.1.3.46]
The expression above may appear complex, but its computational cost is small, since the few matrix multiplications and additions involved are required only once per iteration.
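The equilibrium computation of expression [3.1.3.38] can be sketched as follows; the three-state generator and its rates are hypothetical:

```python
import numpy as np

# Equilibrium occupancy via eq. [3.1.3.38]: P0 = u^T (R R^T)^{-1}, R = [Q_eq | u].
Q = np.array([[-300.,  300.,    0.],
              [ 100., -150.,   50.],
              [   0.,  200., -200.]])
NS = Q.shape[0]
u = np.ones(NS)

R = np.hstack([Q, u[:, None]])     # NS x (NS+1): column of ones appended to Q_eq
P0 = u @ np.linalg.inv(R @ R.T)    # eq. [3.1.3.38]

print(P0, P0 @ Q)                  # P0 sums to 1 and satisfies P0 . Q = 0
```

Appending the column of ones makes R full rank even though Q itself is singular, so the inverse exists and the normalization constraint is built into the solution.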
The log-likelihood is a function of the Q matrix, or of several Q matrices, one for each set of experimental conditions that affect the ion channel response. For example, one may want to apply a sequence of voltage or concentration pulses, to make the channel visit the otherwise short-lived kinetic states more often. While each different concentration or voltage gives a different Q matrix, the set of free parameters remains the same, as it does not depend on the experimental conditions. Thus each Q matrix involved in the likelihood computation is a function of the same free parameters.
In the general case, the derivative of the log-likelihood function LL, with respect to a free parameter x_k, has the form:

∂LL/∂x_k = Σ_c Σ_{i,j} (∂LL/∂q_ij^c) · [(∂q_ij^c/∂v_ij)·(∂v_ij/∂x_k) + (∂q_ij^c/∂w_ij)·(∂w_ij/∂x_k)]    [3.1.3.47]

c is an index over all experimental conditions. i and j are indices over the entries in the Q matrix corresponding to non-zero rates in the kinetic model.
In order to calculate expression [3.1.3.47], the derivative of Q w.r.t. q_ij is necessary. Let the matrix Q'_ij be this derivative. Q'_ij has the properties:

(Q'_ij)_kl = 1     if k = i and l = j
(Q'_ij)_kl = −1    if k = i and l = i
(Q'_ij)_kl = 0     otherwise    [3.1.3.48]
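A sketch of the Q'_ij construction; the state count and the indices are arbitrary:

```python
import numpy as np

# Derivative of Q with respect to a single rate q_ij, eq. [3.1.3.48]:
# +1 at (i,j), -1 at (i,i) (the diagonal compensates, so rows still sum to 0),
# and 0 elsewhere.
def dQ_dqij(NS, i, j):
    D = np.zeros((NS, NS))
    D[i, j] = 1.0
    D[i, i] = -1.0
    return D

D = dQ_dqij(4, 1, 2)
print(D.sum(axis=1))   # every row sums to zero, as for Q itself
```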
The derivative of the log-likelihood function with respect to a macroscopic rate q_ij^c is:

∂LL/∂q_ij^c = (∂P_0^T/∂q_ij^c) · B_0 · β̃_0 + Σ_{t=0}^{T-1} α̃_t^T · (∂A_t/∂q_ij^c) · B_{t+1} · β̃_{t+1}    [3.1.3.49]
The first term in the right-hand side is non-zero only for the index c corresponding to the
experimental conditions that give the starting probabilities. If P0 is fixed (user-input), the term is
zero for any c. The term inside the summation is non-zero only if the experimental conditions at
time t are the same as those indexed by c. The consequence is that the computational cost for
expression [3.1.3.49] remains constant regardless of the number of different experimental
conditions, since the same number of non-zero terms (T+1) are to be calculated.
The derivatives of the macroscopic rate constant q_ij w.r.t. the constrained parameters are:

∂q_ij/∂v_ij = q_ij    [3.1.3.50]

∂q_ij/∂w_ij = q_ij · V_ij    [3.1.3.51]

The derivatives of the constrained parameters w.r.t. a free parameter x_k are:

∂v_ij/∂x_k = cA^v_{ij,k}    [3.1.3.52]

∂w_ij/∂x_k = cA^w_{ij,k}    [3.1.3.53]
As mentioned, the dfpmin optimizer provides variance estimates for the optimized free parameters x. Variance estimates of the pre-exponential and exponential factors, k0 and k1, can be derived as follows (Qin et al, 2000a):

Var(k_ij^0) = Σ_k Var(x_k) · (dk_ij^0/dv_ij · dv_ij/dx_k)^2    [3.1.3.54]

Var(k_ij^1) = Σ_k Var(x_k) · (dk_ij^1/dw_ij · dw_ij/dx_k)^2    [3.1.3.55]

Var(k_ij^0) = Σ_k Var(x_k) · (k_ij^0 · cA^v_{ij,k})^2    [3.1.3.56]

Var(k_ij^1) = Σ_k Var(x_k) · (cA^w_{ij,k})^2    [3.1.3.57]
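A sketch of the error propagation in expressions [3.1.3.56] and [3.1.3.57], with hypothetical variances and constraint coefficients:

```python
import numpy as np

# First-order propagation of the optimizer's parameter variances to the
# pre-exponential (k0) and exponential (k1) factors. All numbers are illustrative.
var_x = np.array([0.04, 0.01])   # Var(x_k) from the optimizer (e.g. dfpmin)
cAv = np.array([1.0, 0.5])       # dv_ij/dx_k, cf. eq. [3.1.3.52]
cAw = np.array([0.0, 2.0])       # dw_ij/dx_k, cf. eq. [3.1.3.53]
k0 = 150.0                       # estimated pre-exponential factor k0_ij

var_k0 = np.sum(var_x * (k0 * cAv) ** 2)   # eq. [3.1.3.56]
var_k1 = np.sum(var_x * cAw ** 2)          # eq. [3.1.3.57]
print(var_k0, var_k1)
```

The factor k0 appears inside the square in eq. [3.1.3.56] because k0 = exp(v), so dk0/dv = k0, whereas k1 = w enters linearly.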
3.1.4 Speeding up the computation
The algorithm as presented could be applied to raw data, in which case it becomes quasi-identical to MPL (Qin et al, 2000a). Alternatively, it can be applied to idealized data, by replacing the emission probabilities in the B_t matrix with 0's and 1's: 1 when the class of the idealized data point at time t is the same as the class of the entry in the matrix, and 0 otherwise. The 0 and 1 entries in the B_t matrix have the effect of selecting a single class sequence out of all possible sequences. For this situation, I propose a way of speeding up the calculation, by pre-computing the multiplications in the likelihood function. As long as the class of the data points remains unchanged, the likelihood function and its derivative contain a product of identical terms. For example, K consecutive data points of the same class have identical B and A matrices:
L = ··· (A_{t-1}·B_t) · (A_t·B_{t+1}) · (A_{t+1}·B_{t+2}) ··· (A_{t+K-2}·B_{t+K-1}) ···    [3.1.4.1]

This is the same as multiplying with one term raised to the Kth power:

L = ··· (A_{t-1}·B_t)^K ···    [3.1.4.2]

The Kth power can be written as a product of terms raised to smaller powers, with the condition that these powers sum up to K:

(A_{t-1}·B_t)^K = (A_{t-1}·B_t)^{k_1} · (A_{t-1}·B_t)^{k_2} ··· (A_{t-1}·B_t)^{k_M}    [3.1.4.3]

Σ_{l=1}^{M} k_l = K    [3.1.4.4]

The powers of the individual terms can be chosen to be themselves powers of two, e.g. 1, 2, 4, 8, etc. This choice has the consequence that a term raised to any even power is equal to the product of two terms raised to half that power:

(A_{t-1}·B_t)^{2k} = (A_{t-1}·B_t)^k · (A_{t-1}·B_t)^k    [3.1.4.5]

It is therefore enough to calculate the term (A_{t-1}·B_t) and, via expression [3.1.4.5], its 2nd, 4th, 8th… powers, up to a given maximum. Obviously, the calculation should be done for each conductance class. Suppose, for example, that there is a dwell 73 points long (strictly speaking, not a dwell, but 73 consecutive points of the same class). Instead of 73 multiplications, 3 are enough:

(A_{t-1}·B_t)^{73} = (A_{t-1}·B_t)^{64} · (A_{t-1}·B_t)^{8} · (A_{t-1}·B_t)    [3.1.4.6]
The three powers on the right-hand side of the above equation are pre-calculated. This results in a considerable speed improvement. Care must be taken, though, to avoid numerical under- or overflow, which may occur if too many terms are multiplied before normalization. The speed-up treatment applies to the log-likelihood function as well as to its derivatives. Obviously, for the pre-multiplications to be valid, the A matrix must be constant in all terms involved.
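The binary-power dwell trick can be sketched as follows, for a hypothetical two-state A and B and a 73-point dwell:

```python
import numpy as np

# Dwell speed-up of eqs. [3.1.4.2]-[3.1.4.6]: (A·B)^K via pre-computed
# powers of two, decomposing K into its binary digits (73 = 64 + 8 + 1).
A = np.array([[0.99, 0.01], [0.05, 0.95]])
B = np.diag([1.2, 0.001])      # emission matrix for the current class
M = A @ B

# pre-compute M^1, M^2, M^4, ..., M^64 by repeated squaring, eq. [3.1.4.5]
powers = [M]
for _ in range(6):
    powers.append(powers[-1] @ powers[-1])

def dwell_product(K):
    """Product of K identical (A·B) factors, using the binary decomposition."""
    out = np.eye(2)
    for bit, P in enumerate(powers):
        if K & (1 << bit):
            out = out @ P
    return out

print(np.allclose(dwell_product(73), np.linalg.matrix_power(M, 73)))
```

Since every factor is a power of the same matrix M, the factors commute and the order of multiplication does not matter.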
I mentioned in the introduction to this chapter that, with very noisy data and/or very fast kinetics, the likelihood function does not peak strongly for a particular class sequence. Hence, choosing a single sequence over the others results in significant errors. A threshold on state assignment probabilities could be defined, say 0.9. If any one state i has a probability above the threshold at a given point, the algorithm could use the idealization there; otherwise it uses the raw data. This mix of algorithms provides a useful compromise between computation speed and accuracy. No results of this kind of optimization are yet available.
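The proposed mix could take a form such as the following per-point sketch; the function name, threshold and numbers are hypothetical:

```python
import numpy as np

# Hybrid rule: use the idealized (0/1) emission row when the posterior state
# probability exceeds a threshold, else fall back to the raw-data emissions.
THRESHOLD = 0.9

def emission_row(posterior, raw_b_row, n_states):
    """Return one row of B_t: hard 0/1 selection, or raw emission densities."""
    best = np.argmax(posterior)
    if posterior[best] > THRESHOLD:
        row = np.zeros(n_states)
        row[best] = 1.0          # confident point: select a single class
        return row
    return raw_b_row             # ambiguous point: keep the raw densities

row = emission_row(np.array([0.97, 0.03]), np.array([0.4, 0.2]), 2)
print(row)   # -> [1. 0.], since 0.97 exceeds the threshold
```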
3.2 Results
Since MIP is a close kin of the MPL algorithm, an exhaustive evaluation of its
performance and characteristics is not necessary. Instead, in Chapter 4 I use MIP with baseline
correction, side by side with MIL and MPL. In Chapter 5, on applications of infinite state-space
Markov models to single molecule data, I modify MIP in order to apply it to single molecule
fluorescence data from kinesin. In Chapter 6, the same parameterization of the transition rates,
linear constraint formalism, form of the log-likelihood function derivatives for stationary and
non-stationary data and other computational aspects are used for analysis of ensemble Markov
models applied to macroscopic patch-clamp recordings.
3.3 Discussion
The results in Chapter 4, and further results that are not shown here, indicate that the method I proposed is valid. Not surprisingly, its estimates are nearly identical to those obtained
with MPL, when the data has a good signal-to-noise ratio and the likelihood function is peaked
for one class sequence. For higher noise, the performance of the idealization procedure itself
degrades. At SNR ≤ 1 the degradation is so extreme that only MPL, i.e. raw data, can be used.
The missed event correction in MIL can be pushed further than the minimum of one
sampling interval, in which case the missed events caused by over-filtering are, to some extent,
recovered. In practice though, the choice of the dead time parameter for missed events caused by
over-filtering is subject more to the user’s experience than to a rigorous formulation, as it is
nonlinear and depends on unknown factors, such as the frequency response of the patch clamp
amplifier. Limited as it is, the missed event correction is remarkably good for most practical
situations.
When the data has a very low signal to noise ratio, the re-estimation/idealization
algorithm (whether Segmental k-means or Baum-Welch) makes errors of a different kind. When
the Gaussian conditional emission probabilities start to overlap, inevitably both false positive and
false negative classification errors occur. This is different from the over-filtering case, which
results only in false negative errors. MIP does not have, and does not require, a missed event correction. When over-filtering is suspected of causing idealization errors with standard algorithms, re-estimation/idealization methods suitable for correlated noise can be used (Qin et al, 2000b) or, more simply, the data can be decimated to the Nyquist limit. When high noise is the source of idealization errors, MIP performs similarly to MIL. Very fast kinetics push MIL to the limits of its first-order missed-event correction, but they do not seem to affect either MIP or idealization with Baum-Welch (results not shown).
4. Signal detection and baseline correction of single channel records
Both discrete and continuous time Markov models can be used for kinetic analysis of
single channel records, as explained in Chapter 3. In addition to kinetics, the current amplitude
and the state noise characteristics of each conductance class are of interest. Conductance
parameters can be estimated with discrete time hidden Markov models. Baum-Welch (Baum et
al, 1970) and Segmental k-means (Rabiner et al, 1986) algorithms estimate simultaneously the
maximum likelihood kinetics and conductance parameters. Both algorithms include a state
inference step, performed respectively, by Forward-Backward (Baum et al, 1970) and Viterbi
(Viterbi, 1967; Forney, 1973). An optimal sequence of conductance classes is obtained as a byproduct of the state inference step. The reconstruction of a continuous sequence of dwells from
digitized, noisy data is a procedure known as data idealization. From the idealized data,
algorithms based on continuous time Markov models estimate maximum likelihood kinetic
parameters (Colquhoun et al, 1996; Qin et al, 1996).
Data idealization and conductance parameter estimation generally rely on ideal data
properties: Gaussian and non-correlated noise, stationary conductance parameters and a steady
baseline. Existing methods can deal with correlated noise and filtering effects (De Gunst et al,
2001; Qin et al, 2000b; Venkataramanan et al, 1998a, b, 2000), but the presence of baseline
artifacts is a more problematic issue. Examples of baseline artifacts are: slow drift caused by
changes in electrode or membrane potential, changes in patch or seal properties, sinewave
interference from power lines, or other stochastic fluctuations of unknown nature. Non-stationary
baseline affects the accuracy of estimates and, in extreme cases, may prevent any kind of
analysis.
The problem of deterministic baseline artifacts in single channel records has been
considered before, and correction methods proposed (Sachs et al, 1982; Chung et al, 1991;
Baumgartner et al, 1998). In contrast, to my knowledge stochastic baseline fluctuations have
never been addressed. In fact, deterministic artifacts can be regarded as a limit case of stochastic
fluctuations. It is easy to see how all baseline artifacts may be interpreted as non-stationary
conductance parameters, i.e. changes in the mean current. I model baseline fluctuations, of any
nature, with a discrete time, continuous state, dynamic stochastic process, namely the linear
Gaussian model (LGM; see Chapter 2). The stochastic baseline process affects the emission
probabilities of the hidden Markov model which describes the ion channel signal.
In the following section, I discuss the relevance of hidden Markov models with stochastic
parameters, or with mixed states, to the baseline correction problem. Next, I describe the linear
Gaussian model of the stochastic baseline process. I introduce two novel methods for baseline
correction, each based on coupling algorithms for hidden Markov and for linear Gaussian models.
I test the resulting baseline correction methods with simulated and experimental single channel
data, contaminated with different types of artifacts. I determine the performance of the algorithms
in removing stochastic or deterministic (sinusoidal interference) artifacts and establish their
operational range, with regard to state noise and the magnitude of the ion channel signal relative
to baseline fluctuations. The performance is evaluated by comparing the kinetics and conductance
parameters estimated from non-contaminated data with those estimated from baseline-corrected
data.
4.1 Hidden Markov models with stochastic parameters or with mixed states
Hidden Markov models with stochastic parameters are interesting in two situations: i).
the physical process modeled has genuinely stochastic, time-variable parameters; ii). there is
another stochastic process which contaminates the experimental measurement. Ion channels with
modal kinetics are an example of the first class; these channels switch between different kinetic
schemes. The kinetic parameters change between different discrete values, for example as a result
of molecular interactions, such as phosphorylation (Swope et al, 1999). The changes in
parameters may be continuous as well, as indicated by ion channels with the so-called
“wanderlust” kinetics (Silberberg et al, 1996). Baseline fluctuations are an example of the second
class, in which the measurement of a hidden Markov model is contaminated by another stochastic
process. Although the measurement characteristics do not change in time as a stochastic process,
they can be interpreted as such. The result is a Markov model with stochastic emission
parameters.
When the stochastic parameters take discrete values, the overall process can be modeled
with another, larger Markov meta-model. The meta-model has discrete meta-states, which are
combinations of process and parameter states. In comparison, when the stochastic parameters take
continuous values, the overall process can be modeled either as a Markov model with stochastic
parameters or as a mixed-state (continuous and discrete) dynamic model. The choice of modeling
with stochastic parameters or with mixed states is determined, in practice, by which interpretation
has the most efficient problem-solving algorithm.
The economics and engineering literature abounds with examples of mixed-state models (Shumway and Stoffer, 1991; Kim, 1994; Ghahramani and Hinton, 1996). While meta-models are
approached with the same Markov-based algorithms, models with mixed states are much less
amenable to analysis. Existing algorithms are Monte Carlo (Carter and Kohn, 1996), Bayesian
(Billio et al, 1998) or EM based (Logothetis and Krishnamurthy, 1999; Doucet and Andrieu,
1999), and are all very recent developments.
4.2 Baseline linear Gaussian model
For ion channels, the emission probabilities of the hidden Markov model are, ideally, Gaussian distributions, with constant mean μ_i and variance σ_i^2, conditional on the state i:

p(y_t | x_t = i) = N(μ_i, σ_i^2 | y_t)    [4.2.1]

I modify this representation by replacing the constant conditional mean μ_i with a time-variable conditional mean μ_{i,t}:

μ_{i,t} = μ_i + b_t    [4.2.2]

b_t is the value of the baseline offset at time t. The new conditional emission probabilities are:

p(y_t | x_t = i) = N(b_t + μ_i, σ_i^2 | y_t)    [4.2.3]

By definition, all states in the closed conductance class have mean equal to zero (i.e. they do not conduct current):

μ_i = 0,  for all i in the closed class    [4.2.4]
The baseline stochastic process b is described by a linear Gaussian model, with process equation:

b_t = A_B · b_{t-1} + ε_{B,t-1}    [4.2.5]

Since the state vector b has a dimension of one (i.e. it is a scalar), b, the state transition matrix A_B, and the process noise covariance Q_B are all scalars. Thus, A_B is denoted by a_B and Q_B by q_B. By definition of a baseline, a slow deterministic drift present in single channel records is negligible on a short time scale. Thus it can be safely assumed that the state transition parameter a_B is equal to one, which corresponds to a slope of zero. I also assume that the process variance q_B is constant in time. Equation [4.2.5] simplifies to:

b_t = b_{t-1} + ε_B    [4.2.6]
Equation [4.2.6] represents an unbiased, unidimensional random walk, with expected mean b0 and
variance qB. The connection with the experimental physical phenomenon is evident, although not
perfect. In [4.2.6], bt is free to take any value and to drift probabilistically up or down (although
its expectation remains always equal to b0). In reality, if slow drift is discounted, the baseline is
constrained to stay around the same approximate value in spite of local fluctuations, stochastic or deterministic (such as sinusoidal). This behavior is quite different from that of an unbiased random walk, which is under no such constraint. The model baseline is memory-less: the next baseline offset is a function of the current offset only. In comparison, the real baseline appears to be memory-less too, but perhaps not to the same degree. An auto-regressive model of some (small) order k would be a more accurate representation, since the next state of an auto-regressive model is a function of the previous k states. Certainly, a more complex, higher-order model would better capture the characteristics of the real phenomenon, but the mathematical complexity and the computational costs may not be justified by the results. As will be demonstrated in the results section, the simple model in equation [4.2.6] is capable of tracking large fluctuations, stochastic or sinusoidal (and implicitly slow drift).
Since the emission probabilities are modified by the stochastic baseline process, the Markov model may be regarded as having stochastic parameters, with continuous states. Alternatively, the ion channel signal, contaminated with baseline fluctuations, may be described by a stochastic model with mixed states: discrete and continuous. A graphical depiction of this construct is shown in Figure [4.2.1]. The ion channel's Markov discrete state x_t depends probabilistically only on the previous state x_{t-1}. The continuous baseline state b_t depends probabilistically only on the previous state b_{t-1}. The measurement y_t is a probabilistic function of the state x_t and a deterministic function of the baseline b_t, as in equation [4.2.3]. The measurement equation of the mixed state process can be written as:

y_t = b_t + ε_{x_t}    [4.2.7]

ε_{x_t} is a random Gaussian variable, with mean and variance conditional on the Markov state x_t, and is the source of measurement errors. Clearly, as indicated by expression [4.2.7], y_t depends deterministically on b_t and probabilistically on x_t.
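A simulation sketch of the mixed-state signal of equations [4.2.6] and [4.2.7], with a hypothetical two-state channel; all parameter values are illustrative:

```python
import numpy as np

# Mixed-state simulation: a two-state channel (HMM) plus a random-walk baseline,
# combined by the measurement equation y_t = b_t + eps_{x_t}.
rng = np.random.default_rng(1)
T = 10_000
A = np.array([[0.99, 0.01], [0.02, 0.98]])   # closed <-> open transitions
mu = np.array([0.0, 1.0])                    # class means (closed class is 0)
sigma = np.array([0.1, 0.12])                # class standard deviations
q_B = 1e-4                                   # baseline process variance

x = np.zeros(T, dtype=int)
for t in range(1, T):
    x[t] = rng.choice(2, p=A[x[t - 1]])      # Markov channel state

b = np.cumsum(rng.normal(0.0, np.sqrt(q_B), T))   # b_t = b_{t-1} + eps_B
y = b + rng.normal(mu[x], sigma[x])               # y_t = b_t + eps_{x_t}
```

Plotting y against time shows channel dwells riding on a slowly wandering baseline, which is the situation the correction algorithms of section 4.3 address.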
[Figure 4.2.1 appears here: a lattice of nodes x_{t-1} → x_t → x_{t+1} (top row), y_{t-1}, y_t, y_{t+1} (middle row) and b_{t-1} → b_t → b_{t+1} (bottom row), with each y_t receiving arrows from x_t and b_t.]

Figure [4.2.1]. Mixed-states stochastic model of the ion channel signal contaminated with stochastic baseline fluctuations. Squares represent discrete states and circles represent continuous states. Dotted arrows denote stochastic dependencies and continuous arrows denote deterministic dependencies. Shaded symbols represent observed variables and clear symbols represent unobserved (hidden) variables. x_t is the hidden Markov model state (ion channel conformation) and b_t is the linear Gaussian model state (baseline offset).
The ion channel process and the baseline process are uncoupled, evolving independently
of each other as represented in Figure [4.2.1]. The state of one does not interfere in any way with
the state of the other. The measurement, however, is a function of both states, discrete and
continuous. The measurement couples the discrete (HMM) and the continuous (LGM) models for
the purpose of state inference, and thus for parameter estimation. If the Markov state sequence X
were known, the HMM signal could be subtracted from the observation sequence Y, and the
remainder signal (the baseline) could be processed with linear Gaussian algorithms. The converse
is also true: if the baseline were known, it could be subtracted from the observation sequence and
the remainder signal (the ion channel) could be analyzed with hidden Markov methods.
Figure [4.2.2] shows an example of baseline fluctuations generated with equation [4.2.6],
with process variance qB = 0.0001. bt is drawn as a function of time. It is interesting to notice the
quasi-identical appearance of data at different time scales, a characteristic of diffusional
processes.
Figure [4.2.3] presents power spectra of data generated with a Markov model, a linear
Gaussian model and non-correlated (white) Gaussian noise. The Markov spectrum is a sum of
Lorentzians, parameterized by the eigenvalues of the Q matrix. The LGM spectrum is of 1/f type,
which explains the time domain invariance with respect to time scale. The white Gaussian
spectrum is uniform at all frequencies.
[Figure 4.2.2 appears here. Panel A scale bars: 1 pA, 50 ms; panel B scale bars: 0.1 pA, 2 ms.]

Figure [4.2.2]. Data simulated with a linear Gaussian model, with process variance q_B = 0.0001, and no measurement error. The inset in A is expanded in B.
[Figure 4.2.3 appears here. Panels A (MM), B (Gaussian state noise) and C (LGM) plot log10 power (pA^2/Hz) against log10 frequency (Hz); panel D diagrams a three-state Markov scheme with rate labels 100.0/50.0 and 5000.0/2000.0.]

Figure [4.2.3]. Power spectra of dynamic models and of Gaussian white noise. A Spectrum of Markov model (no measurement noise), with open class amplitude 1.0. B Spectrum of white Gaussian noise, with standard deviation 0.1. C Spectrum of linear Gaussian model, with process standard deviation 0.01 (q_B = 0.0001). D Markov model used for simulation of data analyzed in A. All data have one million points and are sampled at 100 kHz.
4.3 Algorithms for baseline correction
The signal and the noise components commonly present in a patch clamp record are
summarized in Figure [4.3.1]. The three components, the ion channel signal, the background
noise and the baseline fluctuations are considered independent and hence additive in an RMS
sense. By definition, the states in the closed conductance class have no intrinsic noise, while the
open states should have a certain amount of excess noise. Various noise sources in the recording
chain, typically arising from the membrane patch and the amplifier, provide the wideband
background noise. This noise is Gaussian, and approximately white, although having a high
frequency boost followed by a bandlimiting filter roll-off (Venkataramanan et al, 1998a, b, 2000).
The baseline fluctuations are either stochastic or deterministic (e.g. slow drift with sinusoidal
interference).
The goal of baseline correction is to eliminate the baseline fluctuations and preserve the
other components. A perfectly deterministic corrupting signal can, in principle, be removed
without any error. However, when separation of two stochastic signals is attempted, false positive
and false negative classification errors occur. Therefore, removal of stochastic baseline artifacts
from another stochastic signal (the noisy hidden Markov model of the ion channel) can not be
perfect. The baseline removal procedure is subject to statistical errors and it will affect the
estimated Markov signal, the state excess noise and the background noise. For example, an
abrupt, upward change in baseline can be mistaken for a channel opening, thus modifying the
apparent kinetics. The variance observed in the data (after the Markov signal is subtracted) must
be partitioned between state excess noise and background noise, on one hand, and baseline
fluctuations (stochastic or not) on the other hand.
[Figure 4.3.1 appears here. Block diagram: the ion channel signal (hidden Markov model, with state excess noise), the background noise (additive white Gaussian noise) and the baseline fluctuations (linear Gaussian model) sum to give the patch clamp record.]

Figure [4.3.1]. Signal and noise components in a patch clamp record.
Even when there are no baseline fluctuations, a correction algorithm a fortiori
underestimates the variance of the state noise and the background noise, and overestimates the
variance of the baseline. The problem is complicated even more by the non-stationary character
of the artifacts model, the parameters of which are, at best, only approximations. Thus, no
baseline correction algorithm should be expected to perform flawlessly but, within a certain
signal-to-noise range, it should be robust and not introduce large errors. If the amplitude of
wideband background fluctuations remains well within the state amplitudes throughout the
record, standard state inference procedures, like Forward-Backward or Viterbi work acceptably
well. However, the variance estimation will be unreliable for the reasons discussed above. When
the signal to noise is low, standard algorithms fail and the kinetics cannot be solved.
The ion channel signal contaminated with baseline fluctuations can be modeled with a
dynamic model with mixed states (discrete and continuous). Although it is certainly possible to
compute the mixed states using the Bayesian theory, it is an intractable, combinatorial
optimization problem (Tugnait, 1982). Suppose the discrete state sequence (of the ion channel) is
known. Then the continuous state sequence (of the baseline) could be estimated optimally with a
Kalman filter-smoother. In contrast, if the discrete states are not known, the continuous sequence has to be estimated for each of the NS^T possible discrete sequences (where NS is the number of discrete states and T is the length of the data). Likewise, if the continuous state sequence were known, the discrete sequence could be optimally estimated with the hidden Markov filter-smoother (Forward-Backward). In contrast, if the continuous states are unknown, the discrete sequence has to be estimated by integration over the continuous state space, with one integral sign for each data point.
The intractability can be illustrated in a different way. Suppose that at time 0, the Markov
and the baseline states are, respectively, X0 and b0, where X0 is a state probability vector and b0 is
a Gaussian probability distribution with mean b0 and variance V0 (see Chapter 2, sections 2.1.1
and 2.1.2). The mixed state M0 is specified as the discrete-continuous pair:
M 0  X0 , b 0 
[4.3.1]
Bayesian estimation of the mixed state (MAP state estimation) involves the propagation of a
probability density. The discrete and continuous components of the mixed state can be predicted
independently from the previous state, for any t, as follows:
predicted discrete state:

\tilde{X}_{t+1}^T = X_t^T \cdot A    [4.3.2]

predicted continuous state:

\tilde{b}_{t+1} \sim N(b_t, V_t + q_B)    [4.3.3]
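The two prediction equations above can be sketched in a few lines of code. The transition matrix A, the state probability vector and the value of q_B below are illustrative, not values from the thesis:

```python
import numpy as np

# Hypothetical two-state transition matrix (rows sum to 1) and baseline
# process variance; both are illustrative, not values from the thesis.
A = np.array([[0.99, 0.01],
              [0.02, 0.98]])
q_B = 1e-4

def predict_mixed_state(X_t, b_t, V_t):
    """One-step prediction of the mixed state, per [4.3.2] and [4.3.3]."""
    X_pred = A.T @ X_t   # discrete part: X~_{t+1}^T = X_t^T . A
    b_pred = b_t         # continuous part: the random-walk mean carries over
    V_pred = V_t + q_B   # ...while its variance grows by q_B
    return X_pred, b_pred, V_pred
```

Note that the prediction step leaves the baseline mean unchanged and only inflates its variance, which is why the measurement update later in this section is needed to pull the baseline back toward the data.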
To simplify notation, I write xt=i as xti, for any t. The marginal probability distribution of
b0, conditional on the measurement y0, is obtained by marginalization with respect to each
possible discrete state i:
p(b_0 | y_0) = \sum_{i=1}^{N_S} p(b_0, x_0^i | y_0)    [4.3.4]

p(b_0, x_0^i | y_0) = \frac{p(y_0 | b_0, x_0^i) \cdot p(b_0) \cdot P(x_0^i)}{p(y_0)}    [4.3.5]
All the terms on the right-hand side of expression [4.3.5] are Gaussian, except P(x_0^i), which is a
simple number. Thus, the joint probability p(b_0, x_0^i | y_0) is also Gaussian. Then p(b_0 | y_0) is a sum of
N_S Gaussians, each multiplied by a weighting factor P(x_0^i).
The marginal probability distribution of b_1, conditional on the measurement y_1, is
obtained in the same way as for b_0, as follows:

p(b_1 | y_1) = \sum_{i=1}^{N_S} p(b_1, x_1^i | y_1)    [4.3.6]

p(b_1, x_1^i | y_1) = \frac{p(y_1 | b_1, x_1^i) \cdot p(b_1) \cdot P(x_1^i)}{p(y_1)}    [4.3.7]
But the prior p(b_1) is a projection of p(b_0 | y_0), which is a sum of N_S Gaussians. Hence,
p(b_1) is also a sum of Gaussians, one for each possible state of X_1. Thus p(b_1, x_1^i | y_1) is also a sum
of N_S Gaussians and, consequently, p(b_1 | y_1) is a sum of N_S × N_S Gaussians. By induction, the
marginal probability distribution of b_t, conditional on the measurement sequence {y_0, ..., y_t}, is a
sum of N_S^t Gaussians. Clearly, the conditional probability space of b grows exponentially with
the sequence length. In contrast, for models with either discrete or continuous states only, the
conditional space growth does not occur. Although a tractable solution to the state inference
problem for mixed states models is not possible, approximate methods exist (Boyen and Koller,
1998; Frey and MacKay, 1997; Koller et al, 1999; Lauritzen, 1992; Minka, 2001; Murphy et al,
1999; Opper and Winther, 1999; Shachter, 1990; Yedidia et al, 2000).
I present here two new algorithms, which I refer to as Viterbi-Kalman and
Forward-Backward-Kalman, respectively. The first finds the mixed state sequence which maximizes the
complete data likelihood of the discrete state sequence X, the continuous state sequence b and the
data sequence Y, using the Viterbi algorithm and the Kalman filter. The second is an
Expectation-Maximization algorithm which finds a maximum likelihood baseline sequence, interpreted as the
continuous, stochastic parameters of a hidden Markov model. The expectation step is performed
by a modified Forward-Backward procedure, and the maximization step is performed by a
modified Kalman filter-smoother. The algorithm has an additional step for parameter
initialization, an issue not addressed in a similar algorithm (Logothetis and
Krishnamurthy, 1999).
The two methods work well even for large amplitude fluctuations, and can correct
stochastic as well as deterministic artifacts (slow drift and sinusoidal interference). The first
method computes a filtered state estimate, while the second computes a smoothed estimate. Thus,
not surprisingly, the second method works better under high noise or heavy baseline
fluctuations, but it is slower than the first. Both methods give reliable state variance estimates.
4.3.1 Viterbi-Kalman algorithm
For a hidden Markov model, the Viterbi algorithm finds the most likely sequence of
states, by maximizing the complete data likelihood:
X^{ML} = \arg\max_{X} \, p(Y, X)    [4.3.1.1]
For a mixed state process, one would like to find the most likely sequence of mixed
states, by maximizing the complete data likelihood of the discrete state sequence X (ion channel),
the continuous state sequence b (baseline) and the data sequence Y:
X, bML  arg max  pY, X, b
[4.3.1.2]
X ,b
Since X and b evolve independently, the complete data likelihood expression can be written as:
pY, X, b  pY | X, b  PX  pb
[4.3.1.3]
T
pY, X, b   p y0 | x0 , b0   Px0   pb0    p yt | xt , bt   Pxt | xt 1   pbt | bt 1 
[4.3.1.4]
t 1
The quantities on the right-hand side of expression [4.3.1.4] are as described previously, except
p(y | x, b), which is the conditional Gaussian probability density of observing data point y, given
that the channel is in state x and the baseline offset is b:

p(y | x, b) = N(\mu_x + b, \sigma_x^2 \,|\, y)    [4.3.1.5]
The previous section has demonstrated that Bayesian mixed state inference is intractable,
because forward propagation of the continuous state probability density needs N_S^t Gaussians at
time t, for N_S discrete states. However, I show in the following that obtaining the most likely
sequence of discrete and continuous states, by maximizing the complete data likelihood, is a
tractable problem, by way of a certain approximation.
Suppose that at time 0, the mixed state, M0, is specified as discussed, with a discrete state
probability vector X0, and a continuous state Gaussian probability density b0, with mean b0 and
variance V0:
M 0  X0 , b 0 
[4.3.1.6]
Expression [4.3.1.6] gives the prior distribution of X_0 and b_0, i.e. in the absence of the
measurement y_0. To simplify notation, I write x_t = i as x_t^i, for any t. Given y_0 and x_0 = i, the
conditional posterior of b_0 can be calculated as follows:

p(b_0 | y_0, x_0^i) = \frac{p(b_0, y_0, x_0^i)}{p(y_0, x_0^i)}    [4.3.1.7]

p(b_0 | y_0, x_0^i) = \frac{p(y_0 | b_0, x_0^i) \cdot p(b_0, x_0^i)}{p(y_0, x_0^i)}    [4.3.1.8]

p(b_0 | y_0, x_0^i) = \frac{p(y_0 | b_0, x_0^i) \cdot p(b_0) \cdot P(x_0^i)}{p(y_0 | x_0^i) \cdot P(x_0^i)}    [4.3.1.9]

p(b_0 | y_0, x_0^i) = \frac{p(y_0 | b_0, x_0^i) \cdot p(b_0)}{p(y_0 | x_0^i)}    [4.3.1.10]
The measurement Y is a function of both X and b, conforming to [4.3.1.5]. When X is known, its
contribution to Y can be eliminated, i.e. subtracted from Y. Hence, expression [4.3.1.10] can be
written more simply as:

p(b_0 | y_0^i) = \frac{p(y_0^i | b_0) \cdot p(b_0)}{p(y_0^i)}    [4.3.1.11]

y_0^i = (y_0 | x_0^i) = y_0 - \mu_i    [4.3.1.12]


The terms on the right-hand side of expression [4.3.1.11] are Gaussian, thus p(b_0 | y_0^i) is also a
Gaussian, optimally calculated by the Kalman filter:

p(b_0 | y_0^i) = N(b_0^i, V_0^i)    [4.3.1.13]
I denote by \delta_0^i the complete data likelihood of the mixed state sequence and data
sequence, up to time t = 0. \delta_0^i is calculated as follows:

\delta_0^i = \max_{b_0} \, p(y_0, b_0, x_0^i)    [4.3.1.14]

\delta_0^i = \max_{b_0} \left[ p(y_0 | b_0, x_0^i) \cdot p(b_0) \cdot P(x_0^i) \right]    [4.3.1.15]

\delta_0^i = P(x_0^i) \cdot \max_{b_0} \left[ p(y_0 | b_0, x_0^i) \cdot p(b_0) \right]    [4.3.1.16]

Clearly, [4.3.1.16] is maximized by the same b_0 as obtained in [4.3.1.13], i.e. b_0^i:

\delta_0^i = p(y_0 | b_0^i, x_0^i) \cdot p(b_0^i) \cdot P(x_0^i)    [4.3.1.17]

\delta_0^i = p(y_0^i | b_0^i) \cdot p(b_0^i) \cdot P(x_0^i)    [4.3.1.18]
I denote by 1j the complete data likelihood of the mixed state sequence and data
sequence, up to time t=1. 1j is calculated as follows:
95


1j  max py0 , b0 , x0i , y1 , b1 , x1j 
b0 ,b1 ,i
[4.3.1.19]
1j  max py1 | b1 , x1j  pb1 | b0   Px1j | x0i  py0 | b0 , x0i  pb0   Px0i  [4.3.1.20]
b0 ,b1 ,i
1j  max py1j | b1  pb1 | b0   Px1j | x0i  py0i | b0  pb0   Px0i 
[4.3.1.21]
b0 ,b1 ,i
Generally, calculating \delta_t^j is not tractable, since \delta_t^j is a function of 2t+1
parameters (the t+1 baseline values and the t previous discrete states). However, an
approximation can be used that allows a recursive calculation of \delta_t^j. Thus, \delta_t^j can be
approximated as follows:

\delta_t^j \approx \max_i \left[ P(x_t^j | x_{t-1}^i) \cdot \max_{b_t} \left( p(y_t^j | b_t) \cdot p(b_t | b_{t-1}^i) \right) \cdot \delta_{t-1}^i \right]    [4.3.1.22]

As with Viterbi, the previous discrete state is stored in the \psi_t^j variable:

\psi_t^j = \arg\max_i \left[ P(x_t^j | x_{t-1}^i) \cdot \max_{b_t} \left( p(y_t^j | b_t) \cdot p(b_t | b_{t-1}^i) \right) \cdot \delta_{t-1}^i \right]    [4.3.1.23]
The maximization with respect to b_t is performed with a Kalman filter, which gives
p(b_t | y_t^j) as a Gaussian:

p(b_t | y_t^j) = N(b_t^j, V_t^j)    [4.3.1.24]


To calculate p(b_t | y_t^j), the standard Kalman forward expressions [2.3.2.3.4] to [2.3.2.3.8] are
used, with two modifications: the measurement y_t is replaced with y_t^j, and the measurement
variance r_t is replaced with \sigma_j^2. Thus, b_{t+1}^j and V_{t+1}^j are computed as follows:

y_{t+1}^j = (y_{t+1} | x_{t+1}^j) = y_{t+1} - \mu_j    [4.3.1.25]

r_{t+1}^j = (r_{t+1} | x_{t+1}^j) = \sigma_j^2    [4.3.1.26]

\tilde{b}_{t+1}^j = \hat{b}_t^i    [4.3.1.27]

\tilde{V}_{t+1}^j = \hat{V}_t^i + q_B    [4.3.1.28]

K_{t+1}^j = \tilde{V}_{t+1}^j \cdot (\tilde{V}_{t+1}^j + r_{t+1}^j)^{-1}    [4.3.1.29]

b_{t+1}^j = \tilde{b}_{t+1}^j + K_{t+1}^j \cdot (y_{t+1}^j - \tilde{b}_{t+1}^j)    [4.3.1.30]

V_{t+1}^j = \tilde{V}_{t+1}^j - K_{t+1}^j \cdot \tilde{V}_{t+1}^j    [4.3.1.31]
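One recursion of [4.3.1.25] to [4.3.1.31] reduces to a scalar Kalman update. The sketch below is a direct transcription; the function and variable names are mine, not from the thesis:

```python
def kalman_step(b_prev, V_prev, y, mu_j, var_j, q_B):
    """Scalar Kalman update conditional on discrete state j,
    transcribing [4.3.1.25]-[4.3.1.31]."""
    y_j = y - mu_j                 # [4.3.1.25] remove the state amplitude
    r_j = var_j                    # [4.3.1.26] state measurement variance
    b_pred = b_prev                # [4.3.1.27] predicted baseline mean
    V_pred = V_prev + q_B          # [4.3.1.28] predicted baseline variance
    K = V_pred / (V_pred + r_j)    # [4.3.1.29] Kalman gain
    b_new = b_pred + K * (y_j - b_pred)   # [4.3.1.30] filtered mean
    V_new = V_pred - K * V_pred           # [4.3.1.31] filtered variance
    return b_new, V_new
```

With an uninformative prior (large V_prev) the filtered baseline moves most of the way to the state-corrected measurement y − μ_j; with a tight prior it barely moves, which is the desired behavior once the baseline is well tracked.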
Naturally, in order to prevent numerical under- or overflow, the logarithm of \delta is used
instead:

\ln \delta_t^j = \max_i \left[ \ln P(x_t^j | x_{t-1}^i) + \ln p(y_t^j | b_t^j) + \ln p(b_t^j | b_{t-1}^i) + \ln \delta_{t-1}^i \right]    [4.3.1.32]

\psi_t^j = \arg\max_i \left[ \ln P(x_t^j | x_{t-1}^i) + \ln p(y_t^j | b_t^j) + \ln p(b_t^j | b_{t-1}^i) + \ln \delta_{t-1}^i \right]    [4.3.1.33]
In addition to \delta and \psi, Viterbi-Kalman stores at any time t, for each discrete state i, the
conditional baseline b_t^i. The optimal mixed state at time T (last data point) is that which
terminates the mixed state sequence with the highest complete data likelihood. From time T, the
most likely sequence of mixed states is reconstructed backwards, down to the first data point,
using the \psi variable:

x_T = \arg\max_j \, (\delta_T^j)    [4.3.1.34]

x_t = \psi_{t+1}^{x_{t+1}}    [4.3.1.35]

b_t = b_t^{x_t}    [4.3.1.36]

V_t = V_t^{x_t}    [4.3.1.37]
The Viterbi-Kalman algorithm can be summarized as follows:

initialize:
    calculate \delta_0^i
for time t = 0 to T-1 do
    for each x_{t+1} = j do
        for each x_t = i do
            calculate (b_{t+1}^j, V_{t+1}^j) with a Kalman filter
            calculate \delta_{t+1}^j
        end i
        choose i that maximizes (\delta_{t+1}^j | \delta_t^i)
        assign \delta_{t+1}^j := \delta_{t+1}^j | \delta_t^i
        assign \psi_{t+1}^j := i
        assign (b_{t+1}^j, V_{t+1}^j) := (b_{t+1}^j, V_{t+1}^j) | (b_t^i, V_t^i)
    end j
end t
j := arg max_j (\delta_T^j)
x_T := j, b_T := b_T^j, V_T := V_T^j
for time t = T-1 downto 0 do
    j := \psi_{t+1}^{x_{t+1}}
    x_t := j, b_t := b_t^j, V_t := V_t^j
end t
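The pseudocode above translates almost line by line into the following minimal NumPy sketch. All names are illustrative, and details such as the t = 0 initialization are my own assumptions; the thesis/QuB implementation may differ:

```python
import math
import numpy as np

def log_gauss(x, mean, var):
    """ln N(mean, var | x)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_kalman(y, A, pi, mu, var, q_B, b0=0.0, V0=1.0):
    """Minimal sketch of the Viterbi-Kalman recursion summarized above.
    y: measurements; A: NS x NS transition matrix; pi: initial state
    probabilities; mu, var: per-state amplitudes and measurement variances;
    q_B: baseline process variance; (b0, V0): baseline prior (assumed)."""
    T, NS = len(y), len(pi)
    logd = np.full((T, NS), -np.inf)    # ln delta
    psi = np.zeros((T, NS), dtype=int)  # backpointers
    b = np.zeros((T, NS))               # conditional baseline means
    V = np.zeros((T, NS))               # conditional baseline variances

    for j in range(NS):                 # t = 0: filter the baseline prior
        yj, Vp = y[0] - mu[j], V0 + q_B
        K = Vp / (Vp + var[j])
        b[0, j] = b0 + K * (yj - b0)
        V[0, j] = Vp - K * Vp
        logd[0, j] = (math.log(pi[j]) + log_gauss(yj, b[0, j], var[j])
                      + log_gauss(b[0, j], b0, Vp))

    for t in range(T - 1):              # forward pass, per [4.3.1.32]-[4.3.1.33]
        for j in range(NS):
            yj = y[t + 1] - mu[j]
            for i in range(NS):
                Vp = V[t, i] + q_B
                K = Vp / (Vp + var[j])
                bf = b[t, i] + K * (yj - b[t, i])
                score = (logd[t, i] + math.log(A[i, j])
                         + log_gauss(yj, bf, var[j])
                         + log_gauss(bf, b[t, i], Vp))
                if score > logd[t + 1, j]:
                    logd[t + 1, j], psi[t + 1, j] = score, i
                    b[t + 1, j], V[t + 1, j] = bf, Vp - K * Vp

    x = np.zeros(T, dtype=int)          # backtrack, per [4.3.1.34]-[4.3.1.37]
    x[T - 1] = int(np.argmax(logd[T - 1]))
    for t in range(T - 2, -1, -1):
        x[t] = psi[t + 1, x[t + 1]]
    return x, b[np.arange(T), x]
```

On a short synthetic trace with amplitudes 0 and 1 and small noise, the recovered discrete path follows the data and the conditional baseline stays near zero, as expected for artifact-free data.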
It must be noted that the most likely mixed state sequence obtained by the Viterbi-Kalman
algorithm is only approximate, in the sense that it is a filtered and not a smoothed state
estimate. Unfortunately, an exact solution is not tractable but, if one is willing to accept higher
computational costs, Viterbi-Kalman can be improved in the sense of a fixed-lag smoother.
4.3.2. Forward-Backward-Kalman algorithm
The Forward-Backward-Kalman (FBK) algorithm operates with a hidden Markov model
(the ion channel) with stochastic, continuously-valued emission parameters (the baseline offset).
The emission parameters are modeled with a linear Gaussian model. FBK uses the
Expectation-Maximization mechanism to perform Markov state inference, followed by maximum likelihood
estimation of the stochastic parameters. It is very similar, in fact, to Baum-Welch: given the
current parameter estimates (baseline offset), the state sequence is inferred, using a modified
Forward-Backward procedure. From the inferred state sequence, the parameters are re-estimated
(new baseline offset), using the Kalman filter-smoother. The iteration is repeated until some
convergence criterion is reached.
I denote by \psi the enlarged parameter set of the hidden Markov model with stochastic
emission parameters. \psi consists of the hidden Markov parameters \theta, complemented by the
baseline b, and by the baseline process variance q_B:

\psi = (\theta, b, q_B)    [4.3.2.1]
The EM Q function can be written in terms of the \psi parameters:

Q(\psi, \psi') = \sum_X \ln p(X, Y | \psi) \cdot p(X | Y, \psi')    [4.3.2.2]
Ideally, all of the parameters that characterize \psi should be estimated, by maximizing the Q
function. Unfortunately, this is not possible, for several reasons. The hidden Markov model is a
rather accurate representation of the ion channel physical process. In contrast, the simple linear
Gaussian model used is only an approximation of the baseline process. The fluctuations are not
entirely stochastic and do not have stationary characteristics throughout the length of the data, which
prevents re-estimation of the baseline process variance q_B.
The Q function is likely to have a surface with many local maxima, since it depends on
an enlarged parameter set \psi, with stochastic components (i.e. b). Thus, if the initial baseline is
not accurate enough, the algorithm is likely to get trapped in a local, erroneous maximum. This
makes the initialization of the baseline sequence a critical issue.
Preliminary tests have indicated that sequential parameter re-estimation gives better
results. Thus, for a given \theta, a maximum likelihood estimate of b is obtained. Then, for the b
estimate, a maximum likelihood estimate of \theta is obtained. The two steps are iterated until
convergence is reached. q_B is not re-estimated. Each step is an EM algorithm in its own right,
which estimates only half of the parameter set \psi, conditional on the other half. The procedure
can be described in pseudo-code as follows:
initialize:
    select b_0 and \theta_0
repeat
    b-step: estimate b_{i+1} given \theta_i
    \theta-step: estimate \theta_{i+1} given b_{i+1}
until convergence test
This sequential parameter re-estimation does not necessarily guarantee the same convergence
properties of EM. Despite this departure from the theoretical EM framework, practical application
of the algorithm indicates good behavior.
The Forward-Backward-Kalman algorithm corresponds to the b-step in the above
scheme. If the estimated baseline sequence b is subtracted from the data sequence Y, the \theta-step
can be performed with Baum-Welch, and it is not discussed here. FBK can be succinctly
described as follows:
initialize:
    select an initial baseline sequence b_0
repeat
    expectation: calculate Q(b_{i+1}, b_i) with modified Forward-Backward
    maximization: re-estimate b_{i+1} with modified Kalman filter-smoother
until convergence test
The expectation step is performed with a modified Forward-Backward procedure,
conditional on the current baseline and \theta parameters. The sufficient statistics are calculated
(the \alpha's, \beta's and \gamma's). The maximization step, performed by a modified Kalman
filter-smoother, uses the sufficient statistics to calculate a maximum likelihood parameter sequence,
i.e. a new baseline. The \theta parameters of the hidden Markov model (initial state probabilities,
transition and emission probabilities) are not changed during the iteration. Since convergence of
the algorithm is not certain, the convergence test could be, for example, the reaching of a
maximum number of iterations. In the following sections, I explain each step in detail.
4.3.2.1 Initialization
Initialization of the baseline is important for avoiding local maxima. I propose here an
initialization method based on the Expectation-Maximization framework. For each time t,
expectation and maximization steps are iterated until convergence. The expectation step infers the
current Markov state from the previous Markov state (which plays the role of initial state
probability vector), conditional on the current baseline estimate. The maximization step calculates
a maximum a posteriori (MAP) estimate of the baseline offset, conditional on the previous
baseline offset and on the current inferred Markov state. The MAP baseline estimate is obtained
with a Kalman filter. The estimate is maximum a posteriori and not maximum likelihood because
its computation involves the previous baseline offset, which is regarded as a parameter prior. This
MAP EM procedure is applied sequentially to each data point, in the forward direction.
Suppose that at time t, the Markov state probability vector is Xt, and the baseline offset is
a Gaussian probability density, with mean bt and variance Vt. At time t+1, the Markov state and
the baseline offset can be predicted independently, as follows:
predicted Markov state:

\tilde{X}_{t+1}^T = X_t^T \cdot A    [4.3.2.1.1]

predicted baseline offset:

\tilde{b}_{t+1} \sim N(b_t, V_t + q_B)    [4.3.2.1.2]
To simplify notation, I write x_t = i as x_t^i, for any t. The Forward-Backward procedure, as
applied to one data point only, calculates the values of \alpha and \gamma from the initial state
probability vector, the current baseline offset and the measurement. For any iteration, the initial
state probabilities are equal to the predicted state probabilities in equation [4.3.2.1.1]. The current
baseline offset is initialized for the first iteration as the predicted baseline in equation [4.3.2.1.2],
and for any other iteration is equal to the MAP estimate from the previous iteration. At time t+1,
these quantities are calculated as follows:

\alpha_{t+1}^i = \tilde{x}_{t+1}^i \cdot p(y_{t+1} | x_{t+1}^i, b_{t+1})    [4.3.2.1.3]

\gamma_{t+1}^i = \alpha_{t+1}^i / \sum_{i=1}^{N_S} \alpha_{t+1}^i    [4.3.2.1.4]

p(y_{t+1} | x_{t+1}^i, b_{t+1}) = N(\mu_i + b_{t+1}, \sigma_i^2 \,|\, y_{t+1})    [4.3.2.1.5]
The MAP baseline at t+1 can be obtained from the Q function and the baseline prior, as
follows:

b_{t+1}^{MAP} = \arg\max_{b_{t+1}} \left[ Q(b_{t+1}, b_{t+1}') + \ln p(b_{t+1} | b_t) \right]    [4.3.2.1.6]

Since only the measurement part of the Q function depends on b_{t+1}, the MAP baseline is:

b_{t+1}^{MAP} = \arg\max_{b_{t+1}} \left[ \sum_i \ln p(y_{t+1} | x_{t+1}^i, b_{t+1}) \cdot p(x_{t+1}^i | y_{t+1}, b_{t+1}') + \ln p(b_{t+1} | b_t) \right]    [4.3.2.1.7]
The solution to the MAP estimation becomes more obvious if the problem is reformulated.
Thus, b_{t+1} can be interpreted as the state of a linear Gaussian model, with process and
measurement equations:

b_{t+1} = b_t + \epsilon_t, \quad \epsilon_t \sim N(0, q_B)    [4.3.2.1.8]

y_{t+1} = b_{t+1} + \eta_{t+1}, \quad \eta_{t+1} \sim N\left( \sum_{i=1}^{N_S} \gamma_{t+1}^i \mu_i, \; \sum_{i=1}^{N_S} \gamma_{t+1}^i \sigma_i^2 \right)    [4.3.2.1.9]
In other words, the measurement Gaussian random variable is a weighted sum of N_S Gaussian
random variables, each with mean \mu_i, variance \sigma_i^2 and weighting factor \gamma_{t+1}^i.
Each of these random variables corresponds to one of the Markov states.
The MAP baseline estimate can be obtained with a Kalman filter, after a simple
modification by which the mean of the measurement probability density is subtracted
from the measurement y_{t+1}. Hence, the mean of the measurement density becomes again zero,
whereas the variance of the measurement density, r_{t+1}, is a weighted sum of the Markov
measurement variances at time t+1:

yt 1 :  yt 1 |  t 1   yt 1    ti   i

[4.3.2.1.10]
i




p yt 1 | bt 1   N  0,   ti1   i2 
 i

[4.3.2.1.11]
NS
rt 1   ti1   i2
[4.3.2.1.12]
i 1
For each time t+1, the baseline initialization method is summarized as:

initialize:
    assign b^0_{t+1} := b_t
repeat
    expectation: estimate \gamma^{i+1}_{t+1} with Forward-Backward given b^i_{t+1} and \gamma_t
    maximization: estimate b^{i+1}_{t+1} with Kalman filter given \gamma^{i+1}_{t+1} and b_t
until convergence test

\gamma^{i+1}_{t+1} is the \gamma quantity at time t+1, calculated in iteration i+1. For the calculation of \gamma^{i+1}_{t+1},
the initial state probability vector is equal to \pi, for t = 0, and to \gamma_t, for t > 0. b^{i+1}_{t+1} is the baseline at
time t+1, calculated in iteration i+1. \gamma_t is the \gamma quantity at time t, after convergence. b_t is the
baseline at time t, after convergence. The procedure is simple and has shown relatively fast
convergence in practical applications.
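For a single data point, the alternation of the state inference [4.3.2.1.3]-[4.3.2.1.5] with the MAP Kalman update [4.3.2.1.10]-[4.3.2.1.12] can be sketched as below. The function name and the fixed iteration count are my own; a real implementation would use a proper convergence test:

```python
import numpy as np

def init_baseline_point(y, X_pred, b_prev, V_prev, mu, var, q_B, n_iter=20):
    """Per-point MAP EM sketch for baseline initialization (section 4.3.2.1).
    X_pred: predicted state probabilities [4.3.2.1.1]; (b_prev, V_prev):
    converged baseline at time t; mu, var: state amplitudes/variances."""
    b = b_prev                            # predicted baseline [4.3.2.1.2]
    Vp = V_prev + q_B
    for _ in range(n_iter):
        # Expectation: gamma_i proportional to X_pred_i * N(mu_i + b, var_i | y)
        lik = np.exp(-0.5 * (y - mu - b) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma = X_pred * lik
        gamma /= gamma.sum()
        # Maximization: MAP Kalman update with mixed mean and variance
        y_res = y - gamma @ mu            # [4.3.2.1.10]
        r = gamma @ var                   # [4.3.2.1.12]
        K = Vp / (Vp + r)
        b = b_prev + K * (y_res - b_prev)
    return gamma, b, Vp - K * Vp
```

With two states of amplitudes 0 and 1, a measurement near 1 and a previous baseline near 0, the iteration settles on the open state with a baseline estimate near zero, as intended.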
4.3.2.2 Expectation
The expectation step of the Forward-Backward-Kalman algorithm calculates the
sufficient statistics using the standard Forward-Backward procedure, with a few preparations:
y_t := (y_t | b_t) = y_t - b_t    [4.3.2.2.1]

\sigma_i^2 := (\sigma_i^2 | V_t) = \sigma_i^2 + V_t    [4.3.2.2.2]
In effect, the baseline sequence b is subtracted from the observation sequence Y, and the variance
of the emission probabilities of each state is increased to account for the variance in the estimated
baseline. The latter is usually very small compared to the state-conditional measurement variance
and can be ignored, except maybe at both ends of the data sequence and possibly in areas with
unusual baseline changes.
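The two preparations can be written directly. The sketch below uses my own array conventions (y and V of length T, var of length N_S), which are not from the thesis:

```python
import numpy as np

def prepare_expectation(y, b, V, var):
    """Preparations for the expectation step, per [4.3.2.2.1]-[4.3.2.2.2]:
    subtract the current baseline from the data and inflate each state's
    emission variance by the baseline uncertainty at that point."""
    y_corr = y - b                        # [4.3.2.2.1]
    var_infl = var[None, :] + V[:, None]  # [4.3.2.2.2], shape (T, NS)
    return y_corr, var_infl
```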
4.3.2.3 Maximization
The maximization step estimates the most likely baseline sequence b^{ML} given the
relevant sufficient statistics from the expectation step, namely the \gamma's, using a standard Kalman
filter-smoother. As in the initialization part described above, a few preparations are necessary:

y_t := (y_t | \gamma_t) = y_t - \sum_i \gamma_t^i \mu_i    [4.3.2.3.1]

r_t := (r_t | \gamma_t) = \sum_i \gamma_t^i \sigma_i^2    [4.3.2.3.2]
In effect, the Markov signal is subtracted from the observation sequence, and the remainder is
smoothed with the Kalman filter-smoother to determine the most likely baseline offset sequence.
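Using [4.3.2.3.1]-[4.3.2.3.2] and a scalar Kalman filter followed by a Rauch-Tung-Striebel backward pass, the maximization step can be sketched as below. The prior (b0, V0) and all names are illustrative assumptions:

```python
import numpy as np

def fbk_maximization(y, gamma, mu, var, q_B, b0=0.0, V0=1.0):
    """Maximization step sketch: subtract the expected Markov signal per
    [4.3.2.3.1]-[4.3.2.3.2], then smooth the residual with a scalar Kalman
    filter-smoother (forward filter + RTS backward pass).
    gamma: T x NS state posteriors; mu, var: state amplitudes/variances."""
    T = len(y)
    y_res = y - gamma @ mu     # [4.3.2.3.1] remove the Markov signal
    r = gamma @ var            # [4.3.2.3.2] mixed measurement variance

    bp, Vp = np.zeros(T), np.zeros(T)   # predicted mean/variance
    bf, Vf = np.zeros(T), np.zeros(T)   # filtered mean/variance
    b_prev, V_prev = b0, V0
    for t in range(T):                  # forward Kalman filter
        bp[t], Vp[t] = b_prev, V_prev + q_B
        K = Vp[t] / (Vp[t] + r[t])
        bf[t] = bp[t] + K * (y_res[t] - bp[t])
        Vf[t] = Vp[t] - K * Vp[t]
        b_prev, V_prev = bf[t], Vf[t]

    b_s = bf.copy()                     # RTS backward smoothing pass
    for t in range(T - 2, -1, -1):
        C = Vf[t] / Vp[t + 1]
        b_s[t] = bf[t] + C * (b_s[t + 1] - bp[t + 1])
    return b_s
```

On data with a constant offset and a single zero-amplitude state, the smoothed baseline converges to that offset, which is the expected behavior of this step.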
4.4 Results
In this section I test both algorithms: Viterbi-Kalman and Forward-Backward-Kalman.
Both have only one adjustable parameter: the variance qB of the baseline process. I refer to qB as
SimBaseVar in the context of simulating baseline fluctuations, and as BaseVar in the context of
baseline correction.
I have simulated data with different signal-to-noise ratios, with and without baseline
artifacts, and have compared the idealization and the conductance parameters produced by the
algorithm, with the true idealization and true conductance parameters. Thus, I was able to
evaluate the performance of the idealization/baseline correction algorithm in terms of state
sequence inference and conductance parameters estimation, and to assess the effects that the
method has on the estimated kinetic parameters.
Rigorously, the signal-to-noise ratio (SNR) is defined as the ratio between the power of
the signal and the power of the noise. As the true SNR is difficult to calculate, I prefer to use a
more operational definition, approximate but intuitive. Thus, I define the signal-to-noise ratio of
the ion channel signal as:
SNR = \frac{Amp_{Open} - Amp_{Close}}{MeasStd_{Open}}    [4.4.1]
I chose the measurement variance corresponding to the open state since it is usually the
highest, due to state excess noise, and gives the SNR of the worst case scenario. Obviously,
defined as such, the SNR does not depend on the ion-channel kinetics, which is a crude
approximation. Also, this SNR is proportional to the square root of the true SNR (since the power
is proportional to the variance).
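The operational definition [4.4.1] is a one-liner; the function name is mine:

```python
def snr(amp_open, amp_close, std_open):
    """Operational SNR per [4.4.1]: channel amplitude divided by the
    open-state measurement standard deviation (the worst-case noise)."""
    return (amp_open - amp_close) / std_open
```

For example, snr(1.0, 0.0, 0.2) gives 5, the threshold that separates the Viterbi-based and Forward-Backward-based regimes identified later in this chapter.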
I varied the SNR of the simulated data between 10 and 0.5, with 10 corresponding to the
easiest and 0.5 to the most difficult case. I expected a low SNR to result in mistakes in the
idealized trace, leading to errors in the estimation of both conductance and kinetics parameters. I
did not go below SNR 0.5, for the practical reason that data with such low SNR will most likely
not be identified as containing any signal at all.
I have added to the simulated data two kinds of baseline artifacts, separately. The first
kind of artifact consists of fluctuations generated by a simple linear Gaussian model with no
measurement noise, with process equation:
b_t = b_{t-1} + \epsilon_B, \quad \epsilon_B \sim N(0, q_B)    [4.4.2]
The varied parameter is the variance of the process, qB, which I will refer to as
SimBaseVar. A zero value of SimBaseVar corresponds to no baseline fluctuations. A larger
value results in fluctuations of higher amplitude. The same seed of the random number generator
was used to generate the baseline in all experiments, so the shape but not the amplitude of the
fluctuations was identical, allowing for direct comparisons.
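The random-walk artifact of [4.4.2] is straightforward to simulate. Fixing the generator seed reproduces the same fluctuation shape at any SimBaseVar, as done in the experiments; the function name and NumPy generator choice are mine:

```python
import numpy as np

def simulate_baseline(T, sim_base_var, seed=0):
    """Random-walk baseline per [4.4.2]: b_t = b_{t-1} + eps_B,
    eps_B ~ N(0, SimBaseVar). Same seed => same shape, scaled amplitude."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(0.0, np.sqrt(sim_base_var), T)
    return np.cumsum(steps)
```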
The second kind of fluctuation is a deterministic, periodic noise, in the form of a
sinewave with given frequency and amplitude, which I will refer to as SineFreq and SineAmp,
respectively. I used three values for SineAmp: 0 (corresponding to no sinewave interference), 0.5
(within the state amplitudes) and 1. For SineFreq I chose two values: 60 Hz and 600 Hz. The 60
Hz value corresponds to interference caused by power lines. The sinewave interacts with the
sampling frequency, which was 100 kHz in all experiments. Since not all patch-clamp data is
sampled at 100 kHz, but possibly lower, I chose the 600 Hz frequency to mimic the situation of
identical in all experiments. The maximum slope of the 600 Hz sinusoid is ten times larger than
that of the 60 Hz sinusoid. Intuitively, it is easy to see how contamination with a lower frequency,
relative to the sampling frequency, will be easier to correct. It must be noted that any kind of
added baseline artifact leads to a decrease in the effective SNR.
The amplitudes of the close and the open conductance classes, which I refer to as Amp0
and Amp1, were in all experiments 0 and 1, respectively. For convenience, I used identical state
variances for the two conductance classes, which I refer to as MeasVar to denote the simulation
parameter and Std0 and Std1 to denote an estimated parameter. I varied MeasVar of the
simulated data in order to change SNR.
The seed of the random number generator used by the simulator was the same in each
experiment, so the generated state sequence was the same, irrespective of the parameters used in
simulation and independent of the baseline artifacts. Direct comparisons of the tested algorithms
in terms of state inference, conductance and kinetics parameters estimation are thus possible. The
kinetics was estimated with three maximum likelihood methods: MIL (Qin et al, 1996, 1997),
MPL (Qin et al, 2000a) and MIP (presented in Chapter 3).
Experience shows that idealizing data with Viterbi is feasible only for a relatively good
SNR. For low SNR, not only is idealization with Viterbi less reliable, but estimating kinetics
from such idealized data (with MIL or MIP) results in gross errors. Forward-Backward and MPL
are required in these situations. Therefore, my first goal was to find the working SNR range for
Viterbi and for Forward-Backward. Thus, the most efficient kinetic estimation method is applied:
at high SNR, MIL or MIP runs fast with idealized data (obtained by either idealization
algorithm); at low SNR, MPL runs more slowly using raw data (but baseline corrected).
All data was simulated with the model in Figure [4.4.1] and the parameters in Table
[4.4.1].
Figure [4.4.1]. Model used for simulation: a linear three-state chain, 1 - 2 - 3, with rate constants
100.0 and 50.0 between states 1 and 2, and 5000.0 and 2000.0 between states 2 and 3.
PtsPerSeg    SegCnt    Sampling [kHz]    Amp1    Amp2
1000000      1         100               0.0     1.0

Table [4.4.1]. Simulation parameters
PtsPerSeg and SegCnt denote, respectively, the number of points in a data set, and the number of
data sets analyzed together.
I chose the simulation model so as to generate both fast and slow kinetics in the same
data set. Thus the need for separately testing slow and fast kinetics is eliminated. The simulated
data presents clusters of activity, separated by longer closed intervals.
The models used in idealization and kinetics estimation are shown in Figure [4.4.2]. I had
to use the asymmetric model (Figure [4.4.2], B) because, with symmetry, the Baum-Welch
re-estimation could not distinguish between the two close states, getting stuck in a suboptimal
solution. Although Viterbi assigns equal initial probability to the two close states, only one is
chosen as the starting state. In contrast, Forward-Backward does not choose one single state for
the initial point, and so the re-estimated kinetics remains symmetric and suboptimal, at any
iteration. Instead of starting with a (slightly) asymmetric model, I could have used unequal
starting probabilities, with the same effect. In fact, it is generally good practice to start with
asymmetric models, therefore avoiding other unpleasant side-effects, such as multiple (repeated,
confluent) eigenvalues of the Q matrix.
The parameters for the MIP, MIL and MPL algorithms are given in Tables [4.4.2], [4.4.3] and
[4.4.4], respectively. Td is the dead-time correction in MIL. The other parameters are
self-explanatory.
Figure [4.4.2]. Models used in idealization and kinetics estimation. A: linear three-state chain
1 - 2 - 3 with all four rate constants equal to 100.0, used for idealization and re-estimation with
Segmental k-means and for kinetics estimation. B: linear three-state chain with rate constants
100.0/100.0 between states 1 and 2 and 200.0/200.0 between states 2 and 3, used for idealization
and re-estimation with Baum-Welch. C: simplified two-state model with rate constants
1000.0/1000.0, used for idealization.
MaxStep    MaxIter    LLConv    GradConv
0.2        50         1e-8      1e-8

Table [4.4.2]. MIP parameters

MaxStep    MaxIter    Td [ms]    LLConv    GradConv
0.5        100        0.015      1e-4      1e-4

Table [4.4.3]. MIL parameters

MaxStep    MaxIter    LLConv    GradConv
0.5        100        1e-4      1e-4

Table [4.4.4]. MPL parameters
4.4.1 Idealization of simulated data without baseline artifacts
In this section I evaluate the performance of signal detection and parameter re-estimation
algorithms, with regard to the signal-to-noise ratio. The empirically established operational range
of each method indicates the conditions under which to apply it further to baseline correction. The
signal detection algorithms are Viterbi, Forward-Viterbi and Forward-Backward. The re-estimation algorithms are Segmental k-means and Baum-Welch.
For testing signal detection, I use the same model as for simulation, starting with the true
kinetics and conductance parameters (Figure [4.4.2], A or B), or a simplified model (Figure
[4.4.2], C), with the true conductance parameters. The same strategy is applied to re-estimation.
The signal detection performance is quantified by the deviation of the rate constants
estimated from the resulting idealization with MIL and MIP algorithms, from the true rate
constants. Since the amount of simulated data is finite and the rate estimation methods are not the
subject of testing, the true rate constants are defined as those estimated from the simulated
idealization. These rates will be slightly different from the mean rates of the simulation model
due to the stochastic nature of the Markov signal. The re-estimation performance is assessed in
the same way, and additionally, from the deviation of the estimated conductance parameters from
the true values.
The results obtained with the different signal detection algorithms are presented in
Figures [4.4.1.2] and [4.4.1.3]. With the true model, Viterbi gives the worst results, producing a
very good idealization at SNR ≥ 5 and acceptable results down to SNR 2. Below 2, Viterbi is
unusable. Forward-Viterbi is somewhat better but still not usable at or below SNR 1. In contrast,
Forward-Backward gives acceptable idealization even at SNR 0.5. With a simplified C-O model, both
Viterbi and Forward-Backward (with a two-state model, Forward-Viterbi reduces to Viterbi) start
losing accuracy when the contribution of the kinetics to the likelihood function becomes
important, below SNR 5.
The results of the re-estimation algorithms are presented in Figures [4.4.1.4] and
[4.4.1.5]. The re-estimation of conductance parameters with Baum-Welch has an error one to
two orders of magnitude smaller than with Segmental k-means. The idealization performance is
qualitatively similar to that of the signal detection procedure alone, but worse, since both Viterbi
and Forward-Backward have to work with more or less inaccurate parameters. In both cases MPL
performs better than either MIL or MIP at lower SNR, when the complete data likelihood of one
sequence of classes is no longer equivalent to the incomplete data likelihood of all possible
sequences. MIL performs slightly better than MIP due to the missed event correction capability.
I conclude that either Viterbi alone or Segmental k-means should be limited to SNR ≥ 5,
when they are more efficient in terms of computational speed, memory requirements and
convergence rate (results not shown). For lower signal to noise ratios, or when high accuracy of
conductance parameters is required, either Forward-Backward alone or Baum-Welch should be
employed, at the cost of lower efficiency. MIL can be safely used down to SNR 1 if the
idealization is obtained with Baum-Welch. Below SNR 1 only MPL works. At SNR ≥ 5, a
simplified model is enough for a quality idealization but a more accurate model should be used
beyond that, especially when the eigenvalues of the generating model span many orders of
magnitude. Based on these conclusions, I limit application of Viterbi-Kalman to baseline
correction and idealization of data with SNR ≥ 5, and of Forward-Backward-Kalman to data with
SNR < 5.
Figure [4.4.1.1]. Simulated data without baseline artifacts, at SNR 10, 5, 2 and 1, idealized with
Segmental k-means and Baum-Welch. At SNR 1, the signal is almost buried in noise. SKM makes
considerably more errors than BW.
Figure [4.4.1.2]. Kinetics estimation from simulated data at different SNR, without baseline artifacts.
Data idealized with Viterbi, Forward-Viterbi or Forward-Backward, using the true model. The
kinetics were estimated from the idealization with MIL or MIP.
115
[Figure [4.4.1.3]: plots not reproduced; panels labeled Viterbi and Forward-Backward, and MIL and MIP; each plots the estimated rate constants K12, K21, K23 and K32 together with the true rates (log scale, 10 to 10000) against SNR (10, 5, 2, 1, 0.5).]
Figure [4.4.1.3]. Kinetics estimation from simulated data at different SNR, without baseline artifacts. Data idealized with Viterbi or Forward-Backward, with a simplified C-O model. The kinetics were estimated from the idealization with MIL or MIP.
[Figure [4.4.1.4]: plots not reproduced; panels A (Segmental k-means) and B (Baum-Welch) show the relative error (Meas-True)/True of Amp0, Std0, Amp1 and Std1 against SNR (10, 5, 2, 1, 0.5).]
Figure [4.4.1.4]. Re-estimation of conductance parameters, from simulated data at different SNR, without baseline artifacts. Data idealized with Segmental k-means or Baum-Welch, using the true model. Notice the different scale of the y axis in A and B. Conductance parameter estimation is much more accurate with BW than with SKM.
[Figure [4.4.1.5]: plots not reproduced; panels labeled MIL, MPL and MIP, and Segmental k-means and Baum-Welch; each plots the estimated rate constants K12, K21, K23 and K32 together with the true rates (log scale, 10 to 10000) against SNR (10, 5, 2, 1, 0.5).]
Figure [4.4.1.5]. Kinetics estimation from simulated data at different SNR, without baseline artifacts. Data idealized with Segmental k-means or Baum-Welch, using the true model. The kinetics were estimated from the idealized data with MIL or MIP, or from the raw data with MPL (using the re-estimated conductance parameters). Kinetics estimation from data idealized with BW is much more accurate than with SKM.
4.4.2 Idealization and baseline correction of simulated data with stochastic baseline
fluctuations
The amplitude of the baseline fluctuations is determined by the process variance qB. I
have simulated data with increasing fluctuation amplitude, with SimBaseStd 0, 0.005, 0.01,
0.02, 0.05 and 0.1 (SimBaseStd is the square root of SimBaseVar). A SimBaseStd of 0.05 or
0.1 results in extreme baseline artifacts, more severe than any single channel data is likely to
suffer. The simulated data with SimBaseStd 0.005 and 0.01 resembles real data, with
enough baseline fluctuations to render the existing analysis algorithms unusable. Obviously, these
numbers are only meaningful relative to the actual difference in conductance class
amplitudes (in this case 1 pA).
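As a concrete illustration of these quantities, data of this kind can be generated in a few lines. This is a minimal sketch, not the actual simulation code: for simplicity the class sequence is drawn independently at each sample rather than from a Markov model, and all names are illustrative. The baseline is the random walk b[t] = b[t-1] + w[t], with w ~ N(0, qB):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trace(n, amp=1.0, snr=5.0, sim_base_std=0.01):
    """Two-class current trace plus random-walk baseline (sketch).

    amp is the open-class amplitude (1 pA in the text); the state noise
    std is amp / snr; the baseline obeys b[t] = b[t-1] + w[t] with
    w ~ N(0, SimBaseVar), where sim_base_std = sqrt(SimBaseVar) = sqrt(qB).
    """
    classes = rng.integers(0, 2, size=n)   # toy class sequence, no kinetics
    baseline = np.cumsum(rng.normal(0.0, sim_base_std, size=n))
    noise = rng.normal(0.0, amp / snr, size=n)
    return classes * amp + baseline + noise

trace = simulate_trace(10000, sim_base_std=0.05)
```

After n samples the random walk has drift standard deviation sim_base_std × sqrt(n), so SimBaseStd 0.05 drifts by about 5 pA over 10000 points, which matches the "extreme" label above.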
Figures [4.4.2.1] through [4.4.2.6] show the performance of the baseline correction in the
time domain. A low and a high resolution view are presented, selected arbitrarily from the data
file (the same selection for all figures). In each figure, view A shows the data with baseline
fluctuations and view B the data without baseline fluctuations (the so-called true data). The other
views (C, D etc.) show the baseline corrected data, obtained with different methods and possibly
different values of BaseStd (the square root of BaseVar). In the high resolution view, the detected
signal is overlaid on the true and corrected data. Unless otherwise specified, the BaseStd
parameter for correction was the same as the SimBaseStd parameter for simulation.
Viterbi-Kalman is used only to correct data with SNR 5 (Figures [4.4.2.1] through
[4.4.2.4]). Forward-Backward-Kalman is used to correct data with SNR 5 and SimBaseStd 0.01
and below (Figures [4.4.2.2] through [4.4.2.6]). At SNR 5, VK makes practically no visible
mistakes up to SimBaseStd 0.05, and fails almost completely at SimBaseStd 0.1. FBK starts
making visible mistakes at SimBaseStd 0.05, by confounding the open class with the closed class
and vice versa. However, data simulated with SimBaseStd above 0.05 is unlikely to match any
real data, and if it does, hopefully it lasts only for short segments.
Below SNR 5, the correction is more difficult to inspect, as the signal gets buried in
noise. The idealized data in the high resolution view indicates that FBK does a fine job of baseline
correcting and idealizing the data. Relative to the Forward-Backward idealization of the true data, the
FBK idealization is very good up to SimBaseStd 0.05 at SNR 2 and SimBaseStd 0.02 at SNR 1.
Some false positives and false negatives are nevertheless present, with many more at higher
values of SimBaseStd.
The next set of figures, [4.4.2.7] through [4.4.2.10], shows the performance of the
baseline correction in the frequency domain. For data at SNR 5, only the spectra of Viterbi-Kalman
corrected data are shown, in Figure [4.4.2.7]. For data at SNR 2 and 1, the spectra of
Forward-Backward-Kalman corrected data are shown in Figures [4.4.2.8] and [4.4.2.9],
respectively. Not surprisingly, the spectral results match their time domain counterparts. The
spectra of baseline corrected data are almost identical to the true data spectra, except at a high
value of SimBaseStd (0.05). Figure [4.4.2.10] presents the effect of BaseStd in the frequency
domain. Simulated data at SNR 5, with SimBaseStd 0.02, is baseline corrected with Viterbi-Kalman
and Forward-Backward-Kalman, using different values for the BaseStd parameter. VK
appears to be more sensitive, whereas FBK produces almost identical spectra. It also seems that
overestimation of BaseStd is worse than underestimation. More on this issue follows.
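Spectra of this kind can be computed from any corrected trace with a standard periodogram. A minimal sketch (my own illustration, not the code used for the figures; no windowing or segment averaging, and an even number of samples is assumed):

```python
import numpy as np

def power_spectrum(x, fs):
    """One-sided periodogram, in pA^2/Hz for a trace in pA (sketch).

    No windowing or averaging; assumes len(x) is even so the last
    rfft bin is the Nyquist frequency.
    """
    n = len(x)
    spec = np.fft.rfft(x)
    psd = (np.abs(spec) ** 2) / (fs * n)
    psd[1:-1] *= 2.0                        # fold in negative frequencies
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd

# Example: a 50 Hz sine sampled at 1 kHz gives a single peak at 50 Hz.
t = np.arange(1000) / 1000.0
freqs, psd = power_spectrum(np.sin(2 * np.pi * 50 * t), fs=1000.0)
```

Plotting log10 of the power against log10 of the frequency reproduces the axes used in the spectral figures.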
The next two figures, [4.4.2.11] and [4.4.2.12], show the re-estimation of conductance
parameters from baseline corrected data. Figure [4.4.2.11] shows that the re-estimation error is
approximately equal for Viterbi-Kalman and Forward-Backward-Kalman. Considering that
Baum-Welch has much smaller errors than Segmental k-means when used on data without
baseline artifacts, it can be inferred that the error is mostly due to the baseline fluctuations. It is
immediately apparent that, within the operational range of each method, the conductance
parameters are underestimated. The elimination of baseline fluctuations seems to have an effect
similar to that of a low-pass filter: the amplitude of the open class is lowered, and the state
conditional noise is reduced. Outside the operational range (at high SimBaseStd), the amplitude
of the open class becomes unpredictable, while the state noise first drops and then rises. The
overestimation of noise is explained by classification errors, as misclassified points result in
increased variance. Figure [4.4.2.12] shows the effect of BaseStd on conductance parameters.
Data simulated at SNR 5, with baseline fluctuations of SimBaseStd 0.02, is baseline corrected
and idealized with VK and FBK, with different values for BaseStd. Both lower and higher
values result in a very slight underestimation of the amplitude. Lower values give about 5%
overestimation of the state noise, while higher values result in about 10% underestimation.
The last set of figures, [4.4.2.13] through [4.4.2.15], shows the estimation of kinetic
parameters from baseline corrected data. The four rate constants are plotted as a function of
SimBaseStd in Figures [4.4.2.13] and [4.4.2.14], and as a function of BaseStd in [4.4.2.15]. Both
the estimated and the true rate constants are shown, in a logarithmic plot. The true rates are
actually the rates estimated from data without baseline artifacts, using the corresponding
idealization (Segmental k-means or Baum-Welch) and rate estimation (MIL, MPL or MIP)
algorithm. A first conclusion is that MPL is more accurate than MIL or MIP, irrespective of
the baseline correction method, SNR or amount of baseline fluctuations. MIL is only slightly
better than MIP, and mostly when associated with Viterbi-Kalman. With FBK-corrected data,
relatively accurate rates can be estimated from data with SimBaseStd as high as 0.05 for SNR 2
and 0.02 for SNR 1. At SNR 5, both VK and FBK are good up to SimBaseStd 0.05.
[Figure [4.4.2.1]: traces not reproduced; low (50 ms) and high (2 ms) resolution views with pA scale bars.]
Figure [4.4.2.1]. Simulated data at SNR 5 with stochastic baseline artifacts at two different time resolutions. A Simulated data with SimBaseStd 0.01. B True data with true idealization. C Data in A idealized and baseline corrected with Viterbi-Kalman.
[Figure [4.4.2.2]: traces not reproduced; low (50 ms) and high (2 ms) resolution views with pA scale bars.]
Figure [4.4.2.2]. Simulated data at SNR 5 with stochastic baseline artifacts at two different time resolutions. A Simulated data with SimBaseStd 0.02. B True data with true idealization. C Data in A idealized and baseline corrected with Viterbi-Kalman. D Data in A idealized and baseline corrected with Forward-Backward-Kalman.
[Figure [4.4.2.3]: traces not reproduced; low (50 ms) and high (2 ms) resolution views with pA scale bars.]
Figure [4.4.2.3]. Simulated data at SNR 5 with stochastic baseline artifacts at two different time resolutions. A Simulated data with SimBaseStd 0.05. B True data with true idealization. C Data in A idealized and baseline corrected with Viterbi-Kalman, with BaseStd 0.05. D Data in A idealized and baseline corrected with Forward-Backward-Kalman, with BaseStd 0.05. E Data in A, with Forward-Backward-Kalman, with BaseStd 0.02. F, G Example of FBK thrown off by a spike in the data. This could be avoided by slight filtering.
[Figure [4.4.2.4]: traces not reproduced; low (50 ms) and high (2 ms) resolution views with pA scale bars.]
Figure [4.4.2.4]. Simulated data at SNR 5 with stochastic baseline artifacts at two different time resolutions. A Simulated data with SimBaseStd 0.1. B True data with true idealization. C Data in A idealized and baseline corrected with Viterbi-Kalman, with BaseStd 0.05. D Data in A idealized and baseline corrected with Forward-Backward-Kalman, with BaseStd 0.05.
[Figure [4.4.2.5]: traces not reproduced; low (20 ms) and high (2 ms) resolution views with pA scale bars.]
Figure [4.4.2.5]. Simulated data at SNR 2 with stochastic baseline artifacts at two different time resolutions. A Simulated data with SimBaseStd 0.01. B SBS 0.02. C SBS 0.05. Notice the change in amplitude scales. D True data, idealized with FB. E Data in A idealized and baseline corrected with FBK. F Data in B, with FBK. G Data in C, with FBK.
[Figure [4.4.2.6]: traces not reproduced; low (20 ms) and high (2 ms) resolution views with pA scale bars.]
Figure [4.4.2.6]. Simulated data at SNR 1 with stochastic baseline artifacts at two different time resolutions. A Simulated data with SimBaseStd 0.01. B SBS 0.02. C SBS 0.05. Notice the change in amplitude scales. D True data, idealized with FB. E Data in A idealized and baseline corrected with FBK. F Data in B, with FBK. G Data in C, with FBK.
[Figure [4.4.2.7]: spectra not reproduced; panels A-H plot Log10 power (pA^2/Hz) against log10 frequency (Hz).]
Figure [4.4.2.7]. Power spectra of simulated data at SNR 5 with stochastic baseline fluctuations, baseline corrected with Viterbi-Kalman. A Simulated data with SimBaseStd 0.05. B SimBaseStd 0.02. C SimBaseStd 0.01. D, H True data. E Data in A baseline corrected. F Data in B baseline corrected. G Data in C baseline corrected.
[Figure [4.4.2.8]: spectra not reproduced; panels A-H plot Log10 power (pA^2/Hz) against log10 frequency (Hz).]
Figure [4.4.2.8]. Power spectra of simulated data at SNR 2 with stochastic baseline fluctuations, baseline corrected with Forward-Backward-Kalman. A Simulated data with SimBaseStd 0.05. B SimBaseStd 0.02. C SimBaseStd 0.01. D, H True data. E Data in A baseline corrected. F Data in B baseline corrected. G Data in C baseline corrected.
[Figure [4.4.2.9]: spectra not reproduced; panels A-H plot Log10 power (pA^2/Hz) against log10 frequency (Hz).]
Figure [4.4.2.9]. Power spectra of simulated data at SNR 1 with stochastic baseline fluctuations, baseline corrected with Forward-Backward-Kalman. A Simulated data with SimBaseStd 0.05. B SimBaseStd 0.02. C SimBaseStd 0.01. D, H True data. E Data in A baseline corrected. F Data in B baseline corrected. G Data in C baseline corrected.
[Figure [4.4.2.10]: spectra not reproduced; panels A-H plot Log10 power (pA^2/Hz) against log10 frequency (Hz).]
Figure [4.4.2.10]. Power spectra of simulated data at SNR 5 with stochastic baseline fluctuations, using SimBaseStd 0.02, baseline corrected with Viterbi-Kalman and Forward-Backward-Kalman, with different choices of BaseStd. A Corrected with VK, BaseStd 0.05. B Corrected with VK, BaseStd 0.02. C Corrected with VK, BaseStd 0.01. D Corrected with VK, BaseStd 0.005. E Corrected with FBK, BaseStd 0.05. F Corrected with FBK, BaseStd 0.02. G Corrected with FBK, BaseStd 0.01. H Corrected with FBK, BaseStd 0.005. Notice that the method seems quite robust and insensitive to the choice of BaseStd.
[Figure [4.4.2.11]: plots not reproduced; panels A and B at SNR 5, C at SNR 2, D at SNR 1 and E at SNR 0.5 show the relative error (Meas-True)/True of Std0, Amp1 and Std1 against BaseStd.]
Figure [4.4.2.11]. Re-estimation of conductance parameters from data simulated at different SNR, with stochastic baseline artifacts. A Data idealized and baseline corrected with Viterbi-Kalman. B, C, D, E Data idealized and baseline corrected with Forward-Backward-Kalman. The baseline correction and idealization procedure replaces the standard method, in Segmental k-means and Baum-Welch, respectively. For BaseStd ≤ 0.02, conductance parameters are underestimated by less than 5%. For BaseStd > 0.02 (large baseline fluctuations), the data is not idealized correctly, thus conductance parameters may be overestimated (especially the noise).
[Figure [4.4.2.12]: plots not reproduced; panels A (Viterbi-Kalman) and B (Forward-Backward-Kalman) show the relative error (Meas-True)/True of Std0, Amp1 and Std1 against BaseStd (0.005 to 0.05).]
Figure [4.4.2.12]. Conductance parameter re-estimation from simulated data at SNR 5 with stochastic baseline fluctuations, with SimBaseStd 0.02, baseline corrected with Viterbi-Kalman and Forward-Backward-Kalman, with different BaseStd. A Data baseline corrected with Viterbi-Kalman. B Data baseline corrected with Forward-Backward-Kalman. For both algorithms, the errors in amplitude estimation are very small and insensitive to the choice of BaseStd. For low BaseStd, not all the fluctuations are tracked, thus noise is overestimated. For high BaseStd, part of the state noise is attributed to baseline fluctuations, thus noise is underestimated.
[Figure [4.4.2.13]: plots not reproduced; panels A-C (Viterbi-Kalman) and D-F (Forward-Backward-Kalman), labeled MIL, MPL and MIP; each plots the rate constants K12, K21, K23 and K32 and the true rates (log scale) against SimBaseStd (0 to 0.1).]
Figure [4.4.2.13]. Kinetic parameter estimation from data simulated at SNR 5 with stochastic baseline fluctuations. Data simulated with different SimBaseStd is baseline corrected and idealized with Viterbi-Kalman and Forward-Backward-Kalman. The kinetics parameters are estimated with MIL and MIP from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data, while thin lines are parameters estimated from true data, using the same procedures. For SimBaseStd ≤ 0.05, the baseline correction does not cause any visible alteration of kinetics. At low SimBaseStd, VK and FBK correct almost identically, while at high SimBaseStd FBK is significantly better. MIL, MPL and MIP give nearly identical estimates.
[Figure [4.4.2.14]: plots not reproduced; panels A-C at SNR 2 and D-F at SNR 1, labeled MIL, MPL and MIP; each plots the rate constants K12, K21, K23 and K32 and the true rates (log scale) against SimBaseStd (0 to 0.05).]
Figure [4.4.2.14]. Kinetic parameter estimation from data simulated at SNR 2 and 1 with stochastic baseline fluctuations. Data simulated with different SimBaseStd is baseline corrected and idealized with Forward-Backward-Kalman. The kinetics parameters are estimated with MIL and MIP from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data, while thin lines are parameters estimated from true data, using the same procedures. For SimBaseStd ≤ 0.02, the baseline correction does not cause any visible alteration of kinetics, at either SNR 2 or 1. Regardless of SimBaseStd, at SNR 1 only MPL gives reliable kinetic estimates, as expected. At SNR 2, MPL is more accurate, but MIL and MIP are close.
[Figure [4.4.2.15]: plots not reproduced; panels A-C (Viterbi-Kalman) and D-F (Forward-Backward-Kalman), labeled MIL, MPL and MIP; each plots the rate constants K12, K21, K23 and K32 and the true rates (log scale) against BaseStd (0.005 to 0.05).]
Figure [4.4.2.15]. Kinetic parameter estimation from data simulated at SNR 5, with stochastic baseline fluctuations using SimBaseStd 0.02. Data is baseline corrected and idealized with Viterbi-Kalman and Forward-Backward-Kalman, with different BaseStd. The kinetics parameters are estimated with MIL and MIP from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data, while thin lines are parameters estimated from true data, using the same procedures. When data is corrected with VK, too low or too high values of BaseStd cause some small inaccuracies in kinetic estimation, with any of MIL, MPL or MIP (with MIP slightly worse). When FBK is used, the estimated kinetics are very accurate, irrespective of the optimization method.
4.4.3 Idealization and baseline correction of simulated data contaminated with
sinewave artifacts
The two baseline correction algorithms are optimal for stochastic baseline fluctuations
generated by a linear Gaussian model. It is therefore of considerable interest to determine how well
sinewave artifacts can be removed. To do so, I have designed and performed experiments similar
to those described in the preceding chapter, with the difference that the baseline artifacts took
the form of sinewave interference.
The amplitude (0.5 or 1.0) and the frequency (60 or 600 Hz) of the sinewave, relative to
the open class amplitude and sampling frequency, respectively, should cover most practical, even
extreme situations. The BaseStd parameter of the two baseline correction algorithms no
longer has a direct meaning. Empirically, I found and used the values that worked best (results not
shown). The optimal values for BaseStd apparently depend mostly on SineAmp and SineFreq,
irrespective of SNR. They are: 0.01 for SineAmp 0.5 and SineFreq 60, 0.02 for SineAmp 1 and
SineFreq 60, and 0.05 for SineAmp 0.5 or 1 and SineFreq 600.
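This type of contamination is straightforward to reproduce. The sketch below is my own illustration (function and variable names are hypothetical); it superimposes a sinewave, scaled relative to the open-class amplitude, and records the empirical BaseStd values just given as a lookup table:

```python
import numpy as np

def add_sinewave(trace, sine_amp, sine_freq, fs, open_amp=1.0, phase=0.0):
    """Superimpose sinewave interference on a current trace (sketch).

    sine_amp is expressed relative to the open-class amplitude open_amp
    (pA); sine_freq is in Hz; fs is the sampling rate in Hz.
    """
    t = np.arange(len(trace)) / fs
    return trace + sine_amp * open_amp * np.sin(2 * np.pi * sine_freq * t + phase)

# Empirically best BaseStd values reported in the text, keyed by
# (SineAmp, SineFreq); the dictionary itself is just a convenience.
BEST_BASESTD = {(0.5, 60): 0.01, (1.0, 60): 0.02,
                (0.5, 600): 0.05, (1.0, 600): 0.05}
```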
Figures [4.4.3.1] through [4.4.3.8] show the performance of the baseline correction in the
time domain. Viterbi-Kalman is used to correct only data with SNR 5 (Figures [4.4.3.1] and
[4.4.3.2]). Forward-Backward-Kalman is used to correct data with SNR 5 and below (Figures
[4.4.3.3] through [4.4.3.8]). At SNR 5, VK makes practically no mistakes for SineFreq 60 Hz with
SineAmp 0.5 and 1.0. For SineFreq 600 Hz, some errors appear at SineAmp 0.5, and they
become quite serious at SineAmp 1.0. FBK shows identical performance at SNR 5, with serious
errors visible only for SineFreq 600 Hz and SineAmp 1.0. At lower SNR, FBK again performs
very well. The limits seem to be SNR 2 with SineFreq 600 Hz and SineAmp 1, and SNR 1 with
SineFreq 600 and SineAmp 0.5. These limits approximately match the point where the trained
electrophysiologist would not hesitate to abort the patch and move on.
The next set of figures, [4.4.3.9] through [4.4.3.13], shows the performance of the
baseline correction in the frequency domain. In all but the extreme situations, the 60 or 600 Hz
component is eliminated, without visible alterations in the overall spectrum. At SNR 5, SineFreq
600 and SineAmp 0.5 (but, surprisingly, not SineAmp 1.0), both VK and FBK leave some 600 Hz
residual. At SNR 2 and SineFreq 600 Hz, the residual left by FBK is much more pronounced,
especially at SineAmp 1.0. At SNR 1, the leftover periodic component is serious, even at
SineFreq 60 Hz with SineAmp 0.5 or 1.0.
The next two figures, [4.4.3.14] and [4.4.3.15], show the re-estimation of conductance
parameters from baseline corrected data. The re-estimation error is approximately equal for
Viterbi-Kalman and Forward-Backward-Kalman. The parameters are mostly underestimated,
except outside the operational limits, where noise overestimation occurs. Again, the elimination of
baseline fluctuations seems to have an effect similar to that of a low-pass filter.
The last set of figures, [4.4.3.16] through [4.4.3.19], shows the estimation of kinetic
parameters from baseline corrected data. The four rate constants are plotted as a function of
SineAmp. The results parallel those obtained with stochastic baseline fluctuations: wherever the
baseline correction succeeds, the estimated kinetics are quite accurate. MPL is better than MIL or
MIP. MIL is somewhat better than MIP, and mostly when associated with Viterbi-Kalman.
[Figure [4.4.3.1]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.1]. Simulated data at SNR 5 with sinewave baseline artifacts of SineFreq 60 Hz and different SineAmp, idealized and baseline corrected with Viterbi-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Viterbi. D Data in A baseline corrected and idealized, with BaseStd 0.01. E Data in B baseline corrected and idealized, with BaseStd 0.02. There are no visible mistakes.
[Figure [4.4.3.2]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.2]. Simulated data at SNR 5 with sinewave baseline artifacts of SineFreq 600 Hz and different SineAmp, idealized and baseline corrected with Viterbi-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Viterbi. D Data in A baseline corrected and idealized, with BaseStd 0.05. E Data in B baseline corrected and idealized, with BaseStd 0.05. There are some visible mistakes for SineAmp 1.0, when the sinusoid upward slope can be mistaken for a channel opening.
[Figure [4.4.3.3]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.3]. Simulated data at SNR 5 with sinewave baseline artifacts of SineFreq 60 Hz and different SineAmp, idealized and baseline corrected with Forward-Backward-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Forward-Backward. D Data in A baseline corrected and idealized, with BaseStd 0.01. E Data in B baseline corrected and idealized, with BaseStd 0.02. There are no visible mistakes.
[Figure [4.4.3.4]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.4]. Simulated data at SNR 5 with sinewave baseline artifacts of SineFreq 600 Hz and different SineAmp, idealized and baseline corrected with Forward-Backward-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Forward-Backward. D Data in A baseline corrected and idealized, with BaseStd 0.05. E Data in B baseline corrected and idealized, with BaseStd 0.05. There are some visible mistakes for SineAmp 1.0, when the sinusoid upward slope can be mistaken for a channel opening.
[Figure [4.4.3.5]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.5]. Simulated data at SNR 2 with sinewave baseline artifacts of SineFreq 60 Hz and different SineAmp, idealized and baseline corrected with Forward-Backward-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Forward-Backward. D Data in A baseline corrected and idealized, with BaseStd 0.01. E Data in B baseline corrected and idealized, with BaseStd 0.02. There are some visible mistakes, as expected for this low SNR, almost irrespective of the sinusoid.
[Figure [4.4.3.6]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.6]. Simulated data at SNR 2 with sinewave baseline artifacts of SineFreq 600 Hz and different SineAmp, idealized and baseline corrected with Forward-Backward-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Forward-Backward. D Data in A baseline corrected and idealized, with BaseStd 0.05. E Data in B baseline corrected and idealized, with BaseStd 0.05. For SineAmp 1.0, the baseline tracking totally fails, as the sinusoid is mistaken for channel openings.
[Figure [4.4.3.7]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.7]. Simulated data at SNR 1 with sinewave baseline artifacts of SineFreq 60 Hz and different SineAmp, idealized and baseline corrected with Forward-Backward-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Forward-Backward. D Data in A baseline corrected and idealized, with BaseStd 0.01. E Data in B baseline corrected and idealized, with BaseStd 0.02. The idealization mistakes are as expected for this low SNR, with some additional ones caused by the sinusoid.
[Figure [4.4.3.8]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.8]. Simulated data at SNR 1 with sinewave baseline artifacts of SineFreq 600 Hz and different SineAmp, idealized and baseline corrected with Forward-Backward-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Forward-Backward. D Data in A baseline corrected and idealized, with BaseStd 0.05. For SineAmp 0.5, there are significant mistakes caused by the sinusoid. Data in B, with SineAmp 1.0, could not be idealized.
[Figure [4.4.3.9]: spectra not reproduced; panels A-H plot Log10 power (pA^2/Hz) against log10 frequency (Hz).]
Figure [4.4.3.9]. Power spectra of simulated data at SNR 5, with sinewave interference with SineFreq 60 Hz and different SineAmp, baseline corrected. A Simulated data with SineAmp 0.5. E SineAmp 1.0. B, F Data without sinewave. C, G Data in A and E baseline corrected with Viterbi-Kalman. D, H Data in A and E baseline corrected with Forward-Backward-Kalman. Both VK and FBK completely eliminate the 60 Hz component.
[Figure [4.4.3.10]: spectra not reproduced; panels A-H plot Log10 power (pA^2/Hz) against log10 frequency (Hz).]
Figure [4.4.3.10]. Power spectra of simulated data at SNR 5, with sinewave interference with SineFreq 600 Hz and different SineAmp, baseline corrected. A Simulated data with SineAmp 0.5. E SineAmp 1.0. B, F Data without sinewave. C, G Data in A and E baseline corrected with Viterbi-Kalman. D, H Data in A and E baseline corrected with Forward-Backward-Kalman. Both VK and FBK completely eliminate the 600 Hz component.
Log10 (pA^2/Hz)
Log10 (pA^2/Hz)
3
2
1
0
-1
2
3
log10 (Frequency [Hz])
3
2
1
0
-1
1
2
3
log10 (Frequency [Hz])
3
2
1
0
-1
0
1
2
3
log10 (Frequency [Hz])
E
3
2
1
0
-1
4
0
B
1
2
3
log10 (Frequency [Hz])
4
F
3
2
1
0
-1
4
Log10 (pA^2/Hz)
0
Log10 (pA^2/Hz)
1
Log10 (pA^2/Hz)
Log10 (pA^2/Hz)
0
A
0
1
2
3
log10 (Frequency [Hz])
3
2
C
4
G
1
0
-1
4
0
1
2
3
log10 (Frequency [Hz])
4
Figure [4.4.3.11]. Power spectra of simulated data at SNR 2, with sinewave interference with SineFreq 60 Hz and different SineAmp,
baseline corrected. A Simulated data with SineAmp 0.5. E SineAmp 1.0. B, F Data without sinewave. C, G Data in A and E baseline
corrected with Forward-Backward-Kalman. The 60 Hz component is completely eliminated.
[Figure panels: power spectra, Log10 (pA^2/Hz) versus log10 (Frequency [Hz]).]
Figure [4.4.3.12]. Power spectra of simulated data at SNR 2, with sinewave interference with SineFreq 600 Hz and different SineAmp,
baseline corrected. A Simulated data with SineAmp 0.5. E SineAmp 1.0. B, F Data without sinewave. C, G Data in A and E baseline
corrected with Forward-Backward-Kalman. The 600 Hz component is very much reduced for SineAmp 0.5, but a significant residual
remains for SineAmp 1.0.
[Figure panels: power spectra, Log10 (pA^2/Hz) versus log10 (Frequency [Hz]).]
Figure [4.4.3.13]. Power spectra of simulated data at SNR 1, with sinewave interference with SineFreq 60 Hz and different SineAmp,
baseline corrected. A Simulated data with SineAmp 0.5. D SineAmp 1.0. B, F Data without sinewave. C, G Data in A and E baseline
corrected with Forward-Backward-Kalman. At this low SNR, the 60 Hz component is reduced but not eliminated.
[Figure panels: relative errors (Meas-True)/True for Std0, Amp1 and Std1 versus sine amplitude; Viterbi-Kalman and Forward-Backward-Kalman, at 60 Hz and 600 Hz.]
Figure [4.4.3.14]. Conductance parameter re-estimation from data simulated at SNR 5, with sinewave interference, at different SineFreq
and SineAmp, baseline corrected and idealized. A and C Simulated data with SineFreq 60 Hz. B and D Simulated data at 600 Hz. A and B
Data baseline corrected and idealized with Viterbi-Kalman. C and D Data baseline corrected and idealized with Forward-Backward-Kalman. The amplitude is estimated accurately. For SineAmp 0.5, the noise is underestimated by less than 5% for SineFreq 60 Hz and by
less than 10% for SineFreq 600 Hz. VK and FBK give nearly identical results.
[Figure panels: relative errors (Meas-True)/True for Std0, Amp1 and Std1 versus sine amplitude, at SNR 2 and SNR 1, for 60 Hz and 600 Hz.]
Figure [4.4.3.15]. Conductance parameter re-estimation from data simulated at SNR 2 and 1, with sinewave interference, at different
SineFreq and SineAmp, baseline corrected and idealized with Forward-Backward-Kalman. A and C Simulated data with SineFreq 60 Hz.
B and D Simulated data at 600 Hz. A and B SNR 2. C and D SNR 1. The amplitude and the noise are estimated accurately, except for SineFreq 600 Hz, where the correction fails.
[Figure panels: rate constants K12, K21, K23 and K32 (log scale) versus sine amplitude, estimated with MIL, MPL and MIP at 60 Hz and 600 Hz; TrueK12-TrueK32 shown for reference.]
Figure [4.4.3.16]. Kinetic parameter estimation from data simulated at SNR 5 with sinewave interference. Data simulated with different
SineFreq and SineAmp is baseline corrected and idealized with Viterbi-Kalman. The kinetic parameters are estimated with MIL and MIP
from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data, while thin lines
are parameters estimated from true data, using the same procedures. For 60 Hz, the correction does not cause a significant alteration in the
estimated kinetics. For 600 Hz, the correction significantly changes the kinetics for SineAmp 1.0.
[Figure panels: rate constants K12, K21, K23 and K32 (log scale) versus sine amplitude, estimated with MIL, MPL and MIP at 60 Hz and 600 Hz; TrueK12-TrueK32 shown for reference.]
Figure [4.4.3.17]. Kinetic parameter estimation from data simulated at SNR 5 with sinewave interference. Data simulated with different
SineFreq and SineAmp is baseline corrected and idealized with Forward-Backward-Kalman. The kinetic parameters are estimated with
MIL and MIP from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data,
while thin lines are parameters estimated from true data, using the same procedures. For 60 Hz, the correction does not cause any visible
change in the estimated kinetics. For 600 Hz, the correction changes the kinetics for SineAmp 1.0.
[Figure panels: rate constants K12, K21, K23 and K32 (log scale) versus sine amplitude, estimated with MIL, MPL and MIP at 60 Hz and 600 Hz; TrueK12-TrueK32 shown for reference.]
Figure [4.4.3.18]. Kinetic parameter estimation from data simulated at SNR 2 with sinewave interference. Data simulated with different
SineFreq and SineAmp is baseline corrected and idealized with Forward-Backward-Kalman. The kinetic parameters are estimated with
MIL and MIP from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data,
while thin lines are parameters estimated from true data, using the same procedures. For 60 Hz, the correction does not cause a significant
change in the estimated kinetics. For 600 Hz, the correction fails for SineAmp > 0.5.
[Figure panels: rate constants K12, K21, K23 and K32 (log scale) versus sine amplitude, estimated with MIL, MPL and MIP at 60 Hz and 600 Hz; TrueK12-TrueK32 shown for reference.]
Figure [4.4.3.19]. Kinetic parameter estimation from data simulated at SNR 1 with sinewave interference. Data simulated with different
SineFreq and SineAmp is baseline corrected and idealized with Forward-Backward-Kalman. The kinetic parameters are estimated with
MIL and MIP from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data,
while thin lines are parameters estimated from true data, using the same procedures. For 60 Hz, the correction causes some change in the
estimated kinetics for SineAmp 1.0. For 600 Hz, the correction significantly changes the kinetics for SineAmp 0.5 and fails completely for
SineAmp 1.0.
4.4.4 Idealization and baseline correction of experimental data
After extensive testing with simulated data, the two baseline correction algorithms are
applied to real data. Real data, with real baseline fluctuations, deviates from the ideal hidden
Markov and linear Gaussian models in several ways. Non-stationarity is the first problem. Both
the conductance and kinetics parameters may change in time for various reasons. Changing
kinetics has a small effect on idealization but changing conductance levels has drastic
consequences. The effects of non-stationarity can be alleviated by segmenting the data and
running the algorithms on homogeneous individual segments. The second problem arises from
non-ideal models. The most critical deviation appears to be the presence of correlation in the state
noise. The correlation has two potential sources: over-filtering (a filter cutoff well below the Nyquist frequency), and intrinsic coloration arising from the physics of the channel. Additionally, the noise is not precisely Gaussian, exhibiting occasional spikes, heavy tails, etc. The
linear Gaussian model description of baseline fluctuations is only phenomenological, while the
real fluctuations may vary from patch to patch and not remain stationary inside a record. All of
this results in non-optimal functioning of both the hidden Markov and linear Gaussian algorithms.
On the other hand, the exquisite performance of the algorithm on sinewave interference suggests
that the method is not sensitive to details of the noise model.
I tested the algorithms on many different experimental data files. Here, I show results
with one representative data set, provided by Dr. Asbed Keleshian, of the University at Buffalo.
The data in Figure [4.4.4.1] is a single channel recording of Asbed??? channels. Visual inspection
of the data provides clear evidence of conductance and kinetic non-stationarity. Thus, I have
segmented the data into six traces, and each segment was idealized separately. The kinetics were
estimated from all the segments together. The data requires at least two closed states, but probably only one open state. I therefore chose the C-O-C model for idealization and kinetics estimation. The SNR is ≤ 5.
Figure [4.4.4.2] shows idealization obtained with Viterbi-Kalman and Figure [4.4.4.3]
with Forward-Backward-Kalman. As expected at this SNR, both algorithms perform similarly,
and make no obvious mistakes. Notice, however, the misclassification at the beginning of some
traces, committed by both VK and FBK. The idealization was totally automatic, and the initial
state probabilities were not manually initialized for each trace. The uncertainty at the beginning
of a trace may result in misclassification, but the algorithms quickly recover once evident
transitions are reached. This recovery is a very important property, guaranteeing that local
mistakes do not propagate. This is satisfactory for equilibrium data, but suggests that manual
intervention on starting values would be helpful for non-equilibrium data, i.e. data whose
evolution depends upon the proper starting values.
The non-stationarity in amplitude, kinetics, and baseline is clear in Figures [4.4.4.2] and [4.4.4.3], especially in trace B of the baseline corrected data. The amplitude of the open
conductance class decreases by about 30%, from ~3 pA to ~2.5 pA. I expected a BaseStd value
of 0.01-0.02 to be optimal. Since BaseStd is relative to the difference in conductance amplitudes, I scaled it up to a value of 0.04. Indeed, this value gave optimal results. Below this value,
the correction was not complete, and above it over-correction occurred (results not shown). Rate
constants were estimated with MIL, and are consistent whether the data was idealized with
Viterbi-Kalman or with Forward-Backward-Kalman.
[Figure panels: A-C current traces (scale bars: 5 pA, 200 ms; 5 pA, 20 ms; 2 pA, 10 ms); D three-state model 1-2-3 with rates 100.0, 100.0 (1-2) and 1000.0, 1000.0 (2-3).]
Figure [4.4.4.1]. File 02n28042, raw data from ?? channels. Sampling 20 kHz, bandwidth ?? Hz. A Medium resolution view. B Low
resolution view. C High resolution view. D Model used for idealization and baseline correction.
[Figure panels: A-C corrected traces (scale bars as in Figure [4.4.4.1]); D model 1-2-3 with MIL-optimized rates 350.25, 315.12 (1-2) and 761.83, 6304.62 (2-3); E marks a misplaced baseline.]
Figure [4.4.4.2]. Raw data of Figure [4.4.4.1], idealized and baseline corrected using Viterbi-Kalman, with BaseStd 0.04. Each trace was
idealized separately to minimize errors due to non-stationarity. A Medium resolution view. B Low resolution view. C High resolution view.
D Model optimized by MIL, dead time 0.08 ms. E Misplaced baseline at the beginning of trace.
[Figure panels: A-C corrected traces (scale bars as in Figure [4.4.4.1]); D model 1-2-3 with MIL-optimized rates 317.31, 280.98 (1-2) and 621.48, 5025.27 (2-3); E, F mark misplaced baselines.]
Figure [4.4.4.3]. Raw data of Figure [4.4.4.1], idealized and baseline corrected using Forward-Backward-Kalman, with BaseStd 0.04. Each
trace was idealized separately. A Medium resolution view. B Low resolution view. C High resolution view. D Model optimized by MIL,
dead time 0.08 ms. E, F Misplaced baseline at the beginning of trace.
4.5 Discussion
Two new baseline correction algorithms, Viterbi-Kalman and Forward-Backward-Kalman, are tested with simulated and real single channel data. While optimal for stochastic
baseline fluctuations generated by a simple linear Gaussian model, they also correct very well for
sinewave interference and baseline artifacts found in real data.
One might be tempted to re-estimate the BaseStd parameter from data, using the Kalman
filter-smoother re-estimation formulae. That would eliminate the arbitrariness of choosing
BaseStd. I have tested the re-estimation of BaseStd, with negative results (not shown). For
artifacts simulated with the linear Gaussian model, BaseStd is in general re-estimated correctly,
although in some cases it fails, returning a zero that prevents correction. Any deterministic
artifact, such as drift or sinewave interference, causes the re-estimation to fail completely: the total variance (signal plus state noise plus baseline) is transferred into the baseline process variance, and the Markov signal is treated as noise or baseline fluctuations. Segmenting the data
into small chunks will reduce the effect of the deterministic components.
Currently, BaseStd must be an empirical choice. A BaseStd of 0.01-0.02, relative to a
difference in conductance class amplitudes of 1, seems to be a good choice for most situations.
BaseStd must be linearly scaled for other values of the conductance difference. The methods can
also correct data containing substates (results not shown). Small values of BaseStd may not
correct all baseline fluctuations, but they do not affect re-estimation of the conductance
parameters. Large values of BaseStd will result in underestimated state noise, and possibly in overcorrection, when Markov state transitions are mistaken for abrupt changes in the baseline.
Importantly, both methods are robust and recover quickly from local misclassifications, unlike
some heuristic tracking algorithms. The larger the difference in noise between the closed and the open states, the lower the probability of misclassification. Also, longer data sets are less prone to errors than short ones, since the Markov estimates are done globally.
Further work is necessary to eliminate the misclassification errors that sometimes occur
at the beginning of data records. If manual assignment of initial state probabilities is possible,
such errors will not occur. One possible solution could be an initialization procedure: a short segment at the beginning of the record is processed forward. The state probabilities at its last point are used as initial state probabilities for a reversed (right-to-left) forward pass over the same segment. The state probabilities at the final point of this reversed pass can then be used as the starting probabilities of the regular algorithm.
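A minimal sketch of this initialization procedure, in Python with hypothetical helper names (not the thesis code; the transposed transition matrix is used here as a crude reverse-time kernel, which is an assumption of this sketch):

```python
def forward_step(p, A, b):
    """One forward-filter update: propagate p through A, weight by the
    per-state observation likelihoods b, and renormalize."""
    n = len(p)
    q = [sum(p[i] * A[i][j] for i in range(n)) * b[j] for j in range(n)]
    s = sum(q)
    return [x / s for x in q]

def forward(p, A, B_obs):
    """Run the forward procedure over a sequence of likelihood vectors."""
    for b in B_obs:
        p = forward_step(p, A, b)
    return p

def initial_probs(A, B_obs, n_init=50):
    """Hypothetical initialization sketch: a forward pass over a short leading
    segment, then a reversed (right-to-left) forward pass over the same
    segment; the result serves as the regular algorithm's starting vector."""
    n = len(A)
    p = [1.0 / n] * n                           # uninformative start
    seg = B_obs[:n_init]
    p = forward(p, A, seg)                      # left-to-right pass
    A_rev = [list(col) for col in zip(*A)]      # A transposed (assumption)
    return forward(p, A_rev, list(reversed(seg)))
```

The returned vector would replace the uniform initial state probabilities of the regular forward pass.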
Forward-Backward-Kalman clearly has a much wider operational range than Viterbi-Kalman. The wider range comes at the price of computational complexity and speed. VK has the
same complexity as standard Viterbi, with the addition of a simple Kalman filter step for each
pair of Markov states, for each data point. Roughly, VK takes twice as long as Viterbi. FBK, on
the other hand, takes much longer than Forward-Backward. In the initialization step, state
inference at each data point is a calculation repeated until convergence. If convergence is reached, on average, in ten iterations, the initialization step is approximately equivalent to running the Forward procedure and the Kalman filter ten times each. Afterwards, a Forward-Backward procedure and a Kalman filter-smoother are alternated until convergence. If convergence is reached, for example, in ten iterations, the overall cost of the Forward-Backward-Kalman algorithm is roughly 30 times that of a standard Forward-Backward and a Kalman filter-smoother taken together. It is necessary to choose the convergence criteria carefully to reach a
useful compromise between accuracy and speed.
Future work should investigate whether it is possible to incorporate correction for correlated noise into the Forward-Backward-Kalman algorithm. Instead of using meta-states and thus exponentially increasing the Markov model size, the noise correlation and filtering can be treated by extending the state of the linear Gaussian model, at possibly a much smaller computational
cost. The same expanded LGM state can be used for removal of sinewave interference with an
extended Kalman filter.
5. Infinite state-space hidden Markov models. Application to single
molecule fluorescence data from kinesin
This chapter is devoted to the introduction of hidden Markov models to the study of
staircase data from single molecules. The focus is on fluorescence measurements of kinesin
molecular steps, prompted by collaborations with Dr. Paul Selvin of the University of Illinois.
The molecular motor mechanism of kinesin is reviewed by Vale and Milligan (2000). Briefly,
conventional kinesin is a highly processive molecular motor that is involved in membrane
transport, mitosis and meiosis, messenger RNA and protein transport, ciliary and flagellar
genesis, signal transduction, and microtubule polymer dynamics (Goldstein and Philp, 1999). A
single molecule of kinesin can take several hundred steps along a microtubule without detaching
(Howard et al, 1989; Block et al, 1990), stepping from one tubulin subunit to the next (a distance
of 8 nm) (Svoboda et al, 1993; Visscher et al, 1999). The unidirectional movement is caused by a
large conformational change in the “neck linker” region of kinesin (Rice et al, 1999). The neck
linker is mobile when kinesin is bound to microtubules that are nucleotide-free or in ADP-bound
states, but becomes docked to the catalytic core after binding an ATP. The energy released in
ATP binding drives a forward motion of the neck linker (and of any object attached to its COOH-terminus). Thus, when ATP binding causes the neck linker of the forward head to dock, the
trailing head detaches from its binding site and is thrust forward to the next tubulin binding site.
The force produced is enough to pull the attached cargo forward by 8 nm. Figure [5.1], from Vale
and Milligan (2000), presents the succession of molecular steps.
Recent techniques have made possible visualization of individual molecules of kinesin
moving along the microtubule, with a resolution of a few nanometers (Selvin, 2002). Typically, a
fluorophore is attached to kinesin (onto the head or body), and a CCD camera on a high
magnification optical microscope records the position of the fluorescence signal. The
microtubules are fixed with respect to the microscope optical field. In off-line analysis, individual
molecules are localized in the field, and their trajectory is traced from frame to frame. The
distance between the present and the starting position of a molecule, projected along the tubulin
axis, is plotted as a function of time. Since kinesin moves in discrete forward steps, the resulting
data has a staircase appearance. Backward steps may occur occasionally as predicted for all
reversible reactions. The duration of each step has a stochastic nature, with an exponential
distribution. The magnitude of individual steps is generally assumed constant. Due to finite
sampling, the molecule may jump two or more steps between data points.
The presence of discrete levels in the data and their lifetime exponential distribution
(Okada and Hirokawa, 1999) motivated me to attempt to analyze the kinesin data with hidden
Markov models. However, direct application of standard Markov model algorithms to the kinesin
staircase data is not quite straightforward. The problem resides in the definition of states. The
kinesin molecule exists in a number of different conformations. Instantaneous transitions (at the
time resolution of the experiment) are possible between some of these kinetic states.
Occasionally, the molecule jumps to the next (or previous) position along the microtubule. The
jump originates in one kinetic state and terminates in the same or a different kinetic state. At each
position, the same set of states is available. Clearly, transitions between states without a position
jump cannot be substantiated directly, although they may be inferred from the statistics of the
state lifetimes. The kinetic states are aggregated into a single class. This is a major difference
from ion channels, which have (by definition) at least two distinct conductance classes.
Fortunately, the jump transitions can be observed. A description of the stepping process must
include a set of kinetic states and a number of transitions, with and without position change. The
occurrence of a step has to be somehow registered in the model. It is possible to fit small data sets
with our standard single channel programs (Jeney et al, 2000). Each position is simply called a
state. However, for larger data sets that require hundreds of states, that approach becomes too
computationally intensive.
Given a kinetic scheme for kinesin reactions, several quantities need to be estimated. The
amplitude of a single jump can be described, for example, by a Gaussian probability distribution
with the standard deviation resulting primarily from instrumental errors in measuring the position.
There may also be a contribution from the flexibility of kinesin itself. To complicate matters,
however, the step distribution may have several peaks (a convolution of a multinomial with a
Gaussian), since the substrate consists of repeated units, and the motor might reach over a number
of units in one step. The step parameters are analogous to the conductance parameters of ion
channels. As for ion channels, the rate constants governing state transitions (including those
resulting in position change) must be estimated.
An algorithm performing state inference and parameter re-estimation is necessary for
analysis of data sets consisting of many jumps. I assign each data point probabilistically to a
discrete level, and that level is located a known number of jumps from the starting position. With
level assignment, the average amplitude of each level is calculated. The averages are further
processed to determine the single jump distribution (assuming a unimodal distribution). These
levels, corresponding to different positions along the microtubule, must be treated as separate
discrete Markov states, or groups of aggregated states. From the inferred state sequence, Markov
model-based algorithms can, in principle, obtain maximum likelihood estimates of the transition
rates. Difficulties occur for a number of reasons. Since the labeling fluorophore has a limited
lifetime, the duration of a fluorescence record is finite. Nonetheless, finite does not equate to
small. The number of jumps occurring during the record is not known a priori.
In the present work, I model kinesin with an aggregated hidden Markov model with an
infinite number of states. The infinite-state model is truncated to a finite size having core states
with certain properties described in the following section. I use the truncated model to idealize the
data with a modified Forward-Backward procedure, similar to the Forward-Backward-Kalman
algorithm presented in Chapter 4. The idealization is used to determine the parameters of a single
jump. The kinetics parameters are also estimated from the idealization, using a modified version
of the MIP algorithm presented in Chapter 3.
Figure [5.1]. Motility cycles of conventional kinesin. A Each catalytic core (blue) is bound to a tubulin
heterodimer, along a microtubule protofilament. ATP binding to the leading head initiates neck linker
(red) docking. B Neck linker docking is completed by the leading head (yellow), which throws the trailing
head forward by 16 nm (arrow) toward the next tubulin binding site. C After a random diffusional search,
the new leading head docks tightly onto the binding site, and releases ADP. The trailing head hydrolyzes
ATP to ADP-Pi. D After ADP dissociates, an ATP binds to the leading head and the neck linker begins to
zipper onto the core (orange). The trailing head has released its Pi and detached its neck linker (red) from
the core and is in the process of being thrown forward. Adapted from Vale and Milligan, 2000.
5.1 Infinite state-space hidden Markov model
The kinesin molecule is assumed to have NK kinetic states, the set of which represents an
aggregation class. After each step (including the steps missed due to finite sampling), the
molecule enters a new aggregation class. A data set with NJ jumps requires NC = NJ + 1 classes,
and a total of NC*NK states, which differ either by the kinetic index k, or by the class (position)
index c. The resulting Markov model has a Q matrix of dimension (NC*NK) x (NC*NK). The Q
matrix has some interesting properties. The kinetic states in class c are directly connected only to
kinetic states in class c+1 (forward movement), and possibly to kinetic states in class c-1 (backward movement, less likely). An equivalent kinetic scheme is that of a chain with identical
units. A unit corresponds to an aggregation class, and is connected only to its right, and possibly,
left neighbor. The reaction flow along the unit chain can proceed left-to-right, or it can be
bidirectional. The chain can be, in principle, considered infinite in length.
If the complete state (kinetic state and class) at time t is known, the position at t+1 is
expected to be in the neighborhood. Within a sampling interval Δt, the molecule can take any number of steps, back and forth. The probability of finding the kinesin at positions further and further away decreases (multi-)exponentially. If the sampling interval is short relative to the kinetics, the probability becomes practically zero after a small number of steps. In other
words, at time t the probability mass function (pmf) is concentrated in the kinetic states of class c.
After one sampling interval, the pmf would have spread into the neighborhood of c, i.e. c+1, c+2
etc. Any position would have a finite occupancy probability, but it would be practically zero for
positions sufficiently far from c.
One cannot help making the analogy with the passing of an impulse signal through a filter. After passing through the filter, the signal is spread out. Although a theoretically infinite number of digital FIR coefficients would be necessary to describe the response, a finite number
gives adequate precision. The role of the filter coefficients is played here by the transition
probability matrix A. The impulse signal represents the state probabilities vector at time t and the
response is the state probabilities vector at time t+1 (the T symbol denotes transposition):
Pt 1  Pt  A
T
T
[5.1.1]
A truncated, finite size Atr matrix can replace the infinite A matrix. The truncated Atr is,
obviously, obtained from a truncated Qtr:
Atr = exp(Qtr · Δt)                  [5.1.2]
The truncated Qtr is built from a small, finite number of units, and it has a banded block structure. The banding results from the connectivity: only adjacent units are directly connected (uni- or bidirectionally). The transitions between kinetic states inside a unit c, as well as jump transitions from unit c to c+1, are the same for any c. Therefore, the blocks of the Qtr matrix, from the top-left to the bottom-right corner, are identical:
no-jump:       q_{i_c, j_c} = q_{i_{c+k}, j_{c+k}}         ∀ k                 [5.1.3]
forward jump:  q_{i_c, j_{c+1}} = q_{i_{c+k}, j_{c+k+1}}   ∀ k                 [5.1.4]
               q_{i_c, j_{c+k}} = 0                        ∀ i, j, c, k ≥ 2    [5.1.5]
The effects of truncation on the Atr matrix should be minimal in the center block, if the
process is bidirectional, or in the top-left corner, if the process is unidirectional. Therefore, if r is
the truncation order, 2r+1 or r units are necessary, depending on reversibility.
Figures [5.1.1] and [5.1.2] show an example of a unidirectional chain, and the calculated
Qtr and Atr matrices, of different truncation orders. Only four units are represented, but the chain
is potentially infinite, to the left and to the right. There is only one kinetic state per unit. The
forward rate is 0.25 s-1. In Figure [5.1.1], the Qtr is formed by a virtual “copy” operation, from the
Q matrix of a larger (infinite) model. By definition (see Chapter 2), the sum of each row of a Q
matrix must be zero. When formed by “copying” from a larger model, the resulting Qtr matrix no
longer has this property. Consequently, the truncated Atr matrix is not row stochastic, but has the
property:
a^tr_ij = a^tr_kl   if   (j − i) = (l − k).    [5.1.3]
In Figure [5.1.2], the Qtr is formed as the Q matrix of a finite model. For example, the
Qtr of truncation order 4 is the actual Q matrix of the model in Figure [5.1.1], A. The sum of each
row remains zero (notice the last row having all elements zero). Consequently, the truncated Atr
matrix is row stochastic. Although it no longer has the property [5.1.3], it has a good
approximation of it. More details about infinite-state Markov models are available, for example,
in (Kemeny et al, 1966).
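As a numerical check of equation [5.1.2], the truncated Atr of the unidirectional chain in Figure [5.1.1] can be reproduced with a short pure-Python sketch (illustrative variable names; a Taylor series suffices here because the norm of Qtr·Δt is small):

```python
def mat_mult(X, Y):
    """Product of two square matrices stored as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(Q, dt, terms=30):
    """Taylor-series matrix exponential exp(Q*dt); adequate for the small
    norms used here (production code would use scaling-and-squaring)."""
    n = len(Q)
    M = [[Q[i][j] * dt for j in range(n)] for i in range(n)]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    A = [row[:] for row in term]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mult(term, M)]  # M^k / k!
        A = [[A[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return A

# Unidirectional chain of Figure [5.1.1]: forward rate 0.25 s^-1, sampling 0.5 s,
# truncation order 4; Qtr "copied" from the infinite model (rows need not sum to zero).
rate, dt, r = 0.25, 0.5, 4
Qtr = [[-rate if j == i else rate if j == i + 1 else 0.0 for j in range(r)]
       for i in range(r)]
Atr = expm(Qtr, dt)  # first row approx. [.882497, .110312, .006895, .000287]
```

The resulting Atr is upper triangular and Toeplitz (property [5.1.3]), and its rows sum to slightly less than 1, reflecting the probability leaked past the truncation boundary.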
A

  1 --0.25--> 2 --0.25--> 3 --0.25--> 4        (backward rates 0.0)

B

  Q := [ -.25   .25    0.     0.  ]
       [  0.   -.25   .25     0.  ]
       [  0.     0.  -.25    .25  ]
       [  0.     0.    0.   -.25  ]

  A := [ .882497  .110312  .006895  .000287 ]
       [  0.      .882497  .110312  .006895 ]
       [  0.       0.      .882497  .110312 ]
       [  0.       0.       0.      .882497 ]

C

  Q := [ -.25   .25    0.     0.     0.     0.  ]
       [  0.   -.25   .25     0.     0.     0.  ]
       [  0.     0.  -.25    .25     0.     0.  ]
       [  0.     0.    0.   -.25    .25     0.  ]
       [  0.     0.    0.     0.   -.25    .25  ]
       [  0.     0.    0.     0.     0.   -.25  ]

  A := [ .882497  .110312  .006895  .000287  .8977e-5  .2244e-6 ]
       [  0.      .882497  .110312  .006895  .000287   .8977e-5 ]
       [  0.       0.      .882497  .110312  .006895   .000287  ]
       [  0.       0.       0.      .882497  .110312   .006895  ]
       [  0.       0.       0.       0.      .882497   .110312  ]
       [  0.       0.       0.       0.       0.       .882497  ]
Figure [5.1.1]. A unidirectional chain, with one kinetic state. A Model with four units. B Qtr and Atr of
truncation order 4. C Qtr and Atr of truncation order 6. Sampling time 0.5 s. The Qtr matrices are
“copied” from an infinite model. The Atr matrices are calculated with Maple, and are not row-stochastic
(the sum of each row is not 1), but have property [5.1.3].
A

  1 --0.25--> 2 --0.25--> 3 --0.25--> 4        (backward rates 0.0)

B

  Q := [ -.25   .25    0.     0.  ]
       [  0.   -.25   .25     0.  ]
       [  0.     0.  -.25    .25  ]
       [  0.     0.    0.     0.  ]

  A := [ .882497  .110312  .006895  .000296 ]
       [  0.      .882497  .110312  .007191 ]
       [  0.       0.      .882497  .117503 ]
       [  0.       0.       0.       1.     ]

C

  Q := [ -.25   .25    0.     0.     0.     0.  ]
       [  0.   -.25   .25     0.     0.     0.  ]
       [  0.     0.  -.25    .25     0.     0.  ]
       [  0.     0.    0.   -.25    .25     0.  ]
       [  0.     0.    0.     0.   -.25    .25  ]
       [  0.     0.    0.     0.     0.     0.  ]

  A := [ .882497  .110312  .006895  .000287  .8977e-5  .2292e-6 ]
       [  0.      .882497  .110312  .006895  .000287   .9206e-5 ]
       [  0.       0.      .882497  .110312  .006895   .000296  ]
       [  0.       0.       0.      .882497  .110312   .007191  ]
       [  0.       0.       0.       0.      .882497   .117503  ]
       [  0.       0.       0.       0.       0.        1.      ]
Figure [5.1.2]. A uni-directional chain, with one kinetic state . A Model with four units. B Qtr and Atr of
truncation order 4. C Qtr and Atr of truncation order 6. Sampling time 0.5 s. The Qtr matrices are of a
finite model. Notice the last row is all zero. The Atr matrices are calculated with Maple. As opposed to
figure [5.1.1], the Atr matrices are row-stochastic (the sum of each row is 1), and do not have property
[5.1.3].
5.2 Kinesin stepping model
A fluorophore can be attached to different parts of the kinesin molecule, resulting in
different characteristics of the observed signal (Selvin, 2002). A fluorophore positioned on the
body results in identical jumps and in apparent kinetics appropriate to organelle transport rates. A
fluorophore positioned on the foot results in jumps that are twice as large and twice as slow. A
fluorophore attached to the neck results in jumps of different sizes, with the sum of two consecutive
steps equal to twice the foot-attached amplitude, and kinetics matching organelle transport. The
present work focuses on data produced with a fluorophore attached to the body, but the methods
can be extended to other experiments.
I model the measured step size as a stochastic quantity, with a Gaussian probability
distribution. I refer to the mean and standard deviation of the distribution as JumpAmp and
JumpStd, respectively. This distribution incorporates intrinsic variability in the physical position
of the probe as well as measurement errors associated with finding the centroid of the observed
intensity distribution. Counting fluorescence photons is a Poisson process, but if the number of
photons contributing to one measurement is large enough, the Poisson will be approximated by a
Gaussian. I will refer to the standard deviation of the measurement as MeasStd. I consider
MeasStd to be constant in time and position.
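The observation model just described can be sketched in a few lines. This is an illustrative Python fragment, not code from QuB; the function name and its arguments are my own labels for the quantities JumpAmp, JumpStd and MeasStd defined above.

```python
import random

def observe_position(level, base_amp, jump_amp, jump_std, meas_std, rng=random):
    """One measured fluorophore position: each of the `level` jumps taken so
    far has Gaussian size N(jump_amp, jump_std^2), and centroid estimation
    adds Gaussian measurement noise N(0, meas_std^2)."""
    position = base_amp
    for _ in range(level):                      # accumulate the jumps taken so far
        position += rng.gauss(jump_amp, jump_std)
    return position + rng.gauss(0.0, meas_std)  # add measurement noise
```

With both standard deviations set to zero, the measurement reduces to the noise-free staircase BaseAmp + level * JumpAmp.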
5.3 Algorithms for staircase data
I divide the staircase data analysis into two parts. The first part idealizes the data and re-estimates jump distance parameters. The second part estimates the kinetics from the idealized
data. The next sections describe the two algorithms, one for idealization and the other for kinetics.
5.3.1 Position estimation – data idealization
The idealization is a general procedure by which the useful signal is extracted from noise.
The Viterbi algorithm (Viterbi, 1967; Forney, 1973) and the Forward-Backward (Baum et al,
1970) procedure idealize data produced by a hidden Markov model, when the model parameters are
known. For unknown parameters, the usual strategy is to run an Expectation-Maximization
algorithm (Dempster et al, 1977), such as Baum-Welch (Baum et al, 1970) or Segmental k-means
(Rabiner et al, 1986). An expectation step (Forward-Backward or Viterbi) performs state
inference and computes sufficient statistics, given current parameters. A maximization step
follows, re-estimating parameters from sufficient statistics. The two steps are iterated until a
convergence criterion is reached. An Expectation-Maximization algorithm is guaranteed to
converge to a local maximum of the likelihood function (Baum et al, 1970; Dempster et al,
1977).
For ion channels, the goal of idealization is to classify each data point to a conductance
class, with maximum probability. For kinesin staircase data, each point is similarly assigned to a
step position, or level, where each level corresponds to a defined position along the microtubule.
Thus, idealization eliminates the position uncertainty, but for each position the kinetic state of
kinesin is still undetermined, since aggregated states cannot be distinguished experimentally.
Consequently, for the purpose of kinetics, the likelihood function has to be marginalized over all
possible state sequences.
Several reasons prevent a direct application of a standard Markov state inference
algorithm to staircase data. The number of jumps in a record and, consequently, the number of
Markov states/classes, are not known a priori. The magnitude of a single jump is not known, and
it is not necessarily constant but may probabilistically differ from step to step. The variability of
the jump is either intrinsic to the physical process or it is an experimental artifact. Data may be
corrupted by steady, slow drift, caused by movement of the microscope stage relative to the CCD
camera. For this reason, even for exactly known jump amplitudes, the number of levels present in
data still cannot be calculated from the excursion of the signal between a minimum and a
maximum value. Due to the probabilistic nature of the jump and to the possible drift, the same
number of levels can result in different signal excursions.
A possible solution is to consider several values for the level count, idealize data for
each, and choose the number giving the most likely idealization, according to some criterion.
However, this method requires a possibly very large A matrix, and must be repeated several
times. A better solution is to devise a tracking procedure with a dynamically adjustable
reference, for detection of jumps, or, equivalently, levels. This is the procedure I propose next.
In principle, a jump detector must use a reference level. Inaccuracy in the current
JumpAmp estimate, mainly, and baseline drift render a detection procedure with static reference
ineffective. As tracking progresses, errors accumulate and false negative or false positive
mistakes in level detection are made. If, on the other hand, the reference level is adjusted
dynamically during tracking, the cumulative error is minimized. Furthermore, the reference level
can be used as the center unit of a truncated model, as described in the previous section. The
small, truncated model is translated along the data, and moved to the right (or to the left)
wherever a jump is detected. In effect, the algorithm cuts a subset of likely paths from the much
larger (infinite) set of possible paths. An auxiliary quantity can simply register, for each data
point, the offset of this truncated model.
I implemented this approach in a procedure similar to the Forward-Backward-Kalman
algorithm from Chapter 4, with the differences of using a truncated model and an offset variable.
I also tried a Viterbi-Kalman-like variant, which turned out to be unsatisfactory for the signal-to-noise ratio of the available experimental data. The procedure can be added to a Baum-Welch-type
algorithm, with appropriate modification of the re-estimation formulae. The next two sections
describe the new state inference procedure and the re-estimation.
5.3.1.1 Modified Forward-Backward-Kalman procedure for staircase data
Without loss of generality, I describe the procedure as it applies to a model with one
kinetic state per unit. Computation of sufficient statistics in the infinite state space is neither
possible nor necessary. Due to the properties discussed before, calculations can be done with a
truncated model. To accomplish this, a mapping between the states of the truncated model and the
states of the infinite model is needed. The index of the infinite model states is an integer number,
taking values between minus and plus infinity. The index of the truncated model states takes
values between 1 and 2r+1, where r is the truncation order (for more than one state per unit, the
state vector has 2rn+1 entries, where n is the number of states per unit). By definition, the infinite
model state with index zero corresponds to the starting level. I call this the reference state
(RefState). Initially, the middle state in the truncated model, with index (r+1), is mapped into
RefState. As the state inference procedure progresses forward, the mapping changes as new
levels are detected.
The mapping is realized with an additional quantity, ω (offset), which is a vector of the same length as the data. For each data point, ω_t indicates the state in the infinite model onto which the middle state of the truncated model is mapped. At the same time, ω_t is chosen so as to make the middle truncated state the most likely filtered estimate, i.e., the one with the maximum entry in the truncated α vector.

The combination of the state index i in the truncated model and the ω value maps into a state index l in the infinite model:

    l = i + ω_t                                                    [5.3.1.1.1]
The ω of the first point, ω_0, is initialized to -(r+1), and thus the middle truncated state, of index equal to (r+1), points to RefState:

    ω_0 = -(r+1)                                                   [5.3.1.1.2]

    (r+1) + ω_0 = 0                                                [5.3.1.1.3]

The state vectors (α, β and γ) are truncated to size 2r+1. They can be interpreted as potentially infinite in size but having only 2r+1 non-zero elements, for any t. The entries of the initial probabilities vector π (of the same 2r+1 size) are all zero, except for the middle state (or states, if more than one per unit), which is one:

    π_i = 1,  i = r+1
    π_j = 0,  j ≠ r+1                                              [5.3.1.1.4]
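The index bookkeeping above can be sketched as follows; this is an illustrative fragment with my own function names, using the 1-based state indices of the text.

```python
def initial_offset(r):
    """omega_0 = -(r+1): the middle truncated state (1-based index r+1)
    initially maps onto the infinite-model reference state, index 0."""
    return -(r + 1)

def infinite_index(i, omega):
    """Equation [5.3.1.1.1]: truncated index i plus offset omega gives the
    infinite-model state index l."""
    return i + omega
```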
The amplitude of the RefState level is equal to BaseAmp. The other levels' amplitudes are calculated relative to RefState and BaseAmp:

    Amp_l = BaseAmp + l · JumpAmp                                  [5.3.1.1.5]

BaseAmp is a parameter of the idealization and can be initialized, for example, by manual selection. MeasStd can be initialized in the same way. The emission probabilities for any level are calculated as:

    b_l(y_t) = N(Amp_l, MeasStd) |_{y_t}                           [5.3.1.1.6]

The α's are initialized:

    α_0^i = π_i · b_{i+ω_0}(y_0)                                   [5.3.1.1.7]
and they are all zero, except for i = r+1. For any 0 < t ≤ T, the recurrence of α is:

    α_{t+1}^j = [ Σ_{i=1}^{2r+1} α_t^i · a_ij ] · b_{j+ω_{t+1}}(y_{t+1})    [5.3.1.1.8]
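One step of this truncated forward recursion can be written as a short sketch (illustrative names, plain lists in place of the model's matrices, emission values b precomputed for the current offset):

```python
def forward_step(alpha, a, b):
    """One step of the truncated forward recursion: the new entry j is
    (sum_i alpha_i * a[i][j]) * b[j], with i, j over the 2r+1 truncated
    states (0-based here)."""
    n = len(alpha)
    return [sum(alpha[i] * a[i][j] for i in range(n)) * b[j] for j in range(n)]
```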
The effects of truncation on the A matrix can be minimized by choosing the appropriate entries (the top-left corner or the middle, for left-right or reversible models, respectively):

    left-right:    a_ij = a^tr_{1, 1+(j-i)}                        [5.3.1.1.9]

    reversible:    a_ij = a^tr_{r+1, r+1+(j-i)}                    [5.3.1.1.10]
The truncated model must be centered on the current level. Thus the index of the maximum element of α is searched for. If the index of the maximum is not that of the middle state (r+1), then ω_t is adjusted to make it so. α is then re-calculated, and this is repeated until the condition is fulfilled.
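The recentering step can be sketched as below; an illustrative fragment (names are mine), using 0-based indices so the middle position is index r, with entries shifted in from outside the window set to zero.

```python
def recenter(alpha, omega, r):
    """Shift the truncated alpha window so its maximum entry sits at the
    middle position (0-based index r), updating the offset omega so that
    each entry keeps pointing at the same infinite-model state."""
    n = 2 * r + 1
    while True:
        k = max(range(n), key=lambda i: alpha[i])   # index of the filtered maximum
        shift = k - r                               # how far off-center it is
        if shift == 0:
            return alpha, omega
        # slide the window by `shift` levels and move the mapping with it
        alpha = [alpha[i + shift] if 0 <= i + shift < n else 0.0
                 for i in range(n)]
        omega += shift
```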
A backward step follows, which calculates β and γ. Certainly, there is no guarantee that the same state that maximized α will maximize γ too. However, it is very likely that the maximum will at least be within the truncated state space. The state which maximizes γ is used for level assignment, in order to obtain the idealization:

    Level_t = arg max_i (γ_t^i) + ω_t                              [5.3.1.1.11]
Unfortunately, the matter is complicated by imprecise knowledge of BaseAmp and, more
importantly, JumpAmp. Consequently, even in the absence of actual baseline drift, each new
level detected in the data determines an increase in the prediction error of that level’s amplitude.
After a number of steps, the state inference and level assignment fail. A simple yet effective
solution to this problem is to assimilate the uncertainty in BaseAmp and JumpAmp to a virtual
baseline drift. This drift is modeled with a linear Gaussian process, as in equation [4.2.6], used to
describe baseline artifacts in single channel data. The baseline offset at each point is estimated
with a Kalman filter. The data value at each point is corrected by subtracting the estimated
baseline from it:
    y_t := y_t − b_t                                               [5.3.1.1.12]
The net result of this procedure is that the level prediction error is kept under control.
Since the so-estimated baseline does not correspond to a real drift but it is rather a computational
trick, it should be ignored for the purpose of re-estimation. Presence of true baseline drift is a
different matter, which I do not address here.
The resulting algorithm is equivalent to the initialization step of Forward-Backward-Kalman, presented in Chapter 4. Obviously, the expectation-maximization steps may be run as
well, but preliminary results (not shown) did not indicate a significant improvement. Further
work and more experimental data are necessary in order to clarify this issue.
Another modification to the Forward-Backward-Kalman algorithm resulted in dramatic
improvement. Its purpose was to accelerate adaptation to a new level, but to do so only in the
vicinity of detected jumps. The modification is simple. The amplitude of a jump is modeled as a
random Gaussian variable, with JumpAmp mean and JumpVar variance. Thus, if a jump is
detected between the previous and the current points, the variance of a jump, JumpVar must be
added to the baseline variance. If not one, but k levels are jumped between two consecutive
points, then k times JumpVar is added. The corrected BaseVar is easily calculated with help
from the α variables:

    BaseVar := BaseVar + JumpVar · | (arg max_j α_{t+1}^j + ω_{t+1}) − (arg max_i α_t^i + ω_t) | · k_v    [5.3.1.1.13]
Clearly, when the state maximizing (with high score) α_{t+1} is equal to the state maximizing (with high score) α_t, no variance is added, as no jump is indicated. The absolute index does not matter. If the difference between the two maximizing states is ±1, then one jump variance is added, and so on. k_v is an empirical factor, which preliminary tests indicated to be necessary. The best results (not shown) were obtained with k_v = 0.1.
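The variance inflation rule reduces to a one-liner; this is an illustrative sketch (names are mine) of [5.3.1.1.13] once the level difference k between consecutive points has been extracted from the α maxima:

```python
def inflated_base_var(base_var, jump_var, k, k_v=0.1):
    """When |k| levels are jumped between two consecutive points, add
    |k| * JumpVar, scaled by the empirical factor k_v, to the baseline
    variance; k = 0 leaves it unchanged."""
    return base_var + jump_var * abs(k) * k_v
```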
5.3.1.2 Re-estimation
The modified Forward-Backward-Kalman described in the previous section computes the
most likely sequence of states, given a guessed set of parameters. The parameters must now be re-estimated from the inferred state sequence. The re-estimation must update BaseAmp, JumpAmp,
JumpStd, MeasStd, and, optionally, the Atr matrix.
The first step of re-estimation is to find the number of levels present in the data. Two
auxiliary quantities are calculated, MinLevel and MaxLevel. They represent the lowest and the
highest levels in the data. MinLevel could actually be higher or lower than zero. For each of
these levels, the mean amplitude, Ampl, is calculated:
    Amp_l = [ Σ_{t=0}^{T} y_t · γ_t^{l−ω_t} ] / [ Σ_{t=0}^{T} γ_t^{l−ω_t} ]    [5.3.1.2.1]
It is clear from the expression above that the baseline does not intervene in re-estimation. Of
course, if a data point is not in the truncated space centered on level l, it is excluded from
calculation of the corresponding amplitude. In other words, not being in the truncated space
implies a  equal to zero.
BaseAmp is re-estimated as the amplitude of the zero level:
BaseAmp  Amp0
[5.3.1.2.2]
Once the level amplitudes are known, the standard deviation of the measurement noise, MeasStd
is updated:
    MeasStd = [ ( Σ_{t=0}^{T} Σ_{l=MinLevel}^{MaxLevel} γ_t^{l−ω_t} · (y_t − Amp_l)² ) / ( Σ_{t=0}^{T} Σ_{l=MinLevel}^{MaxLevel} γ_t^{l−ω_t} ) ]^{1/2}    [5.3.1.2.3]
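The amplitude and noise re-estimation can be sketched as below. This is an illustrative fragment (function names are mine); for compactness the occupancies are stored per point as a mapping from level to γ weight, so a level outside the truncated window at point t simply has weight zero.

```python
def reestimate_amplitudes(y, gamma):
    """Level amplitudes as gamma-weighted averages of the data, [5.3.1.2.1].
    gamma[t] maps level -> occupancy probability of point t."""
    levels = sorted({l for g in gamma for l in g})
    amp = {}
    for l in levels:
        w = sum(g.get(l, 0.0) for g in gamma)
        amp[l] = sum(y[t] * gamma[t].get(l, 0.0) for t in range(len(y))) / w
    return amp

def reestimate_meas_std(y, gamma, amp):
    """MeasStd, [5.3.1.2.3]: root of the gamma-weighted mean squared
    deviation of each point from its level amplitude."""
    num = sum(g.get(l, 0.0) * (y[t] - amp[l]) ** 2
              for t, g in enumerate(gamma) for l in amp)
    den = sum(g.get(l, 0.0) for g in gamma for l in amp)
    return (num / den) ** 0.5
```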
The re-estimation of JumpAmp is possible in several ways. For example, JumpAmp
can be calculated as the average difference between any two consecutive levels. Alternatively,
JumpAmp can be calculated from the difference in amplitude of any two levels. These quantities
are evidently divided by the difference in level. I tested both methods but, as it gives much better
results, I write expressions for the second method only:
    JumpAmp = Average{ (Amp_l − Amp_k) / (l − k) },  {k, l | k < l}    [5.3.1.2.4]
The averaging in [5.3.1.2.4] must be somehow weighted, so that less represented levels
(or possibly not represented at all) get correspondingly less weight and vice-versa: better
represented levels get more. The weighting factor of a level is inversely proportional to its
variance, which in turn is inversely proportional to the number of points assigned to that level.
The point assignment is given by . The weighting factor of a difference of levels is inversely
proportional to the sum of the two levels’ variances. Thus, using the weighting factors wkl,
JumpAmp is calculated as:
    JumpAmp = [ Σ_{k=MinLevel}^{MaxLevel} Σ_{l=k+1}^{MaxLevel} w_kl · (Amp_l − Amp_k)/(l − k) ] / [ Σ_{k=MinLevel}^{MaxLevel} Σ_{l=k+1}^{MaxLevel} w_kl ]    [5.3.1.2.5]

    w_kl = ( Σ_t γ_t^{k−ω_t} · Σ_t γ_t^{l−ω_t} ) / ( Σ_t γ_t^{k−ω_t} + Σ_t γ_t^{l−ω_t} )    [5.3.1.2.6]
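The weighting logic, with each level variance taken proportional to the inverse of its soft point count, can be sketched as follows (an illustrative fragment, names are mine):

```python
def reestimate_jump_amp(amp, n):
    """Weighted JumpAmp re-estimation: each pair (k, l) of levels contributes
    its amplitude difference per level, weighted by w_kl = n_k*n_l/(n_k+n_l),
    the inverse of the summed level variances when each variance scales as
    1/n.  amp[l] is the mean amplitude of level l, n[l] its point count."""
    levels = sorted(amp)
    num = den = 0.0
    for a, k in enumerate(levels):
        for l in levels[a + 1:]:
            w = n[k] * n[l] / (n[k] + n[l])
            num += w * (amp[l] - amp[k]) / (l - k)
            den += w
    return num / den
```

With equal counts the weights cancel and the estimate reduces to the plain average of the pairwise slopes.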
The standard deviation of the jump, JumpStd, must, however, be calculated from the
differences between consecutive levels' amplitudes:

    JumpStd = [ ( Σ_{l=MinLevel}^{MaxLevel−1} w_{l,l+1} · (Amp_{l+1} − Amp_l − JumpAmp)² ) / ( Σ_{l=MinLevel}^{MaxLevel−1} w_{l,l+1} ) ]^{1/2}    [5.3.1.2.7]
The re-estimation of the truncated Atr matrix is slightly different from the standard
procedure. tij is the conditional probability that the state at time t is i and the state at time t+1 is j,
given the whole observation sequence (see expression [2.3.1.1.12]). I calculate  in a way that
satisfies equation [5.1.3]:
 tkl 
2 r 1

i
t
 aij  b j  yt 1    t j 1
i 1 , j 1
1i t  2 r 1
1 j  t 1  2 r 1
 j i  t 1 t l k 
[5.3.1.2.8]
k , l | 1  k  2r  1,1  l  2r  1
t | 0  t  T 
The Atr is re-estimated in the usual way and, due to normalization, the property [5.1.3] may be
lost.
5.3.2 Kinetic parameters estimation
I introduce here a maximum likelihood method which estimates kinetic parameters from
staircase data, using truncations of infinite-state hidden Markov models. It is, in principle,
identical to the MIP algorithm, with certain modifications in the likelihood function. The method
is applied to kinesin single molecule fluorescence data. The algorithm uses the idealization
obtained with the modified Forward-Backward-Kalman procedure. It estimates the rate constants
of conformational transitions and the rate constants of the jumps between positions. The
algorithm provides standard deviations of the kinetic estimates and is capable of fitting multiple
data sets across different experimental conditions. As with MIP, it permits the user to impose
linear constraints on the estimated rate constants. I subsequently refer to this algorithm as MIPS
(Maximum Idealized Point likelihood of Staircase data).
At each position, the kinesin molecule may exist in several kinetic states, connected by
transition rates. The jump from one position to the next (or previous) may originate, and may end,
in one or several kinetic states. The jump rates corresponding to different combinations of initial
and final states may be identical or different. Since the kinetic scheme is identical at each
position, a description of the kinetic model requires the conformational kinetic scheme, at one
position, and the jumping scheme. A minimal representation is possible with only two units.
Figure [5.3.2.1] gives an example of a conformational kinetic scheme with two states. The jump
between two consecutive positions can only take place from state B to state A. The kinetic
parameters are k1, k2, j1 and j2.
    unit at x_t:       A --k1--> B   (B --k2--> A)
    unit at x_{t+1}:   A --k1--> B   (B --k2--> A)
    jumps:             B(x_t) --j1--> A(x_{t+1}),   A(x_{t+1}) --j2--> B(x_t)

Figure [5.3.2.1]. Kinesin kinetic model, with two states per position. Two units are shown, as part of a larger (infinite) chain. The squares A and B represent the kinetic states within the same unit, connected by transition rates k1 and k2, identical for every unit. State B in one unit is connected to state A in the next unit by jump rates j1 and j2. The model has four parameters: k1, k2, j1 and j2.
5.3.2.1 Parameterization and constraints
The parameterization and the constraints take the same form as presented in Chapter 3,
and are not repeated here.
5.3.2.2 Likelihood function
First, the idealized data is differentiated. The resulting data points are 0, if there is no
jump, or k, if there is a k-order jump between two consecutive observations (or –k if the jump is
downwards, for reversible models). In the following, for easier understanding and without loss of
generality, I assume that the maximum jump detected in the data is of order ±1 (no multiple-position jumps). I also suppose the model is reversible, and that the Qtr and Atr matrices have a
truncation order r = 1. Therefore, the truncated model has only three units (left, middle and right).
All the matrices and vectors used below have three blocks, corresponding to the three units.
The likelihood of a sequence of idealized staircase data points, in matrix form, is:
    L = P_0^T · B_0 · S · (A · B_1 · S) · (A · B_2 · S) · (A · B_3 · S) ⋯ (A · B_T · S) · 1    [5.3.2.1.1]
The likelihood expression is quasi-identical to that in [3.1.2.1], used by MIP. I discussed there the interpretation of the function and the significance of each term, so here I point out only the differences. P_0^T is the (transposed) initial state probabilities vector. All its entries are zero except those corresponding to the middle unit, which are non-zero and sum to one.
Bt is a diagonal matrix, whose diagonal entries, block-wise, are either 0 or 1. For points
corresponding to a jump up (differentiated idealized data is 1), the left block entries are 1 and all
the others are zero. For points corresponding to a jump down (differentiated idealized data is -1),
the right block entries are 1 and all the others are zero. For points corresponding to no jump, the
middle entries are 1 and the others 0. At each point t, the appropriate Bt matrix is plugged into the
likelihood expression. The B and S post-multiplying P0 can actually be removed from the
expression, as the first data point is always 0 by definition.
S is a matrix that sums, block-wise, the state probabilities, moves the sum into the middle
block and clears the left and right blocks. Its purpose is to reset the state probabilities vector. A
brief explanation is in order. The likelihood calculation starts with the initial state probabilities
P0, which is the a priori knowledge of the initial state occupancies. Then, for each time t, Pt+1 is
first predicted by post-multiplying Pt with the truncated Atr matrix. The prediction is then
updated using the evidence (the data), by post-multiplication with the diagonal matrix Bt+1. The
updated Pt+1 is the a posteriori estimate. The entries in the state vector corresponding to the states
for which there is no evidence in the data are zeroed. The updated state probabilities vector must
now be reset, so that only the middle block states are occupied and the others are zero. All that
matters is the spread of probabilities between two data points, with respect to their relative levels.
The absolute level of the current or of the next point is irrelevant. For example, if from time t to
time t+1 there is a jump from level 13 to level 14, only a relative jump of +1 is registered. This is
achieved by the S matrix. Of course, computationally the same result can be achieved without the
S matrix, but this trick permits a consistent matrix form of the likelihood function and of its
derivatives, as indicated below. Finally, the state probability vector at the last point is postmultiplied with a column vector of ones, which sums the probabilities of each state. This sum is
equal to the likelihood. Obviously, the more the prediction matches the evidence, the higher the
value of the likelihood.
The structure of the resetting matrix S for a reversible model of truncation order r=1, with
two kinetic states per position is shown in Figure [5.3.2.1.1]. The function of the S matrix is
immediately obvious from the figure. The truncation order of the model should be at least equal
to the order of the maximum jump detected in the actual data. For example, if the maximum jump
is 3, then the model has 2*3+1 = 7 units, and all matrices are sized accordingly. If the model is
left-to-right, then the left units can be dropped, since there is never any evidence in the data for
backwards steps. Thus, 3+1=4 units are enough. Other problems posed by irreversible schemes
are discussed in later sections.
One must notice that in the computation of the likelihood function, the state probabilities
decrease rapidly, leading to numerical underflow. A way to prevent this is by re-normalizing the
state vector at each point and using the normalized vector for computation of the next point. The
likelihood is calculated as the product of the norms for each point. Naturally, since this quantity
will also be extremely small, the logarithm of the likelihood is calculated instead, as the sum of
the logarithms of the norms.
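The scaling scheme can be sketched as a short loop. This is an illustrative fragment, not the QuB implementation; matrices are plain lists of lists, b_diags[t] holds the diagonal of B_t, and the example values in the test are hypothetical.

```python
import math

def staircase_log_likelihood(p0, a, s, b_diags):
    """Scaled evaluation of the likelihood [5.3.2.1.1]: the state vector is
    renormalized after every point, and log L is accumulated as the sum of
    the logs of the norms, avoiding numerical underflow on long records."""
    def matvec(v, m):  # row vector times matrix
        return [sum(v[i] * m[i][j] for i in range(len(v)))
                for j in range(len(m[0]))]

    # first point: P0 * B0 * S, then normalize
    p = matvec([p0[i] * b_diags[0][i] for i in range(len(p0))], s)
    norm = sum(p)
    ll = math.log(norm)
    p = [x / norm for x in p]
    # remaining points: predict (A), update (B_t), reset (S), normalize
    for bd in b_diags[1:]:
        p = matvec(p, a)
        p = [p[i] * bd[i] for i in range(len(p))]
        p = matvec(p, s)
        norm = sum(p)
        ll += math.log(norm)
        p = [x / norm for x in p]
    return ll
```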
A
    V = [ .01  .03  .2  .11  .3  .25 ]

B
        [ 0.  0.  1.  0.  0.  0. ]
        [ 0.  0.  0.  1.  0.  0. ]
        [ 0.  0.  1.  0.  0.  0. ]
    S = [ 0.  0.  0.  1.  0.  0. ]
        [ 0.  0.  1.  0.  0.  0. ]
        [ 0.  0.  0.  1.  0.  0. ]

C
    V·S = [ 0.  0.  .51  .39  0.  0. ]

Figure [5.3.2.1.1]. The function of the resetting matrix S, for a reversible model of truncation order r=1 with two kinetic states per position. A State vector before multiplication by S. B S matrix. C State vector after post-multiplication with S. Block-wise, the state probabilities are summed, moved to the center block, and the left and right blocks are cleared.
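The construction of S can be reproduced directly from the figure; this is an illustrative fragment (function names are mine), which rebuilds the example above:

```python
def resetting_matrix(r, n_states):
    """Resetting matrix S for truncation order r and n_states kinetic states
    per unit: post-multiplying a state vector by S sums the probabilities
    block-wise into the middle block and clears the other blocks."""
    n_units = 2 * r + 1
    dim = n_units * n_states
    mid = r * n_states                      # first column of the middle block
    s = [[0.0] * dim for _ in range(dim)]
    for row in range(dim):
        s[row][mid + row % n_states] = 1.0  # each state feeds its middle-block twin
    return s

def postmultiply(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]
```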
5.3.2.3 Computation and derivatives of the likelihood function
The computation and the derivatives of the likelihood function are similar to those
described in Chapter 3, with the difference of introducing the S matrix. I omit the time
dependency of the A matrix, and present only the final results.
The recursive forward and backward variables α and β are:

    α_t^T = P_0^T · B_0 · S · (A · B_1 · S) · (A · B_2 · S) ⋯ (A · B_t · S)    [5.3.2.3.1]

    α_{t+1}^T = α_t^T · A · B_{t+1} · S                            [5.3.2.3.2]

    α_0^T = P_0^T · B_0 · S                                        [5.3.2.3.3]

    β_t = (A · B_{t+1} · S) ⋯ (A · B_T · S) · 1                    [5.3.2.3.4]

    β_t = A · B_{t+1} · S · β_{t+1}                                [5.3.2.3.5]

    β_T = 1                                                        [5.3.2.3.6]

The log-likelihood LL is calculated from the normalization factors:

    LL = Σ_{t=0}^{T} log s_t                                       [5.3.2.3.7]
The derivatives of the log-likelihood are:
    ∂log L/∂x = (∂P_0^T/∂x · B_0 · S) · β_0 + Σ_{t=0}^{T−1} α_t^T · (∂A^tr/∂x · B_{t+1} · S) · β_{t+1}    [5.3.2.3.8]
The details of the calculation are the same as in Chapter 3 and are not presented here.
5.3.2.4 Speeding up the computation
I use the same technique to speed-up the computation as described in Chapter 3, by the
use of pre-multiplications.
5.4 Results
I tested the algorithm with simulated and experimental data. The experimental recordings
are not actually produced by kinesin experiments, but utilize the same optical equipment and
generate the steps by stage translation controlled by a computer (Selvin, 2002). In order to
simulate staircase data, I modified the single channel simulator within QuB (www.QuB.buffalo.edu). The staircase simulator is available as a new option in QuB. The user
needs to specify two (for irreversible) or three units (for reversible models). Each time a transition
occurs between states in different units, the current level is incremented by +/-JumpAmp,
depending on whether the jump is upward or downward. Then the state vector is reset back into
the middle unit. Of course, several jumps may, probabilistically, occur between two simulated
points. I have used both irreversible (left-right only) and reversible models. The irreversible
models have the right-left jump rates equal to zero.
The experimental data files I had available for analysis (Selvin, 2002) are sampled at
500ms, and are between 46 and 176 (46, 47, 51.5, 55.5, 96.5, 100.5, 109.5, 176) seconds long.
Therefore, I have simulated data with the same sampling frequency, and duration of 90 seconds,
trying to be roughly in the same signal-to-noise range. I start the results section by testing the
kinetics estimation with simulated data. With the kinetics performance so established, I test the
idealization procedure. I idealize both experimental and simulated data. The idealization of the
simulated data is compared against the true idealization. The kinetics estimated from idealization
is compared against the kinetics of the true idealization. The idealization of experimental data is
visually inspected. Both the kinetics estimation and the data idealization algorithms are
implemented as features of the QuB software (www.QuB.buffalo.edu).
5.4.1 Kinetics estimation of simulated data
The kinetics estimation was tested mostly with simulated data produced by a simple
irreversible model, with one kinetic state per position. I checked the following characteristics of
the algorithm: bias of the estimate, standard error of the estimate, effect of Atr matrix truncation
order, effect of jump truncation and effect of return rate. If the algorithm is not biased, then the
estimated rates should equal the true values when the amount of data is infinite. The algorithm
provides a standard deviation of the estimate, which should be interpreted as the degree to which
the estimate can be trusted, rather than as how close the estimate is to the true values. It is of
course desirable that the standard error of the estimate decreases rapidly with the amount of data.
The model of the process is, in principle, infinite. It must, however, be truncated to a
finite size. The truncated Atr matrix introduces some errors, which can be lessened with a larger
truncation order. Reducing the truncation error in this way creates computational problems, however, as larger matrices are
more difficult to handle. The truncation order must be chosen to be at least equal to the magnitude
of the highest occurring jump, higher if possible. If, for some reason, the order must be smaller
than the highest jump, then differentiated idealized data must be truncated to that order. The jump
truncation is therefore another source of errors, which can be avoided by faster sampling at higher
bandwidth (however, higher bandwidth results in higher noise, which may be even less desirable).
I refer to the model truncation order as TruncOrder, and to the jump truncation order as
MaxJump. Both are parameters of the optimization algorithm.
The effect of the return rate must be evaluated because of an unfortunate limitation of the
matrix exponentiation procedure. The spectral expansion method (Colquhoun and Sigworth,
1995a) requires that the eigenvalues of the Q matrix be distinct, which is almost always the case
for ion channels. The irreversible models used for kinesin data have confluent (multiple)
eigenvalues, though. It would be too complicated to devise a different exponentiation technique,
so I approximated the return rates with a small finite value, rendering the model very slowly
reversible. The return rates are chosen such as to give distinct eigenvalues, yet to cause minimal
errors in the estimated rates. Obviously, if the model is known to be irreversible, these rates are
constrained to be fixed and equal to a small constant. For example, the Qtr matrix in Figure
[5.1.2], B, gives three confluent eigenvalues of -0.25, and one zero. By making the backward rate
0.001s-1, the three non-zero eigenvalues are ~(-0.2734, -0.2510, -0.2286). With a backward rate of
0.0001 s-1, the three non-zero eigenvalues are ~(-0.2572, -0.2501, -0.2430), close enough to the
confluent eigenvalues yet distinct.
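The quoted eigenvalues can be checked against the closed-form spectrum of a constant-rate birth-death generator (a textbook result, not a formula from the thesis, which obtains the eigenvalues numerically):

```python
import math

def perturbed_chain_eigenvalues(f, b, n):
    """Nonzero eigenvalues of the n-state chain generator once a small
    backward rate b is added to forward rate f:
    lambda_k = -(f + b) + 2*sqrt(f*b)*cos(k*pi/n), k = 1..n-1."""
    return sorted(-(f + b) + 2.0 * math.sqrt(f * b) * math.cos(k * math.pi / n)
                  for k in range(1, n))
```

For f = 0.25 and b = 0.0001 with n = 4 this reproduces the values ~(-0.2572, -0.2501, -0.2430) quoted above.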
For an irreversible model, the only parameters to be optimized are the forward rates.
They are subject to linear constraints: the correspondent transition and jump rates in and between
different units are identical. For example, for the model in Figure [5.3.2.1], truncated to four
units, there are only four free parameters: k1, k2, j1 and j2, if reversible, and only k1, k2 and j1, if
irreversible.
The model used for simulation is shown in Figure [5.4.1.1].
    1 --0.25--> 2        (backward rate 0.0)

Figure [5.4.1.1]. Irreversible model used for simulation, with one kinetic state per unit. Only two units are shown.
5.4.1.1 Bias of the estimate
The parameters used for simulation are in Table [5.4.1.1.1]. Ten data sets were generated
and the estimated rate for each is compared with the rate calculated from the true number of
jumps, returned by the simulator. The true rate is equal to the number of jumps divided by the
duration of the data.
    PtsPerSeg   SegCnt   Sampling [msec]   ForRate [sec⁻¹]   BackRate [sec⁻¹]
    2400        1000     62.5              0.25              0

Table [5.4.1.1.1]. Simulation parameters. PtsPerSeg is the number of points per segment. SegCnt is the number of segments. ForRate and BackRate are the forward and the backward rates, respectively.
The parameters of the optimization (kinetics estimation) are in Table [5.4.1.1.2]. A model
truncated to 5 units was used (Figure [5.4.1.1.1]) and the jumps in the data were truncated to 3
(although it so happened that no jump higher than 3 was simulated).
BackRate  TruncOrder  MaxJump
0.0001    5           3

Table [5.4.1.1.2]. Optimization parameters. BackRate is the backward rate, constrained to be
constant. Any jump of order higher than MaxJump is truncated to MaxJump.
[Figure: five-unit staircase model, states 1 → 2 → 3 → 4 → 5, forward rate 0.25, backward rate displayed as 0.0]

Figure [5.4.1.1.1]. Model used for optimization, truncated to five units. The backward rate is actually
0.0001 s-1 (but displayed as 0.0).
Figure [5.4.1.1.2] presents the results of kinetics estimation. Clearly, there is no obvious
bias in the rate estimate.
[Figure: estimated forward rate (K12, red) and true rate (TrueK12, blue) vs. data set 0-10, y range ~0.2475-0.2495]

Figure [5.4.1.1.2]. Bias of the estimate. The forward rate estimated with the algorithm is plotted in
red, while the true rate, calculated by dividing the known number of jumps by the data duration, is
plotted in blue. Most of the data overlap, so the two colors cannot be distinguished.
5.4.1.2 Precision of the estimate
I used the same procedure as in the previous section. Data sets were generated with 1, 10,
100 and 1000 segments. The standard error was calculated for each data set. Figure [5.4.1.2.1]
shows the logarithm of the standard error of the estimate to be a linear function of the logarithm
of the number of data points.
[Figure: StdErr(forward rate) vs. log10(segment count), log-log plot, linear trend]

Figure [5.4.1.2.1]. Precision of the estimate. The standard error of the forward rate estimate is
plotted as a function of the amount of data, in a log-log graph. The precision is that expected for
Gaussian errors.
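The straight line in the log-log plot reflects the expected 1/sqrt(N) scaling of the standard error. A hedged sketch: the jump count over a record of duration T is roughly Poisson with mean k*T, so the estimate n/T has standard error ~sqrt(k/T); the segment duration below assumes 2400 points at 62.5 ms, as in Table [5.4.1.1.1].

```python
import numpy as np

rng = np.random.default_rng(0)
k_true = 0.25                  # forward rate, s^-1
seg_dur = 2400 * 0.0625        # one segment: 2400 points at 62.5 ms = 150 s

for seg_cnt in (1, 10, 100, 1000):
    T = seg_cnt * seg_dur
    # rate estimate ~ (number of jumps)/(total duration); jumps ~ Poisson(k*T)
    est = rng.poisson(k_true * T, size=500) / T    # 500 replicate data sets
    print(seg_cnt, round(est.std(), 5))            # drops ~sqrt(10) per decade
```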
5.4.1.3 Effect of jump truncation
Statistically, a fraction of dwells will be shorter than the sampling time, resulting in
apparent higher-order jumps. All jumps of order higher than MaxJump are converted (made
equal) to MaxJump. The consequence is that the apparent number of transitions in the data is
lowered, and thus the estimated kinetics appears slower. In general, jump truncation can be
avoided by increasing the model dimension, but undesirable numerical issues then occur in the
computation of the matrix exponential and its derivatives.
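The truncation step can be sketched as a clip on the level differences of the idealized sequence (an illustrative interpretation; `truncate_jumps` is a hypothetical helper, not the thesis's implementation):

```python
import numpy as np

def truncate_jumps(levels, max_jump):
    """Clip the order of every jump in an idealized level sequence to max_jump."""
    levels = np.asarray(levels)
    jumps = np.clip(np.diff(levels), -max_jump, max_jump)
    return np.concatenate(([levels[0]], levels[0] + np.cumsum(jumps)))

print(truncate_jumps([0, 1, 1, 5, 6], 3))   # the 4th-order jump is clipped to 3rd
```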
To test the effect of MaxJump, I simulated data with slower sampling, to increase
the frequency of multiple jumps (Table [5.4.1.3.1]). I estimated kinetics with three values of the
MaxJump parameter: 1, 2 and 3 (Table [5.4.1.3.2]). The estimated forward rate is plotted as a
function of the number of jumps in the data that had to be truncated. Truncating jumps slightly
decreases the rate constant, as expected for missed events.
PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]
300        1000    500              0.25             0

Table [5.4.1.3.1]. Simulation parameters.
BackRate  TruncOrder  MaxJump
0.0001    5           variable

Table [5.4.1.3.2]. Optimization parameters.
[Figure: estimated forward rate (K12) and true rate (TrueK12) vs. % truncated jumps (8.00E-04, 0.025, 0.6), y range 0-0.25]

Figure [5.4.1.3.1]. Effect of jump truncation. The estimated forward rate is plotted against the
fraction of jumps that were truncated. The estimated rate is not very sensitive to jump
truncation.
5.4.1.4 Effect of A matrix truncation order
To determine the effect of Atr matrix truncation on the estimated kinetics, I used three
truncation orders: 5, 4 and 3 (the three models in Figure [5.4.1.4.1]). To make the comparison
meaningful, I limited the maximum jump to 3 and chose simulation conditions so as to lower
the number of higher-order jumps (Table [5.4.1.4.1]). Table [5.4.1.4.2] has the optimization
conditions.
The estimated forward rate constant is identical for all truncation orders (to 6 digits),
so I do not show the results. The Atr matrix truncation order has practically no effect, at least for
this simple model, and should be chosen as the minimum size that can accommodate the highest
jump.
PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]
2400       1000    62.5             0.25             0

Table [5.4.1.4.1]. Simulation parameters.

BackRate  TruncOrder  MaxJump
0.0001    variable    3

Table [5.4.1.4.2]. Optimization parameters.
[Figure: three staircase models. A: states 1-5; B: states 1-4; C: states 1-3; forward rate 0.25, backward rate displayed as 0.0]

Figure [5.4.1.4.1]. Optimization models with different truncation orders. A Truncation order 5. B
Truncation order 4. C Truncation order 3.
5.4.1.5 Effect of backward rate
Ideally, the return rate for the irreversible, left-to-right model should be zero. To avoid
confluent eigenvalues in the Qtr matrix, the return rate was assigned a small value and
constrained to be fixed with respect to optimization (it is not a free parameter). Table [5.4.1.5.1]
has the simulation parameters and Table [5.4.1.5.2] the optimization parameters. The model in Figure
[5.4.1.4.1], A, is used for optimization. Ten data sets were simulated. The forward rate was estimated
for each, for four values of the backward rate (0.1, 0.01, 0.001 and 0.0001 s-1). For each of these
values, the average forward rate was calculated and then plotted as a function of the backward rate.
Obviously, the larger the value of the backward rate, the larger the value of the estimated
forward rate. However, the error is very small, even for relatively large values of the backward
rate.
PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]
2400       1000    62.5             0.25             0

Table [5.4.1.5.1]. Simulation parameters.

BackRate  TruncOrder  MaxJump
variable  5           3

Table [5.4.1.5.2]. Optimization parameters.
[Figure: estimated forward rate (K12, red) and true rate (TrueK12, blue) vs. backward rate (1E-04 to 0.1), y range ~0.246-0.249]

Figure [5.4.1.5.1]. Effect of backward rate approximation. The estimated forward rate is plotted in
red, versus the backward rate. The true forward rate, calculated by dividing the number of jumps by
the data duration, is plotted in blue. The estimated forward rate is relatively insensitive to the value
of the backward rate.
5.4.1.6 Log-likelihood surface
I test here the shape of the log-likelihood surface as a function of the forward rate
constant. Table [5.4.1.6.1] has the simulation parameters and Table [5.4.1.6.2] the optimization
parameters. The model in Figure [5.4.1.4.1], A, is used for optimization. Five data sets were simulated.
The forward rate was estimated for each, and the average was assigned to k12*. The log-likelihood
corresponding to k12* is denoted LL*, and the log-likelihood corresponding to other values of
k12 is denoted LL. The difference LL-LL* is plotted in Figure [5.4.1.6.1] against the relative
difference (k12-k12*)/k12*. The likelihood surface is smooth and has only one maximum over a
large range of values (Figure [5.4.1.6.2]).
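The single-maximum shape can be rationalized with a back-of-the-envelope model: for an irreversible staircase, the total jump count n over duration T is approximately Poisson with mean k*T, so LL(k) = n*ln(k) - k*T (up to a constant), maximized at k* = n/T. A sketch with assumed numbers (n and T are illustrative, chosen to give k* = 0.25):

```python
import numpy as np

n, T = 9000, 36000.0                # assumed: ~9000 jumps over 36000 s
k_star = n / T                      # maximum-likelihood rate, 0.25 s^-1
for rel in (-1e-1, -1e-2, -1e-3, 0.0, 1e-3, 1e-2, 1e-1):
    k = k_star * (1.0 + rel)
    dLL = n * np.log(k / k_star) - (k - k_star) * T   # LL(k) - LL(k*)
    print(f"{rel:+8.0e}  {dLL:10.3f}")
# dLL is 0 at rel = 0 and strictly negative elsewhere: one smooth maximum.
```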
PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]
2400       1000    62.5             0.25             0

Table [5.4.1.6.1]. Simulation parameters.

ForRate   BackRate  TruncOrder  MaxJump
variable  0.0001    5           3

Table [5.4.1.6.2]. Optimization parameters.
[Figure: LL-LL* (0 to 250) vs. (k12-k12*)/k12* (-1e-1 to 1e-1, log-spaced)]

Figure [5.4.1.6.1]. Log-likelihood surface. On the x axis is the relative difference between the
forward rate k12 and the maximum likelihood forward rate k12*. On the y axis is plotted the
difference between the LL for k12 and the LL* for k12*.
5.4.1.7 Reversible model
I test here the behavior of the algorithm with a reversible model having one state per unit. Since
the kinesin stepping process is known to be biased in one direction, for simulation I chose the
backward rate to be ten times smaller than the forward rate. The two rates are now both free
parameters, to be optimized.
Figure [5.4.1.7.1] shows the reversible model used for simulation. In Figure [5.4.1.7.2] is
the simulation model as actually used in QuB. Table [5.4.1.7.1] has the simulation and Table
[5.4.1.7.2] has the optimization parameters. Figure [5.4.1.7.3] shows the reversible model used
for optimization, of truncation order 5.
10 data sets were simulated. The estimated forward and backward rates for each data set
are plotted in Figure [5.4.1.7.4]. Both rate estimates are precise and unbiased.
[Figure: two-state reversible model, 1 ⇄ 2, forward rate 0.25, backward rate displayed as 0.03]

Figure [5.4.1.7.1]. Reversible model used for simulation. The backward rate is actually 0.025 (but
displayed as 0.03).
[Figure: three-state model as entered in QuB, with forward rate 0.25 and backward rate displayed as 0.03]

Figure [5.4.1.7.2]. Model used for simulation in QuB. The backward rate is actually 0.025 (but
displayed as 0.03).
PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]
300        1000    500              0.25             0.025

Table [5.4.1.7.1]. Simulation parameters.
TruncOrder  MaxJump
9           3

Table [5.4.1.7.2]. Optimization parameters.
[Figure: nine-unit reversible staircase model, states 1 ⇄ 2 ⇄ … ⇄ 9, forward rate 0.25, backward rate displayed as 0.03]

Figure [5.4.1.7.3]. Model used for optimization. The backward rate is actually 0.025 (but
displayed as 0.03).
[Figure: estimated and true forward and backward rates (K12, TrueK12, K21, TrueK21) vs. data set 0-10, y range 0.00-0.25]

Figure [5.4.1.7.4]. Rate estimation for a reversible model. The estimated forward and backward
rates are plotted with red and magenta symbols, respectively. The true forward and backward rates,
calculated by dividing the known number of jumps (up and down, respectively) by the duration of
the data, are plotted in blue and green, respectively. Most of the data overlap, so the two colors
cannot be distinguished, for either the forward or the backward rate.
5.4.2 Data idealization
The quality of idealization affects both the jump parameters and the kinetic parameters.
For staircase data, correct re-estimation of jump parameters is critical. If JumpAmp is
underestimated, more apparent jumps are detected in the data and the estimated kinetics is faster.
The opposite happens when JumpAmp is overestimated. Whether under- or overestimation
occurs, it is at least desirable that level tracking remains reasonably good, and only occasional
false positive or false negative jumps are detected.
The parameters that affect idealization are the signal-to-noise ratio and the rate of the kinetics
relative to the sampling frequency (assumed to be at the Nyquist limit). The signal-to-noise
ratio is rigorously defined as the ratio between the power of the signal and the power of the noise.
There are three sources of noise or uncertainty: the variance of the jump (JumpStd), the variance
of the measurement (MeasStd) and possible baseline drift. As in Chapter 4, I use here a more
intuitive definition of SNR, namely the ratio between the average jump (JumpAmp) and the
measurement standard deviation (MeasStd):
SNR = JumpAmp / MeasStd    [5.4.2.1]
The definition does not consider the intrinsic jump variance, and thus a non-zero JumpStd
lowers the apparent SNR. So does, of course, any drift or fluctuation in the baseline.
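A hedged sketch of how jump variance lowers the apparent SNR, folding JumpStd into the noise term in quadrature (this combined form is my illustration, not a formula from the text):

```python
import numpy as np

def apparent_snr(jump_amp, meas_std, jump_std=0.0):
    """SNR = JumpAmp/MeasStd by definition [5.4.2.1]; a non-zero JumpStd adds
    (in quadrature) to the effective noise seen at each jump."""
    return jump_amp / np.hypot(meas_std, jump_std)

print(apparent_snr(1.0, 0.1))        # 10.0  -- nominal SNR
print(apparent_snr(1.0, 0.1, 0.2))   # ~4.47 -- jump variance degrades it
```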
Experimental data were generated in Dr. P. Selvin’s laboratory with one-state-per-unit,
irreversible models, at different signal-to-noise ratios. Although no baseline drift was deliberately
added, some drift is visible. Similarly, the jumps were intended to be of constant amplitude;
however, visual inspection clearly shows variability in the jump amplitude. This is probably due to
the resolution of the digital-to-analog converter that sends the control pulse to the piezoelectric
actuator moving the microscope stage, or to various frictional errors in the stage. Testing the
idealization algorithms demands simulated data with the same characteristics as the available
experimental records. In the next two sections I present results for idealization of simulated data
and of experimental data.
5.4.2.1 Idealization of simulated data
As the experimental data were generated with a one-state, irreversible model, the
simulation used the same model (Figure [5.4.2.1.1]). One data set was generated for each of the
simulation conditions shown in Table [5.4.2.1.1]. Figure [5.4.2.1.2] gives visual clues as to the
appearance of these simulated data sets. The seed of the random number generator used by the
simulator was the same in each experiment, so the generated state (level) sequences were
identical, irrespective of the parameters used in simulation. The forward rate constant was
estimated from the true idealization (an output of the simulator). The parameters of optimization
and the results are in Table [5.4.2.1.2]. The 5-unit model in Figure [5.4.1.4.1], A, was used for
optimization. The idealization and optimization parameters common to all experiments are given
in Table [5.4.2.1.3].
[Figure: two-unit staircase model, states 1 → 2, forward rate 0.25, backward rate 0.0]

Figure [5.4.2.1.1]. Irreversible model used for simulation, with one kinetic state per unit. Only two
units are shown.
SNR   PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]  JumpAmp  JumpStd  MeasStd
10    300        99      500              0.25             0                 1        0.0      0.1
10*   300        99      500              0.25             0                 1        0.1      0.1
10**  300        99      500              0.25             0                 1        0.2      0.1
5     300        99      500              0.25             0                 1        0.0      0.2
5*    300        99      500              0.25             0                 1        0.1      0.2
5**   300        99      500              0.25             0                 1        0.2      0.2
2     300        99      500              0.25             0                 1        0.0      0.5
2*    300        99      500              0.25             0                 1        0.1      0.5
2**   300        99      500              0.25             0                 1        0.2      0.5
1     300        99      500              0.25             0                 1        0.0      1.0

Table [5.4.2.1.1]. Simulation parameters.
TruncOrder  MaxJump  BackRate  ForRate   StdDev      StdErr  TruncCount  LL
5           3        0.0001    0.243106  0.00405294          0           -11331.82266

Table [5.4.2.1.2]. Optimization parameters and results.
Idealization:
  Forward rate [s-1]   Backward rate [s-1]   MaxJump
  0.25*                0.01*                 12

Kinetics estimation:
  TruncOrder   MaxJump   Backward rate [s-1]
  5            3         0.0001

Table [5.4.2.1.3]. Idealization and optimization parameters common to all experiments. For
experiments where all initial parameters for idealization were arbitrary, the forward rate was 1 s-1
and the backward rate was 0.1 s-1.
One of the most important properties of the idealization, or of any algorithm, is
convergence to a stable solution, i.e., self-consistency. With experimental data, one does not
know the true values of the optimized parameters. The algorithm must then be started from initial,
guessed parameters. The initialization of parameters is arbitrary, and so it is important that, in
general, different initial values lead to the same solution. In other words, it is desirable that the
algorithm have a wide basin for the optimal solution attractor. This problem is connected with the
smoothness of the likelihood surface. A rough surface translates into many local maxima at which
the algorithm can get stuck, although introducing randomness into the optimizer, as in simulated
annealing, can increase the chance of finding a global optimum.
In the first experiment, all parameters were assigned initial (arbitrary) values within
approximately 20% of the true values. I recorded the convergence of the idealization algorithm,
focusing on JumpAmp, JumpStd, MeasStd, BaseAmp and LL (log-likelihood). Figures
[5.4.2.1.3] through [5.4.2.1.10] show the results. For SNR 2 with JumpStd 0.2, and for SNR 1, the
algorithm did not converge (results not shown). For the other noise conditions (SNR 2 to 10,
JumpStd 0 to 0.2), the convergence is robust.
A second set of experiments determined the importance of each parameter. All but one
parameter were assigned their respective true values. The remaining parameter was assigned different
initial values and the convergence observed. BaseAmp appears to have no effect at all,
irrespective of SNR (results not shown). In fact, in a preliminary step, BaseAmp is compared
against the first data point. If the absolute difference between the two is larger than two
measurement noise standard deviations (MeasStd), then BaseAmp is re-assigned the value of the
first data point. Thus, a value of BaseAmp lying within two standard deviations of the first data
point always results in convergence to the same solution.
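The preliminary check described above can be sketched directly (the function name is illustrative):

```python
def init_base_amp(base_amp_guess, first_point, meas_std):
    """If the guessed baseline is more than two measurement standard deviations
    away from the first data point, re-assign it to the first data point."""
    if abs(base_amp_guess - first_point) > 2.0 * meas_std:
        return first_point
    return base_amp_guess

print(init_base_amp(0.5, 0.02, 0.1))    # 0.02 -- guess rejected, re-assigned
print(init_base_amp(0.15, 0.02, 0.1))   # 0.15 -- guess within 2*MeasStd, kept
```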
MeasStd has a very wide range over which the algorithm converges to the same optimal
solution (results not shown). For example, for data with SNR 10 and any JumpStd (0.0, 0.1 or
0.2), any value of MeasStd between 1E-10 and 2 gave convergence (the true value is 0.1). For SNR 2
and JumpStd 0.0 or 0.1, the convergence interval was (0.01, 2.0), for a true value of 0.5. Clearly,
any initial guess of MeasStd one is likely to make will fall within this wide interval.
JumpAmp is the most critical of all parameters. Its attraction basin is wide for high SNR
but gets narrower as the SNR degrades, and especially as the jump variance rises. Figures [5.4.2.1.11]
through [5.4.2.1.18] show the convergence for different starting values of JumpAmp. At lower
SNR, multiple attractors become common. Submultiples of the true value (1/2, 1/3, 1/4) have
their own attractors. These situations can be easily recognized and corrected: the indicator is the
absence of first-order jumps in the idealization, which is statistically highly improbable.
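The corrective check can be automated: at a submultiple attractor every true jump idealizes as order 2 or higher, so the absence of first-order jumps flags the problem (a minimal sketch; the function is illustrative):

```python
import numpy as np

def looks_like_submultiple(idealized_levels):
    """Flag an idealization whose non-zero jumps are all of order >= 2 --
    the statistically improbable signature of a submultiple JumpAmp attractor."""
    orders = np.abs(np.diff(np.asarray(idealized_levels)))
    orders = orders[orders > 0]
    return bool(orders.size) and not np.any(orders == 1)

print(looks_like_submultiple([0, 2, 2, 4, 6]))   # True: only 2nd-order jumps
print(looks_like_submultiple([0, 1, 1, 2, 4]))   # False: 1st-order jumps present
```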
Although an exhaustive search of the parameter space is impossible, it appears that the
attraction basin is wide enough to make the algorithm relatively robust with respect to the
choice of initial values. For all parameters, it is clear that as the SNR degrades, more iterations
are necessary to reach the convergence point.
It is common sense that idealization will be more accurate for longer dwell durations
(the average time kinesin spends in one position). Longer levels can be tracked better and the
statistics become more robust. Other than by modifying the intrinsic rate of the process (the kinetics),
longer dwells can be achieved only by a faster sampling rate, at higher bandwidth.
Alas, faster sampling implies that fewer photons contribute to each sample, and fewer photons
mean higher measurement noise. The variance of the measurement noise is inversely proportional
to the number of photons collected per sample. Thus, one may have longer dwells only at the
price of higher noise. Conversely, one may get lower measurement noise by slowing the
sampling rate, at the price of increasing the fraction of higher-order jumps. Higher-order jumps
are especially adverse when the jump variance is severe. For example, the variance of a fourth-order
jump is four times larger than the variance of a single jump. With such large variance, the
prediction becomes less accurate and misdetections are likely.
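The tradeoff can be made concrete with the noise scaling used in Table [5.4.2.1.4]: halving the sampling time halves the photons per point, doubling the noise variance, so MeasStd scales as the square root of the sampling rate (a sketch; the reference values are the SNR 5 row of the table):

```python
import numpy as np

ref_std, ref_dt = 0.2, 500.0        # reference: MeasStd 0.2 at 500 ms sampling
for dt in (2000.0, 1000.0, 500.0, 250.0, 125.0):
    meas_std = ref_std * np.sqrt(ref_dt / dt)   # variance ~ 1/(photons per point)
    print(f"{dt:6.0f} ms  MeasStd {meas_std:.6f}")
# reproduces the MeasStd column: 0.1, 0.141421, 0.2, 0.282843, 0.4
```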
The question therefore arises as to the optimal compromise between measurement
noise and the fraction of higher-order jumps. I did experiments to answer this question. Since the
lifetime of the fluorophore remains constant, irrespective of sampling, a doubling of the sampling
frequency implied a doubling of the data length, a halving of the frequency implied a halving of the
length, and so on. In effect, each data set had the same average number of jumps. The simulated
data with SNR 5 (JumpStd 0.0, 0.1 and 0.2) were the reference, with a data length of 300 points
and a sampling time of 500 ms. Table [5.4.2.1.4] shows the simulation parameters for each generated
data set.
Figure [5.4.2.1.19] shows the convergence estimates of JumpAmp, MeasStd and the
forward rate. Although it appears that the measurement noise is estimated better when the dwells
are longer, the jump amplitude and the forward rate constant are estimated much better when the
dwells are shorter but the noise is lower. It should be noted that the data set with the lowest noise
(SNR 10) corresponds to an average of only 2 points per dwell. Pushing towards even lower
noise is likely to have a negative effect.
SNR     PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]  JumpAmp  JumpStd  MeasStd
10      75         100     2000             0.25             0                 1        0.0      0.1
10*     75         100     2000             0.25             0                 1        0.1      0.1
10**    75         100     2000             0.25             0                 1        0.2      0.1
7.07    150        100     1000             0.25             0                 1        0.0      0.141421
7.07*   150        100     1000             0.25             0                 1        0.1      0.141421
7.07**  150        100     1000             0.25             0                 1        0.2      0.141421
5       300        100     500              0.25             0                 1        0.0      0.2
5*      300        100     500              0.25             0                 1        0.1      0.2
5**     300        100     500              0.25             0                 1        0.2      0.2
3.5     600        100     250              0.25             0                 1        0.0      0.282843
3.5*    600        100     250              0.25             0                 1        0.1      0.282843
3.5**   600        100     250              0.25             0                 1        0.2      0.282843
2.5     1200       100     125              0.25             0                 1        0.0      0.4
2.5*    1200       100     125              0.25             0                 1        0.1      0.4
2.5**   1200       100     125              0.25             0                 1        0.2      0.4

Table [5.4.2.1.4]. Simulation parameters.
[Figure: simulated traces at SNR 10, 10*, 10**, 5, 5*, 5**, 2, 2*, 2** and 1; scale bars 1 pA, 2 s]

Figure [5.4.2.1.2]. Simulated data with different SNR. The true idealization (the red line) is
overlapped on top of each trace to make visible the deviation due to jump variance.
JumpAmp is 1 in each trace. MeasStd is 0.1, 0.2, 0.5 and 1.0, for SNR 10, 5, 2 and 1,
respectively. One star corresponds to JumpStd 0.1, two stars to JumpStd 0.2 and no star to
JumpStd 0.0.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.3]. Effect of starting values on idealization convergence. Simulated data with
SNR 10 and JumpStd 0.0. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
In all cases, the value of the estimated parameters at convergence was the true value, or
very close to it.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.4]. Effect of starting values on idealization convergence. Simulated data with
SNR 10 and JumpStd 0.1. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
In all cases, the value of the estimated parameters at convergence was the true value, or
very close to it.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.5]. Effect of starting values on idealization convergence. Simulated data with
SNR 10 and JumpStd 0.2. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
In all cases, the value of the estimated parameters at convergence was the true value, or
very close to it.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.6]. Effect of starting values on idealization convergence. Simulated data with
SNR 5 and JumpStd 0.0. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
In all cases, the value of the estimated parameters at convergence was the true value, or
very close to it.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.7]. Effect of starting values on idealization convergence. Simulated data with
SNR 5 and JumpStd 0.1. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
In all cases, the value of the estimated parameters at convergence was the true value, or
very close to it.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.8]. Effect of starting values on idealization convergence. Simulated data with
SNR 5 and JumpStd 0.2. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
In all cases, the value of the estimated parameters at convergence was the true value, or
very close to it.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.9]. Effect of starting values on idealization convergence. Simulated data with
SNR 2 and JumpStd 0.0. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
This appears to be the lowest SNR worth analyzing.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration; panel title: SNR 2* (JumpStd 0.1, kv 0.05)]

Figure [5.4.2.1.10]. Effect of starting values on idealization convergence. Simulated data with
SNR 2 and JumpStd 0.1. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
The kv factor was 0.05, instead of 0.1 for the other experiments (equation [5.3.1.1.9]).
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B (true / estimated): JumpAmp 1.0 / 1.00011; JumpStd 0.1 / 0.0990693; MeasStd 0.1 / 0.0935188; BaseAmp 0.0 / 0.00143537; IDL_LL 16873.5888; JumpRate 0.25 / 0.243038 ± 0.016674 (StdErr); KIN_LL -11329.02195.

Figure [5.4.2.1.12]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 10 and JumpStd 0.1. A Convergence of estimated JumpAmp. B Estimated parameters
corresponding to the convergence point. C Idealization corresponding to the convergence
point.
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B (true / estimated): JumpAmp 1.0 / 1.0045; JumpStd 0.2 / 0.182314; MeasStd 0.1 / 0.0975591; BaseAmp 0.0 / 0.00143537; IDL_LL 15851.34326; JumpRate 0.25 / 0.242633 ± 0.016688 (StdErr); KIN_LL -11338.0933.

Figure [5.4.2.1.13]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 10 and JumpStd 0.2. A Convergence of estimated JumpAmp. B Estimated parameters
corresponding to the convergence point. C Idealization corresponding to the convergence
point.
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B (true / estimated): JumpAmp 1.0 / 0.999997; JumpStd 0.0 / 0.0889965; MeasStd 0.2 / 0.186533; BaseAmp 0.0 / 0.0046777; IDL_LL 3495.257889; JumpRate 0.25 / 0.243038 ± 0.016674 (StdErr); KIN_LL -11332.4876.

Figure [5.4.2.1.14]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 5 and JumpStd 0.0. A Convergence of estimated JumpAmp. B Estimated parameters
corresponding to the convergence point. C Idealization corresponding to the convergence
point.
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B (true / estimated): JumpAmp 1.0 / 0.999395; JumpStd 0.1 / 0.125505; MeasStd 0.2 / 0.188138; BaseAmp 0.0 / 0.00473288; IDL_LL -3604.803269; JumpRate 0.25 / 0.243984 ± 0.016641 (StdErr); KIN_LL -11358.96023.

Figure [5.4.2.1.15]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 5 and JumpStd 0.1. A Convergence of estimated JumpAmp. B Estimated parameters
corresponding to the convergence point. C Idealization corresponding to the convergence
point.
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B, best convergence point (true / estimated): JumpAmp 1.0 / 0.999827; JumpStd 0.2 / 0.195756; MeasStd 0.2 / 0.194197; BaseAmp 0.0 / 0.0122417; IDL_LL -4222.133304; JumpRate 0.25 / 0.244457 ± 0.016626 (StdErr); KIN_LL -11406.03537.

Figure [5.4.2.1.16]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 5 and JumpStd 0.2. Two close convergence points. A Convergence of estimated
JumpAmp. B Estimated parameters corresponding to the best convergence point. C
Idealization corresponding to the best convergence point.
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1 (0.01)). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B (true / estimated): JumpAmp 1.0 / 0.999261; JumpStd 0.0 / 0.2326; MeasStd 0.5 / 0.479004; BaseAmp 0.0 / 0.0176407; IDL_LL -28024.97234; JumpRate 0.25 / 0.241957 ± 0.016711 (StdErr); KIN_LL -11236.97433.

Figure [5.4.2.1.17]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 2 and JumpStd 0.0. A Convergence of estimated JumpAmp. B Estimated parameters
corresponding to the convergence point. C Idealization corresponding to the convergence
point.
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B, best convergence point (true / estimated): JumpAmp 1.0 / 0.993487; JumpStd 0.1 / 0.249441; MeasStd 0.5 / 0.480998; BaseAmp 0.0 / 0.0185922; IDL_LL -28094.42772; JumpRate 0.25 / 0.244254 ± 0.016632 (StdErr); KIN_LL -11327.69503.

Figure [5.4.2.1.18]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 2 and JumpStd 0.1. Two close convergence points. A Convergence of estimated
JumpAmp. B Estimated parameters corresponding to the best convergence point. C
Idealization corresponding to the best convergence point.
[Figure: convergence values of JumpAmp (A), forward rate (B) and MeasStd (C) vs. SNR (10, 7.07, 5, 3.5, 2.5), for JumpStd 0.0, 0.1 and 0.2]

Figure [5.4.2.1.19]. Compromise between measurement noise and jump order. Simulated data with different SNR, JumpStd 0.0, 0.1 and
0.2. A doubling in SNR corresponds to an average dwell duration (in data points) four times smaller. Likewise, a halving in SNR
corresponds to an average dwell duration four times larger. A Convergence value of JumpAmp at different SNR. B Convergence value
of the forward rate at different SNR. C Convergence value of MeasStd at different SNR. The thin black line is the true value. The thick red,
blue and green lines are the estimates from data with JumpStd 0.0, 0.1 and 0.2, respectively. If the simulated JumpStd is 0.0 or 0.1, the
estimates of JumpAmp and the forward rate are very accurate, regardless of SNR and average dwell duration. For simulated JumpStd 0.2,
the accuracy of the two estimates drops, especially at low SNR. As expected, the accuracy of the MeasStd estimate is better for longer
dwells, even though the SNR is low.
5.4.2.2 Idealization of experimental data
The experimental data were all generated by Dr. Selvin’s laboratory with an irreversible,
one state per position model. The files were recorded with a sampling time of 500 ms. The jump
amplitude varied, taking values of ~7.5 nm, ~13 nm and ~30 nm. Some files were generated with
equal steps, of ~ 8 points. Both the stochastic and the deterministic steps files correspond to a
forward rate of 0.25s-1. The experiment mimicked the characteristics of the signal obtained when
the fluorophore is positioned on the body of the kinesin molecule. The amplitude of each step was
meant to be constant. Nevertheless, for reasons explained previously, the jumps appear to have a
non-zero variance. There is no significant baseline drift visible.
Figures [5.4.2.2.1] through [5.4.2.2.18] show idealization and kinetics estimation results.
As with simulated data, JumpAmp has a wide attraction basin in the low noise records, which
narrows down and presents multiple maxima in high noise records. The kv parameter determines
the number of these maxima. A value of 0.1 induces multiple maxima, while a value of 0.5
often eliminates all but one. However, the remaining maximum is not necessarily the one that
the human eye would judge best. Overall, the idealization works even for the highest-noise data.
What seems to have the most negative effect is the jump variance. Visual inspection indicates a
non-Gaussian distribution of jumps, almost bimodal. Hopefully, data from true kinesin
experiments will be better!
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 30nm_1_1_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters at the convergence point (JumpAmp 2.847, JumpStd 0.166, MeasStd 0.300, BaseAmp -0.0069, IDL_LL -134.56; JumpRate true 0.25, estimated 0.2557 ± 0.1890, KIN_LL -89.06); C, idealized trace (scale: 2 pA, 10 s).]
Figure [5.4.2.2.1]. Idealization and kinetics estimation of experimental data. JumpAmp ~30
nm. A Convergence of estimated JumpAmp. B Estimated parameters corresponding to the
convergence point. C Idealization corresponding to the convergence point. Here and in the rest
of figures, the idealization(s) in C refer to the convergence trajectories represented by the blue
lines in A.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 13nm_1_1_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the two convergence points (JumpAmp 10.48 and 11.88); C, idealized traces (scale: 10 pA, 2 s).]
Figure [5.4.2.2.2]. Idealization and kinetics estimation of experimental data. JumpAmp ~13
nm. Two convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the two convergence points. C Idealizations corresponding to the
two convergence points. In the top trace, one jump is detected as second-order, and another
one is one point long, with a very small amplitude. The second convergence point should thus
be preferred.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 13nm_1_1_NoOffset.qdf (BaseStd 0.001, Kv 1); B, true vs. estimated parameters for the two convergence points (JumpAmp 10.54 and 11.88); C, idealized traces (scale: 10 pA, 2 s).]
Figure [5.4.2.2.3]. Idealization and kinetics estimation of experimental data. JumpAmp ~13
nm. Two convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the two convergence points. C Idealizations corresponding to the
two convergence points.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 13nm_R_1_2_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the two convergence points (JumpAmp 11.80 and 12.47); C, idealized traces (scale: 10 pA, 5 s).]
Figure [5.4.2.2.4]. Idealization and kinetics estimation of experimental data. JumpAmp ~13
nm, regular steps. Two convergence points. A Convergence of estimated JumpAmp. B
Estimated parameters corresponding to the two convergence points. C Idealizations
corresponding to the two convergence points. The first trace has a one-point long level, falsely
detected (encircled area, three levels before the end of trace). The second convergence point
should be preferred.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 13nm_R_1_2_NoOffset.qdf (BaseStd 0.001, Kv 0.5); B, true vs. estimated parameters at the single convergence point (JumpAmp 12.50); C, idealized trace (scale: 10 pA, 5 s).]
Figure [5.4.2.2.5]. Idealization and kinetics estimation of experimental data. JumpAmp ~13
nm, regular steps. Only one convergence point remains. A Convergence of estimated
JumpAmp. B Estimated parameters corresponding to the convergence point. C Idealizations
corresponding to the convergence point.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 13nm_R_1_3_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the two convergence points (JumpAmp 12.20 and 12.68); C, idealized traces (scale: 10 pA, 5 s).]
Figure [5.4.2.2.6]. Idealization and kinetics estimation of experimental data. JumpAmp ~13
nm, regular steps. Two convergence points. A Convergence of estimated JumpAmp. B
Estimated parameters corresponding to the two convergence points. C Idealizations
corresponding to the two convergence points. Both traces present idealization mistakes (each
dwell should be 8 points long). The bottom trace skips a level.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 13nm_R_1_3_NoOffset.qdf (BaseStd 0.001, Kv 0.5); B, true vs. estimated parameters for the three convergence points (JumpAmp 10.50, 11.06 and 12.19); C, idealized traces (scale: 5 pA, 5 s).]
Figure [5.4.2.2.7]. Idealization and kinetics estimation of experimental data. JumpAmp ~13
nm, regular steps. Three convergence points. A Convergence of estimated JumpAmp. B
Estimated parameters corresponding to the three convergence points. C Idealizations
corresponding to the three convergence points. All traces show idealization mistakes (each
dwell should be 8 points long).
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_1_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the three convergence points (JumpAmp 5.79, 7.84 and 8.26); C, idealized traces (scale: 5 pA, 5 s).]
Figure [5.4.2.2.8]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. Multiple convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the convergence points. C Idealizations corresponding to the
convergence points. All traces show idealization mistakes (there should be no downward
jumps).
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_1_NoOffset.qdf (BaseStd 0.001, Kv 0.5); B, true vs. estimated parameters for the three convergence points (JumpAmp 7.83, 8.26 and 8.42); C, idealized traces (scale: 5 pA, 5 s).]
Figure [5.4.2.2.9]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. Multiple convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the convergence points. C Idealizations corresponding to the
convergence points. All traces show idealization mistakes (there should be no downward
jumps).
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_2_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters at the single convergence point (JumpAmp 8.56); C, idealized trace (scale: 10 pA, 10 s).]
Figure [5.4.2.2.10]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. A Convergence of estimated JumpAmp. B Estimated parameters corresponding to the
convergence point. C Idealizations corresponding to the convergence point. The idealization
shows mistakes (there should be no downward jumps).
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_2_NoOffset.qdf (BaseStd 0.001, Kv 0.05); B, true vs. estimated parameters for the three convergence points (JumpAmp 6.65, 8.21 and 8.55); C, idealized traces (scale: 5 pA, 10 s).]
Figure [5.4.2.2.11]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. Multiple convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the convergence points. C Idealizations corresponding to the
convergence points. All traces show idealization mistakes (there should be no downward
jumps).
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_3_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the two convergence points (JumpAmp 7.65 and 8.39); C, idealized traces (scale: 5 pA, 10 s).]
Figure [5.4.2.2.12]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. Two convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the convergence points. C Idealizations corresponding to the
convergence points. Both traces show idealization mistakes (there should be no downward
jumps).
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_4_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the two convergence points (JumpAmp 7.68 and 8.08); C, idealized traces (scale: 5 pA, 2 s).]
Figure [5.4.2.2.13]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. Two convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the convergence points. C Idealizations corresponding to the
convergence points. Both traces show idealization mistakes.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_4_NoOffset.qdf (BaseStd 0.001, Kv 0.5); B, true vs. estimated parameters for the two convergence points (JumpAmp 7.68 and 8.08); C, idealized traces (scale: 5 pA, 2 s).]
Figure [5.4.2.2.14]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. Multiple convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the convergence points. C Idealizations corresponding to the
convergence points. Both traces show idealization mistakes.
[Figure panel: A, convergence of estimated JumpAmp vs. iteration for 8nm_R_1_1_NoOffset.qdf (BaseStd 0.001, Kv 0.1), showing multiple convergence points.]
Figure [5.4.2.2.15]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm, regular steps. Multiple convergence points. A Convergence of estimated JumpAmp.
Data, idealization and estimated parameters are not shown, since no choice can be made. The
value of kv (0.1) is too small in this case (see next figure) and the idealization is very sensitive
to initial conditions.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_R_1_1_NoOffset.qdf (BaseStd 0.001, Kv 0.5); B, true vs. estimated parameters for the three convergence points (JumpAmp 5.93, 6.13 and 6.44); C, idealized traces (scale: 2 pA, 10 s).]
Figure [5.4.2.2.16]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm, regular steps. Multiple convergence points. A Convergence of estimated JumpAmp. B
Estimated parameters corresponding to the convergence points. C Idealizations corresponding
to the convergence points. Both traces show idealization mistakes. Compared to the previous
figure, kv was increased to 0.5, and the idealization is less sensitive to initial conditions.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_R_1_2_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the two convergence points (JumpAmp 6.41 and 6.90); C, idealized traces (scale: 5 pA, 5 s).]
Figure [5.4.2.2.17]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm, regular steps. Multiple convergence points. A Convergence of estimated JumpAmp. B
Estimated parameters corresponding to the convergence points. C Idealizations corresponding
to the convergence points. Both traces show idealization mistakes. Bottom trace skips a level.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_R_1_2_NoOffset.qdf (BaseStd 0.001, Kv 0.5); B, true vs. estimated parameters at the single convergence point (JumpAmp 6.92); C, idealized trace (scale: 5 pA, 5 s).]
Figure [5.4.2.2.18]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm, regular steps. A single convergence point. A Convergence of estimated JumpAmp. B
Estimated parameters corresponding to the convergence point. C Idealizations corresponding
to the convergence point. The trace skips a level. Compared to the previous figure, a higher
value of kv (0.5), gives a single convergence point, irrespective of initial conditions.
5.5 Discussion
It is interesting to note how the idealization algorithm has stable maxima not only for the
true value of the jump amplitude but also for its submultiples of order 2, 3, 4, etc. For example, a
JumpAmp equal to 1/2 of the true value gives apparent kinetics twice as fast, 1/3 gives kinetics
three times as fast, and so on. Confusion can be avoided, however, by inspecting the re-estimated Atr matrix. Figure [5.5.1]
shows the structure of the matrices corresponding to the estimated JumpAmp equal to 1/1, 1/2
and 1/3 of the true jump amplitude. For 1/1, the transition probabilities show an exponential
decay. For 1/2, they are interspersed with zeroes, caused by lack of evidence in the data for
transitions between consecutive states, and so on. In fact, a non-smooth shape of the Atr matrix
might indicate convergence to a non-optimal solution.
[Plot: a[i,i+k] vs. jump order k (0 to 5), for JumpAmp equal to 1/1, 1/2 and 1/3 of the true value.]
Figure [5.5.1]. Structure of the Atr matrix. JumpAmp is equal to 1/1, 1/2, and 1/3 of the true
value. a[i,i+k] is the entry in the Atr matrix equal to the transition probability between state i
and state i+k, for any i, k>=0. Irreversible model with one kinetic state per unit.
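The smooth decay of a[i,i+k] for the correct JumpAmp can be reproduced directly: for an irreversible staircase model, the rows of Atr = exp(Qtr·dt) follow a Poisson law in the jump order k. A minimal sketch (the model size is arbitrary; the forward rate 0.25 s⁻¹ and sampling time 0.5 s are taken from the experimental files described earlier):

```python
import numpy as np
from scipy.linalg import expm

# Irreversible staircase model: forward rate 0.25/s, sampling time 0.5 s.
n, kf, dt = 12, 0.25, 0.5
Q = np.zeros((n, n))
for i in range(n - 1):
    Q[i, i + 1] = kf      # forward transition i -> i+1
    Q[i, i] = -kf         # diagonal: each row of Q sums to zero
A = expm(Q * dt)

a_k = A[0, :6]            # a[i, i+k] for k = 0..5 (same for every interior i)
# For the true JumpAmp this profile decays smoothly (Poisson with mean kf*dt);
# zeros interspersed in the profile would indicate a submultiple solution.
```

Here a_k[0] equals exp(-kf·dt), the probability of no jump within one sampling interval, and each higher jump order is progressively less likely.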
Only for reversible models can the Atr matrix be calculated from Qtr by spectral
decomposition. For irreversible models, Qtr has confluent eigenvalues. One must then use small,
non-zero backward rates, to force reversibility. Using a reversible Atr in the idealization
algorithm has the potential for false detection of downward jumps, when, in fact, such jumps are
known not to happen. The Atr matrix must therefore be prepared by zeroing the entries corresponding to
backward transitions and renormalizing each row. It is a property of the Forward-Backward
procedure that zero entries in the A matrix stay zero throughout the re-estimation. I have tested
both reversible and irreversible Atr matrices, with mixed results (not shown). An irreversible Atr
avoids, indeed, false detection of downward jumps. On the other hand, with high noise data, it
may occasionally result in less accurate parameters. If it is known a priori that only upward
jumps can occur, either an irreversible Atr must be used, or the false downward jumps detected
must be somehow eliminated.
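The preparation step described above might be sketched as follows; this is an illustrative implementation, not the thesis code, and the function name and the eps value are assumptions:

```python
import numpy as np
from scipy.linalg import expm

def prepare_irreversible_Atr(Qtr, dt, eps=1e-6):
    """Force reversibility with small, non-zero backward rates (so spectral methods
    apply), compute Atr = exp(Q*dt), then zero the backward entries and
    renormalize each row."""
    Q = Qtr.astype(float).copy()
    n = Q.shape[0]
    for i in range(1, n):
        if Q[i, i - 1] == 0.0:
            Q[i, i - 1] = eps            # small backward rate forcing reversibility
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # rows of Q must sum to zero
    A = expm(Q * dt)
    A = np.triu(A)                       # zero entries for backward transitions
    return A / A.sum(axis=1, keepdims=True)   # row-wise renormalization
```

Because zero entries in A stay zero throughout Forward-Backward re-estimation, the zeroed backward entries remain zero during idealization.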
When estimating kinetics for irreversible models, or when the analyzed data lacks jumps
in one direction, the model can be simplified. Thus, either the left or the right block states can be
eliminated, since those states will never be occupied in the likelihood chain [5.3.2.1.1]. The Atr
truncation error will likely be negligible.
Although it was not my objective here, further work is required to determine
parameter identifiability. In other words, how many free parameters can be estimated when the
model has all states aggregated in one class? Clearly, at least two parameters can be estimated
from reversible data, because there is actually evidence for each: the upward and the downward
jumps.
6. Ensemble hidden Markov models. Applications to kinetic analysis of
whole-cell recordings
Ion channel kinetics is traditionally estimated from macroscopic currents by the method
of fitting exponentials. The macroscopic current relaxes towards equilibrium as a weighted sum
of NS-1 exponentials, where NS is the number of states in the model (Colquhoun and Hawkes,
1977). The time constants of the exponentials are the inverses of the Q matrix eigenvalues, and
the weighting factors are functions of the Q matrix eigenvectors. From the optimized time
constants and weighting factors, it is not easy to obtain the corresponding Q matrix (and thus the
rate constants), and, therefore, it is tempting to use single channel algorithms.
An ensemble of independent, identical Markov models is yet another Markov model
(Horn and Lange, 1983), hence single channel algorithms can be easily extended to multiple
channels (see, for example, Qin et al, 1996), but are practically useless when the channel count is
greater than e.g. 10. NC individual channels, each a Markov model, give a superposition model
(meta-model) with N S
NC
states (meta-states). For example, an ensemble with 10 two-state
channels has 1024 meta-states, but an ensemble of 100 channels has ~1E30 meta-states! The
average whole-cell patch may have a few orders of magnitude more channels. Clearly, these
numbers are not amenable to treatment by single channel algorithms.
I propose here an approximate maximum likelihood method that estimates directly the
rate constants, using hidden Markov models. Optionally, the method can estimate simultaneously
the rate constants and the number of channels in the patch. This is, to my knowledge, a novel
procedure. I refer to this new algorithm as Mac (Macroscopic currents optimizer). The principle
of the method is to model the observed current at any time t, It, as a stochastic quantity,
approximated by a Gaussian. The mean and variance of this Gaussian are functions of
conductance parameters, starting probabilities, rate constants and channel count. The critical
assumption is that there is no correlation between any two data points. In other words, the state
probabilities at any time t are a function of the initial probabilities only.
The user must provide a kinetic model, conductance parameters (i.e. single channel
current amplitude and noise for each conductance class), and initial estimates for rate constants
and channel count. The initial state probabilities can be user-input, or can be calculated by the
program when conditioning-test data are analyzed. Hence, the ion channel population is assumed
to be driven to equilibrium under the conditioning experimental conditions, and to relax towards a
new equilibrium under the test conditions. Thus, the state probabilities at the beginning of test can
be calculated as the equilibrium occupancies of conditioning. In principle, any kind of
experimental protocol can be analyzed, i.e. a sequence of two or more rectangular pulses, with
defined duration and value (ligand concentration, voltage etc). As with the MIP algorithm
described in Chapter 3, linear constraints on rate constants can be imposed, thus reducing the
number of free parameters. The algorithm provides standard errors of the estimates.
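The equilibrium occupancies used as initial state probabilities can be computed by solving Peq·Q = 0 together with the normalization constraint that the elements of Peq sum to 1, e.g. as an augmented least-squares system. A minimal sketch with a hypothetical three-state Q:

```python
import numpy as np

def equilibrium(Q):
    """Equilibrium state probabilities Peq, satisfying Peq @ Q = 0 and sum(Peq) = 1,
    solved as an augmented least-squares system."""
    ns = Q.shape[0]
    A = np.vstack([Q.T, np.ones(ns)])   # Q^T Peq = 0, plus normalization row
    b = np.zeros(ns + 1)
    b[-1] = 1.0
    Peq, *_ = np.linalg.lstsq(A, b, rcond=None)
    return Peq

# Hypothetical generator matrix (rates in 1/s); rows sum to zero.
Q = np.array([[-500.0, 500.0, 0.0],
              [200.0, -300.0, 100.0],
              [0.0, 50.0, -50.0]])
Peq = equilibrium(Q)
```

For this reversible chain, detailed balance gives Peq proportional to (1, 2.5, 5), i.e. Peq[0] = 2/17.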
6.1 Ensemble hidden Markov model
I consider here an ensemble of NC identical, independent (non-interacting) ion channels.
The single channel model has NS states. At any time t, each of the NC channels is in one of the NS
states. Let Nti be the number of channels in state i, at time t. I define the state partition vector, Xt,
as follows:
xti = Nti / NC   [6.1.1]

Σi xti = (1/NC) · Σi Nti = 1   [6.1.2]
The element i of the state partition vector, xti, is equal to the fraction of channels in state i at time
t. The elements of Xt sum to 1. Xt takes discrete values, but it is continuous in the limit case of an
infinite number of channels, when it is equivalent to the state probability vector of a single
channel, Pt:
X  Pt
[6.1.3]
t
NC 
xti  Pstatet  i 
[6.1.4]
The equations describing the time dependence of an ensemble of molecules have
the same form as the equations for a single molecule, as mentioned in Chapter 2.1.1. It follows
that the state partition vector Xt can be calculated from the initial state probability vector, as in
equation [2.1.1.32]:
Xt  X0  expQ  t 
T
T
[6.1.5]
6.1.1 Conductance properties of an ensemble of ion channels
The signal and noise components in a single channel patch clamp record were presented
in Figure [4.3.1], and discussed in Chapter 4.3. In the following, I assume the signal and noise to
have characteristics constant in time. Furthermore, all Gaussian noise is assumed to be non-correlated (white).
The current given by a single channel when in state i can be modeled as a random
Gaussian variable with mean μi and variance Vyi. μi and Vyi are, respectively, the mean of the
excess current and the variance of the excess noise of state i. By definition, all states in the closed
conductance class have zero excess current and zero excess noise. Let μ and Vy be the vectors of
dimension NS, with elements μi and Vyi as defined before. The baseline current is another random
Gaussian variable with mean μ0 and variance Vy0. Since a sum of random Gaussian variables is
another random Gaussian variable, the current given at time t by an ensemble of ion channels has
a Gaussian distribution, as follows:

It ~ N(μt, Vty)   [6.1.1.1]

μt = μ0 + Σi=1..NS Nti · μi   [6.1.1.2]

Vty = Vy0 + Σi=1..NS Nti · Vyi   [6.1.1.3]

μt = μ0 + NC · Σi=1..NS xti · μi   [6.1.1.4]

Vty = Vy0 + NC · Σi=1..NS xti · Vyi   [6.1.1.5]
The mean t and the variance and Vty are functions of the state partition vector (Xt), channel
count (NC), baseline current (0, Vy0) and single channel current mean and variance vectors ( and
Vy). Equations [6.1.1.4] and [6.1.1.5] can be written in a more compact, vectorial form, as
follows:
 t   0  N C  Xt  μ
[6.1.1.6]
Vt y  V y 0  N C  Xt  V
[6.1.1.7]
T
T
250
Equations [6.1.1.6] and [6.1.1.7] give the mean and the variance of the current recorded at time t,
from an ensemble of ion channels, given a state partition vector Xt.
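Equations [6.1.1.6] and [6.1.1.7] might be evaluated as follows; the conductance values mirror the simulations of section 6.3 (open-state excess current 1 pA, excess noise SD ~0.2236, baseline SD 0.2 pA, NC = 1000), and current_moments is a hypothetical helper name:

```python
import numpy as np

# Hypothetical 3-state model with one open state (state 3).
mu = np.array([0.0, 0.0, 1.0])          # excess current vector (pA)
Vy = np.array([0.0, 0.0, 0.2236 ** 2])  # excess noise variance vector
mu0, Vy0 = 0.0, 0.2 ** 2                # baseline mean and variance
NC = 1000

def current_moments(Xt):
    """Mean and variance of the ensemble current, equations [6.1.1.6]-[6.1.1.7]."""
    return mu0 + NC * Xt @ mu, Vy0 + NC * Xt @ Vy

m, v = current_moments(np.array([0.2, 0.3, 0.5]))  # half the channels open
```

Note that [6.1.1.7] contains only the baseline and excess noise; the state-fluctuation term is added in section 6.1.2.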
6.1.2 State fluctuations of an ensemble of ion channels
As mentioned previously, the state partition vector Xt can be calculated (predicted) from
a known initial state probability vector X0, with equation [6.1.5]. However, the predicted Xt is the
mean of a multinomial probability distribution, as discussed in Chapter 2.1.1 (see equation
[2.1.1.42] and the following), and the actual state vector Xt is expected to have fluctuations
around the mean. The covariance matrix of these multinomial fluctuations, VtMN, is a function of
the number of channels and the expected (average) state partition vector:
VtMN = NC · Vtx   [6.1.2.1]

Vtx is a matrix, the elements of which have the following properties:

Vtx[i,i] = xti · (1 - xti)
Vtx[i,j] = -xti · xtj, for i ≠ j   [6.1.2.2]
It follows that the current recorded at time t has two sources of variance: baseline noise
and state excess noise (equation [6.1.1.7]), and state fluctuations around the mean (equation
[6.1.2.1]). If the multinomial distribution is approximated with a Gaussian (the approximation is
good if NC·xti >> 1), all variance has Gaussian sources. Therefore, the current It can be
approximately described with a Gaussian distribution, conditional on a given multinomial
distribution of the expected state vector Xt, as follows:
μt = μ0 + NC · XtT · μ   [6.1.2.3]

Vt = Vy0 + NC · (μT · Vtx · μ + XtT · Vy)   [6.1.2.4]

NC · μT · Vtx · μ is the linear transformation of the VtMN covariance matrix from the NS x NS state
space into the unidimensional measurement space.
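Equations [6.1.2.3] and [6.1.2.4] can be sketched in the same way; note that when all channels are certainly in one state, Vtx vanishes and only the measurement noise remains (total_moments is a hypothetical helper name):

```python
import numpy as np

def total_moments(Xt, mu, Vy, mu0, Vy0, NC):
    """Mean and total variance of the current It, equations [6.1.2.3]-[6.1.2.4]:
    baseline + excess noise, plus multinomial state fluctuations mapped from the
    state space into the measurement space."""
    Vx = np.diag(Xt) - np.outer(Xt, Xt)   # elements as in equation [6.1.2.2]
    mean = mu0 + NC * Xt @ mu
    var = Vy0 + NC * (mu @ Vx @ mu + Xt @ Vy)
    return mean, var

# With all channels surely in the (noiseless) closed state, Vx = 0 and Xt @ Vy = 0,
# so the total variance reduces to the baseline variance alone.
m, v = total_moments(np.array([1.0, 0.0, 0.0]),
                     np.array([0.0, 0.0, 1.0]),
                     np.array([0.0, 0.0, 0.05]),
                     mu0=0.0, Vy0=0.04, NC=1000)
```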
6.2 Algorithm for ensemble hidden Markov model
As previously mentioned, the algorithm relies on the assumption that the state vector has
no temporal correlation. In other words, the state at any time t is assumed to be a function of the
initial state only. For a given record, this is certainly not true, as the state at time t is always a
function of the previous state at time t-1 (it is a Markov process), regardless of the number of
channels. Determined by the properties of Markov processes, a fit line that minimizes the sum of
square errors may underestimate the variance around the mean. Since a lower variance can be
explained, for example, by a lower NC (see equation [6.1.2.1]), the kinetics and the channel count
estimates will be biased.
The temporal correlation can be reduced if several data sets are analyzed together. At the
limit, the correlation disappears for an infinite number of data sets. In each data set, the state at
time t is still a function of the previous state, but, over all data sets, it is (approximately) a
function of the initial state only. The requirement that several data sets be analyzed together is
easily accommodated by suitable experimental design. In fact, it is always desirable to estimate
parameters from multiple data sets.
6.2.1 Likelihood function
The likelihood function to be optimized is the probability of observing the data sequence
Y, given the model parameters :
L  pY | θ
[6.2.1.1]
The data sequence Y is the recorded current. Since, as discussed, there is no temporal correlation
between data points, the probability of observing the whole data sequence Y is equal to the
product of the individual observation probabilities:
L = Π(t=0..T) p(yt | θ)   [6.2.1.2]
The objective is to find the parameter set θML which maximizes the likelihood:

θML = argmaxθ { Π(t=0..T) p(yt | θ) }   [6.2.1.3]
The likelihood may be an extremely small or an extremely large number. It is therefore preferable to maximize the
logarithm of the likelihood function (the logarithm of a function has its maximum at the same
point as the function itself):
LL = Σ(t=0..T) ln p(yt | θ)   [6.2.1.4]
θML = argmaxθ { Σ(t=0..T) ln p(yt | θ) }   [6.2.1.5]
Since the observation probability p(yt|θ) is a function of time, I will write it as pt(yt|θ).
The observation probability pt(yt|θ) is a Gaussian:

pt(yt | θ) = N(μt, Vt)   [6.2.1.6]

The mean μt and variance Vt are as discussed (equations [6.1.2.3] and [6.1.2.4]).
6.2.2 Parameterization and constraints
I use the same parameterization and constraints as in Chapter 3.1.2. If the number of
channels NC is optimized, it must be constrained to remain positive. I define the
variable u as:

u = ln NC   [6.2.2.1]
Obviously, a lower limit of the channel count can be obtained directly from the data,
by dividing the peak of the observed signal by the single channel amplitude. In this case, the
parameterization should be modified as follows:
u  ln N C
[6.2.2.2]
N C  N Cmin  N C
[6.2.2.3]
254
The estimated channel count, NC, is the sum of a minimum count, NCmin, which is constant,
and an excess count, NC+, which is variable and optimized. In the present work, I did not use a
lower limit, yet the algorithm performed satisfactorily. However, this issue needs to be further
explored.
6.2.3 Computation and derivatives of the likelihood function
All the necessary quantities for the likelihood function are now available. The
observation probability in expression [6.2.1.6], its natural logarithm, and the derivative of the
logarithm with respect to a variable z are the following:

pt(yt | θ) = (1 / sqrt(2π · Vt)) · exp(-(yt - μt)² / (2 · Vt))   [6.2.3.1]

ln pt(yt | θ) = -(1/2) · ln(2π) - (1/2) · ln Vt - (yt - μt)² / (2 · Vt)   [6.2.3.2]

∂ln pt(yt | θ)/∂z = -(1/(2·Vt)) · ∂Vt/∂z + ((yt - μt)/Vt) · ∂μt/∂z + ((yt - μt)²/(2·Vt²)) · ∂Vt/∂z   [6.2.3.3]
The log-likelihood function in expression [6.2.1.4] and its derivative with respect to a
variable z are:
T 1
1 T
1 T y  t 
ln2    lnVt    t
2
2 t 0
2 t 0
Vt
2
LL  
255
[6.2.3.4]
∂LL/∂z = -(1/2) · Σt (1/Vt) · ∂Vt/∂z + Σt ((yt - μt)/Vt) · ∂μt/∂z + (1/2) · Σt ((yt - μt)²/Vt²) · ∂Vt/∂z   [6.2.3.5]
The mean and variance of the observation probability, μt and Vt, are calculated with
expressions [6.1.2.3] and [6.1.2.4]. In the following, I show how to calculate their derivatives,
together with other auxiliary quantities and their derivatives with respect to a variable z:
t N C
X
T

 Xt  μ  N C  t  μ
z
z
z
T


T


Vt N C
V x
Xt
T

 μT  Vtx  μ  Xt  V  N C   μT  t  μ 
 V 
z
z
z
z


[6.2.3.6]
[6.2.3.7]
Xt  X0  At
[6.2.3.8]
At  expQ  t 
[6.2.3.9]
Xt
X0
T A

 At  X0  t
z
z
z
[6.2.3.10]
T
T
T
T
At is calculated from Q using the spectral decomposition technique, indicated in Chapter
3. The derivatives of At and Xt with respect to a variable z (a macroscopic rate qij) are also given
in Chapter 3. The derivative of the Vtx matrix is another matrix with elements:

∂Vtx[i,i]/∂z = (1 - 2·xti) · ∂xti/∂z   [6.2.3.11]

∂Vtx[i,j]/∂z = -((∂xti/∂z) · xtj + xti · (∂xtj/∂z))   [6.2.3.12]
One may notice that the log-likelihood function in expression [6.2.3.4] reduces to a sum
of square errors if the constant terms and the variance-containing terms are eliminated:

SS = (1/2) · Σ(t=0..T) (yt - μt)²   [6.2.3.13]
6.3 Results
I tested the algorithm with simulated data, generated by the three-state model in Figure
[6.3.1]. The model has up to five free parameters to be optimized: four rate constants k12, k21, k23
and k32 and the channel count NC. The standard deviation of the baseline noise is 0.2 pA. The
baseline is 0 pA and it is constant. The excess noise of the open conductance class has a standard
deviation of ~0.2236. The excess amplitude of the open class is 1 pA.
[Model diagram: three states in a chain, 1 ⇄ 2 ⇄ 3, with rates 500.0 and 200.0 s⁻¹ between states 1 and 2, and 100.0 and 50.0 s⁻¹ between states 2 and 3.]
Figure [6.3.1]. Model used for simulation.
6.3.1 Bias of kinetics estimation
The most important characteristic of an optimization algorithm is the bias of the estimate.
I implemented the algorithm with the option of using as cost function either the likelihood in
expression [6.2.3.4], or the sum of square errors in [6.2.3.13]. Due to the various approximations
made, it seems possible that the likelihood version may present a certain bias. The sum of squares
version, however, is expected to have no bias at all. Obviously, without the variance-containing
terms, it is not possible to estimate channel count.
To determine if kinetics estimation is biased, a theoretically infinite amount of data
should be analyzed. Instead, I emulate the infinite data prerequisite with a deterministic
simulation, which has neither measurement noise, nor channel fluctuations. The parameters of the
simulation are given in Table [6.3.1]. The kinetics was estimated with the same model as in
Figure [6.3.1], with the true conductance parameters and channel count, and the rate constants
initialized at their true values. For comparison, I also generated stochastically simulated data:
1000 data sets were simulated and averaged, and the rate constants were estimated from the average.
PtsPerSeg | SegCnt | Sampling [kHz] | Channel Count | Starting State
4000      | 1      | 50             | 1000          | 1
Table [6.3.1]. Simulation parameters. PtsPerSeg is the number of points in a data segment, and
SegCnt is the number of segments.
The results in Figure [6.3.2] show that the two cost functions give practically the same
accurate estimates, hence the algorithm appears to be unbiased when the kinetics alone is
estimated.
Figure [6.3.3] gives a visual measure of the variance around the fitted line. Panel A
shows a single data set, and panel B shows 10 overlapped data sets, fitted together. The fit line is
obtained with the log-likelihood cost function. The variance of data around the fitted curve is
visibly smaller in panel A than in B. Since the estimation of channel count is based on measuring
the variance around the conditional mean, an underestimated variance may lead to an
underestimated channel count.
Figures [6.3.4] and [6.3.5] show the distribution of the estimate for each rate constant,
when finite amounts of data are analyzed. 100 stochastically simulated data sets are generated
with the parameters in Table [6.3.1], and optimized individually for rate constants (channel count
is fixed at the true value). For either the log-likelihood or the sum of squares, the average rates are
very close to the true values, and have a Gaussian distribution.
6.3.2 Simultaneous estimation of kinetics and channel count
With the properties of the kinetics estimation established, I tested the simultaneous
estimation of kinetics and channel count. The log-likelihood function is used in all runs where the
channel count is a free parameter. 100 stochastically simulated data sets are generated with the
parameters in Table [6.3.1], and optimized individually for rate constants and channel count.
Figure [6.3.6] shows the results. As expected, the channel count cannot be reliably obtained from
a single data set and is clearly underestimated. All optimized parameters are extremely scattered,
especially k21. Whereas channel count is underestimated, k12 is overestimated, and the two are
strongly correlated.
When several data sets are used together, the estimation improves considerably. Figure
[6.3.7] shows results for 1000 data sets simulated as before and optimized in groups of 10. The
scatter, as well as the bias, of the estimates is much smaller. There is strong negative correlation
between channel count and k12, and positive correlation between channel count and k21 and k23.
Obviously, the peak current can be explained either by faster rates leading to the open states or by
a higher number of channels, but not both. Optimizing even more data sets together gives
even better results (not shown), in terms of scatter and bias.
The next experiment determines the range of channel count within which the algorithm
can give reliable estimates. 1000 data sets were simulated as before and optimized in groups of
10. The channel count was 10, 100, 1000 and 10000, respectively. Figure [6.3.8] shows the
optimization results. The algorithm can optimize the channel count over a wide range, but gives
better estimates for larger values. 10 channels is certainly an extreme case that is better
approached with single channel methods. As expected, the scatter of the estimates is large for 10
channels, but becomes quite small for 1000 or 10000 channels.
6.3.3 Effect of conductance parameters
The tests were so far conducted using the true values of the conductance parameters. In
practice, these are known only approximately, and thus it is important to determine how sensitive
the algorithm is with respect to them. Figure [6.3.9] shows the effects the standard deviation of
the baseline noise and of the open class excess noise have on the kinetics and channel count
estimates. It appears that the baseline standard deviation has practically no effect (panel A).
Evidently, the noise of a record with 1000 channels has only a one-in-a-thousand contribution
from the baseline. The open class noise (defined as baseline noise plus open class excess noise) has a more
serious effect. The graph in Figure [6.3.9], panel B, indicates that underestimation of open class
noise is slightly better than overestimation.
The individual channel amplitude (excess conductance of open class) is tied with the
channel count parameter (results not shown). The same current can be the result of a larger
channel count and a smaller single channel amplitude or, alternatively, of a smaller channel count
and larger amplitude. The two situations can be distinguished, to some extent, from the variance.
That is, of course, if the other conductance parameters are known. Further work is necessary to
determine whether the optimizer can simultaneously estimate kinetics, channel count and (some
of) the conductance parameters.
6.3.4 Effect of channel count
Figure [6.3.10] shows the effect of the channel count parameter (not optimized) upon
kinetics estimation, when the single channel amplitude is known. Whatever the optimization
method, different values for channel count give different sets of rate constants. 10 data sets are
optimized together. Panel A shows the rate estimates using the log-likelihood cost function. For
approximately +/-30% variation in channel count, some rates span almost an order of magnitude.
In practice, the error in guessing the number of channels in the patch could be even larger.
Panel B shows the curvature of the log-likelihood function with respect to the channel count
value. The function is smooth and does have a single maximum, although the peak area is quite
shallow.
If the sum of squares cost function is used instead of the log-likelihood, exactly the same
fit (sum of square errors) can be obtained with different values of channel count and rate
constants. This implies that identical trajectories can be generated by different sets of rate
constants, given different values of the channel count. As mentioned previously, the relaxation of
the macroscopic current is a sum of weighted exponentials. Clearly, identical trajectories – up to a
scaling factor – can be generated by identical sets of eigenvalues, and different sets of
eigenvectors, that can be given by suitably different Q matrices. The scaling factor can be
accounted for by a larger or a smaller channel count.
Figure [6.3.11] is an example of two Q matrices that have identical eigenvalues but
different eigenvectors, and that, when associated with a channel count of 10000 (true value) and
8000, respectively, give identical fit lines (and thus identical sum of squares).
6.3.5 Conditioning-test experiments and estimation of starting probabilities
Rate constants can be estimated from macroscopic non-equilibrium data only, since
infinitely many different combinations of rate constants give the same equilibrium current. The
ion channels must therefore start with a non-equilibrium state occupancy and relax towards
equilibrium. Hence, the starting state probabilities are a critical parameter of the algorithm. In the
experiments described so far, the simulation started with all channels in the C1 state. The initial
state probabilities were P(C1) = 1, P(O) = 0, P(C2) = 0, while the equilibrium probabilities are 0.118, 0.294
and 0.588. Experimentally, the channels can be forced into a non-equilibrium distribution only if
one or several rates are dependent on external, controllable factors, such as ligand concentration,
voltage etc. This can be accomplished with a conditioning-test experiment, as presented in Figure
[6.3.12]. Let us assume that the model used so far has the k12 rate ligand dependent. The
macroscopic rate k12 is a function of a pre-exponential, constant factor k120 multiplied with the
ligand concentration:
k12 = k120 · [Ligand]
[6.3.1]
The pre-exponential factor k120 is optimized, together with the other rate constants and
channel count. A conditioning pulse applies the ligand at a given concentration, for long enough
to allow the channels to reach equilibrium. In Figure [6.3.12], the conditioning concentration is
0.1 (concentration units are ignored, as the rate constants and the concentrations scale
together), thus making the macroscopic rate k12 equal to 50. The conditioning equilibrium state
probabilities are 0.571, 0.143 and 0.286. After the conditioning pulse, a test pulse of a different
ligand concentration is applied. In Figure [6.3.12] the test ligand concentration is 1. The
macroscopic rate k12 becomes 500, and the equilibrium probabilities are 0.118, 0.294 and 0.588.
The parameters are estimated from the test part of the recorded data.
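The equilibrium occupancies quoted here can be reproduced from detailed balance, which holds for a linear three-state chain. A small sketch (hypothetical helper function; rates taken from Figure [6.3.12]):

```python
def equilibrium(k12, k21, k23, k32):
    """Equilibrium occupancies of a linear chain 1 <-> 2 <-> 3, from
    detailed balance: p1*k12 = p2*k21 and p2*k23 = p3*k32."""
    w1 = 1.0
    w2 = w1 * k12 / k21
    w3 = w2 * k23 / k32
    s = w1 + w2 + w3
    return (w1 / s, w2 / s, w3 / s)

k120 = 500.0
# Conditioning pulse: [Ligand] = 0.1, so k12 = 50.
print([round(p, 3) for p in equilibrium(k120 * 0.1, 200.0, 100.0, 50.0)])
# -> [0.571, 0.143, 0.286]
# Test pulse: [Ligand] = 1.0, so k12 = 500.
print([round(p, 3) for p in equilibrium(k120 * 1.0, 200.0, 100.0, 50.0)])
# -> [0.118, 0.294, 0.588]
```

For models with loops, detailed balance does not apply in general and the equilibrium vector must instead be obtained as the null vector of Qᵀ, as described in Chapter 3.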
Obviously, since the kinetics is not known, the initial state probabilities are not known
either. However, they can be calculated as the equilibrium probabilities of the conditioning
experimental conditions, using the formalism described in Chapter 3. Figure [6.3.13] shows that
the rates and channel count estimated from the conditioning-test experiment actually show less
scatter than those obtained with the starting probabilities fixed (and known). The results in the
figure demonstrate that the algorithm is capable of optimizing data from conditioning-test
experiments.
6.3.6 Estimating channel count
The last experiment shows how Mac can be used to estimate the channel count from
equilibrium data, regardless of kinetics (Figure [6.3.14]). The length of the simulated data is
10000 points (instead of 4000 so far), thus giving more data at equilibrium. The rate constants of
a simplified C-O model are estimated simultaneously with the channel count. The contribution to
the log-likelihood function of the first data points (black section in the figure) is ignored, and only
the equilibrium data (the red section) is used. The effect is twofold. First, the starting probabilities
can be chosen arbitrarily, as the fitted portion of the data would have reached equilibrium
anyway. Second, a simple, two-state model can be used, since the relaxation towards equilibrium
is contained in the non-equilibrium section of the data, which is ignored.
From 10 data sets optimized together, the algorithm finds any pair of rates and a channel
count that best explain the average and the variance of the current at equilibrium. As mentioned
before, the kinetics is undetermined with equilibrium data. The only limitation imposed on the
estimated rates is that they allow reaching equilibrium during the skipped part. Panels A and B in
Figure [6.3.14] show that the channel count estimate is accurate, with moderate scatter. Using
more than 10 data sets together is likely to reduce the scatter and increase the accuracy (although
the simplified model is not necessarily an unbiased estimator). In order to eliminate the
shortcoming of rate estimates varying wildly and causing numerical problems, I tested the method
with one rate fixed and the other one a free parameter. The only restriction is, again, that the fixed
rate in combination with a variable rate allows the channel to reach equilibrium in time. With the
forward rate thus fixed, the backward rate has to be estimated such as to explain the average
current, and it cannot vary too much. Panels C and D in Figure [6.3.14] show the results to be
quasi-identical, but with slightly less scatter.
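The estimator behind this experiment can be illustrated directly. At equilibrium, mean = NC·i·p and variance = NC·i²·p(1−p), so NC = mean²/(i·mean − var). The following is a sketch on synthetic Gaussian data (an approximation to the binomial channel fluctuations, with measurement noise ignored; in practice the baseline and excess noise variances would first be subtracted), not the Mac implementation:

```python
import random
import statistics

def estimate_channel_count(samples, i):
    """Noise-analysis estimate of channel count NC from equilibrium data:
    mean = NC*i*p and var = NC*i**2*p*(1-p) imply NC = mean**2/(i*mean - var)."""
    m = statistics.fmean(samples)
    v = statistics.variance(samples)
    return m * m / (i * m - v)

# Synthetic equilibrium record: 1000 channels, 1 pA unitary current,
# open probability 0.294; binomial fluctuations approximated as Gaussian.
random.seed(1)
nc, i, p = 1000, 1.0, 0.294
mean = nc * i * p
var = nc * i * i * p * (1 - p)
samples = [random.gauss(mean, var ** 0.5) for _ in range(20000)]
print(round(estimate_channel_count(samples, i)))   # close to 1000
```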
[Fitted three-state models with estimated rate constants (k12, k21, k23, k32):
A (log-likelihood, deterministic): 499.45, 199.41, 100.1, 49.97.
B (sum of squares, deterministic): 499.67, 200.85, 99.75, 49.8.
C (log-likelihood, averaged stochastic): 500.01, 200.0, 100.0, 50.0.
D (sum of squares, averaged stochastic): 500.24, 201.49, 99.64, 49.83.]
Figure [6.3.2]. Bias of kinetics estimation. Simulated data sampled at 50kHz, channel count 1000,
4000 points per trace. A, B Rate constants optimized from deterministic simulation. C, D Rate
constants optimized from average of 1000 data sets, generated as stochastic simulations. A, C The
likelihood function used as cost function. B, D The sum of squares used as cost function.
[A: one simulated data set with true and fitted lines. B: 10 overlapped data sets with true and
fitted lines. Scale bars: 50 pA, 5 ms.]
Figure [6.3.3]. Simultaneous rate constants and channel count estimation. Sampling 50kHz, channel count 1000, 4000 points per trace. A
One data set optimized alone. B 10 data sets optimized together. The 10 data sets are drawn on top of each other. Also overlapped are the
deterministic simulation with the true parameters and the fitted curves for each case.
[A: logarithmic scatter plot of K12, K21, K23, K32 estimates across 100 data sets. Histograms:
B k12, Avg 512.2, Std 34.71. C k21, Avg 211.7, Std 59.45. D k23, Avg 98.64, Std 13.29.
E k32, Avg 49.27, Std 6.306.]
Figure [6.3.4]. Rate constants estimation using the sum of squares. Stochastic simulation. Data sets
optimized separately. Sampling 50kHz, channel count 1000, 4000 points per trace. A Logarithmic
scatter plot of estimates. B k12. C k21. D k23. E k32.
[A: logarithmic scatter plot of K12, K21, K23, K32 estimates across 100 data sets. Histograms:
B k12, Avg 515.1, Std 34.34. C k21, Avg 215.2, Std 59.17. D k23, Avg 98.51, Std 13.53.
E k32, Avg 49.22, Std 6.562.]
Figure [6.3.5]. Rate constants estimation using the log-likelihood function. Stochastic simulation.
Data sets optimized separately. Sampling 50kHz, channel count 1000, 4000 points per trace. A
Logarithmic scatter plot of estimates. B k12. C k21. D k23. E k32.
[A: logarithmic scatter plot of channel count and rate estimates across 100 data sets. Histograms:
B channel count, Avg 720.9, Std 42.89. C k12, Avg 757.4, Std 89.63. D k21, Avg 13.46, Std 127.1.
E k23, Avg 69.45, Std 12.44. F k32, Avg 49.13, Std 6.324.]
Figure [6.3.6]. Simultaneous rate constants and channel count estimation. Each data set optimized
separately. Sampling 50kHz, channel count 1000, 4000 points per trace. A Logarithmic scatter plot of
estimates. B Channel count. C k12. D k21. E k23. F k32.
[A: logarithmic scatter plot of channel count and rate estimates across 100 groups. Histograms:
B channel count, Avg 926.2, Std 113.2. C k12, Avg 558.5, Std 74.36. D k21, Avg 178.4, Std 73.23.
E k23, Avg 91.5, Std 11.38. F k32, Avg 49.74, Std 1.894.]
Figure [6.3.7]. Simultaneous rate constants and channel count estimation. 10 data sets optimized
together. Sampling 50kHz, channel count 1000, 4000 points per trace. A Logarithmic scatter plot of
estimates. B Channel count. C k12. D k21. E k23. F k32.
[A-D: logarithmic scatter plots of channel count and rate estimates, for true channel counts of
10 (A), 100 (B), 1000 (C) and 10000 (D).]
Figure [6.3.8]. Simultaneous rate constants and channel count estimation. 10 data sets optimized together. Sampling 50 kHz, 4000 points per trace. A
Channel count 10. B Channel count 100. C Channel count 1000. D Channel count 10000.
[A: estimates vs assumed baseline noise std (0.0-1.0 pA). B: estimates vs assumed open class
noise std (0.20-0.50 pA).]
Figure [6.3.9]. Effect of conductance parameters. Different values of the baseline and of the open
class noise standard deviation. Simultaneous rate constants and channel count estimation. 10 data sets
optimized together. Sampling 50kHz, channel count 1000, 4000 points per trace. A Logarithmic plot
of estimates vs baseline noise. B Logarithmic plot of estimates vs open class noise. The open class
noise is defined as baseline plus open class excess noise. The dotted vertical lines mark the true
values.
[A: logarithmic plot of rate estimates vs assumed channel count (600-1600). B: log-likelihood vs
assumed channel count, a smooth curve with a single, shallow maximum (LL range approx.
-192000 to -180000).]
Figure [6.3.10]. Effect of channel count parameter on rate constants estimation. 10 data sets
optimized together. Sampling 50kHz, channel count 1000, 4000 points per trace. A Logarithmic plot
of rate estimates vs channel count. B Plot of the log-likelihood vs channel count.
A. 10000 channels (true value)

Q10000 = | -495.262    495.262      0       |
         |  193.385   -294.521    101.136   |
         |    0         49.9512   -49.9512  |

eigenvalues: 0.0, -116.88, -722.85
eigenvectors: (-0.6710, -0.6710, -0.6710), (-0.7951, -0.6074, 0.4533), (0.9082, -0.4174, 0.3098)

B. 8000 channels

Q8000 = | -619.078    619.078      0       |
        |   91.5716  -170.705     79.1337  |
        |    0         49.9512   -49.9512  |

eigenvalues: 0.0, -116.88, -722.85
eigenvectors: (-0.6878, -0.6878, -0.6878), (-0.8733, -0.7084, 0.5287), (0.9862, -0.1653, 0.0123)
Figure [6.3.11]. Identical trajectories can be accounted for by different Q matrices with different
channel counts. 10 traces of stochastically simulated data, with 10000 channels, were generated and
optimized together for rate constants, using the sum of squares cost function. Channel count was
fixed at either 10000 (panel A), or 8000 (panel B). The two obtained Q matrices have identical
eigenvalues but different eigenvectors. When coupled with the corresponding channel count (10000
or 8000), the two Q matrices give identical fit lines, thus identical sum of squares.
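That the two Q matrices in Figure [6.3.11] share eigenvalues can be verified without an eigensolver: two 3×3 matrices have the same eigenvalues exactly when the coefficients of their characteristic polynomials (trace, sum of principal 2×2 minors, determinant) agree. A sketch, with tolerances that allow for the rounding of the printed entries:

```python
Q10000 = [[-495.262,   495.262,    0.0],
          [ 193.385,  -294.521,  101.136],
          [   0.0,      49.9512, -49.9512]]

Q8000 = [[-619.078,   619.078,    0.0],
         [  91.5716, -170.705,   79.1337],
         [   0.0,      49.9512, -49.9512]]

def char_poly_invariants(Q):
    """Characteristic polynomial coefficients of a 3x3 matrix:
    trace, sum of principal 2x2 minors, and determinant."""
    (a, b, c), (d, e, f), (g, h, i) = Q
    tr = a + e + i
    m2 = (a * e - b * d) + (a * i - c * g) + (e * i - f * h)
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return tr, m2, det

t1, m1, d1 = char_poly_invariants(Q10000)
t2, m2, d2 = char_poly_invariants(Q8000)
print(t1, t2)   # equal traces (= sum of eigenvalues)
assert abs(t1 - t2) < 1e-6
assert abs(m1 - m2) < 1e-3 * abs(m1)
# Zero row sums force a zero eigenvalue, so both determinants are ~0.
assert abs(d1) < 1.0 and abs(d2) < 1e3
```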
[A: conditioning model, [Ligand] = 0.1 (k12 = 50.0, k21 = 200.0, k23 = 100.0, k32 = 50.0);
equilibrium state probabilities 0.571, 0.143, 0.286.
B: test model, [Ligand] = 1.0 (k12 = 500.0, k21 = 200.0, k23 = 100.0, k32 = 50.0); equilibrium
state probabilities 0.118, 0.294, 0.588.
C: stimulus waveform (conditioning pulse, then test pulse, over the current baseline) and 10
overlapped responses; scale bars 50 pA, 5 ms.
D: model with ligand-dependent rate k12 = k120 · [Ligand], k120 = 500.]
Figure [6.3.12]. Conditioning-test experiment. The test initial state probabilities are calculated as the equilibrium probabilities of the
conditioning experimental conditions. 10 data sets optimized together. Sampling 50kHz, channel count 1000, 4000 points per trace. A
Conditioning model and equilibrium probabilities. B Test model and equilibrium probabilities. C Stimulus waveform and response. 10 data
sets overlapped. D Macroscopic rate k12 is a function of the microscopic rate k120 and the ligand concentration. k120 is optimized.
[A: logarithmic scatter plot of channel count and rate estimates across 100 groups. Histograms:
B channel count, Avg 982.4, Std 48.88. C k12, Avg 522.9, Std 44.51. D k21, Avg 193, Std 17.72.
E k23, Avg 94.53, Std 13.49. F k32, Avg 48.24, Std 3.883.]
Figure [6.3.13]. Conditioning-Test experiment. The test initial state probabilities are calculated as
the equilibrium probabilities of the conditioning experimental conditions. 10 data sets optimized
together. Sampling 50kHz, channel count 1000, 4000 points per trace. A Logarithmic scatter plot of
estimates. B Channel count. C k12. D k21. E k23. F k32.
[A, C: scatter plots of channel count estimates across 100 groups of data sets. Histograms:
B channel count, Avg 941.2, Std 151.3 (both rates free); D channel count, Avg 943.3, Std 142.6
(k12 fixed). E: two-state model 1 <-> 2 with rates 100.0 and 50.0. F: example trace with skipped
(black) and fitted (red) sections; scale bars 100 pA, 10 ms.]
Figure [6.3.14]. Channel count estimation with a simplified model. 10 data sets optimized together. Sampling 50kHz, channel count 1000,
10000 points per trace. A, B Scatter plot and statistics of channel count estimated with model in E. C, D Scatter plot and statistics of
channel count estimated with model in E, with k12 fixed. E Simplified two-state model, starting in the open state. F The optimizer skips the
first 50 ms in each trace (black section) and fits the remainder (red section).
6.4 Discussion
The method proposed here has several advantages over existing methods, the most
important being that the algorithm directly estimates rate constants rather than time constants.
Furthermore, the algorithm can simultaneously estimate the kinetics and the number of channels,
which is a novel procedure. If linear relations between rate constants are known to exist, then
linear constraints can be imposed on estimation, and the number of free parameters is thus
reduced.
The kinetics estimation alone is not biased. However, when the channel count is
optimized too, the bias, although small, depends on how many records from the same patch are
available. A single record can result in up to 30% bias in the channel count, and similar bias in the rates.
With ten records, the bias is down to about 5%. Also, the bias depends on how many channels are
in the patch. With 10-100 channels, the bias is relatively large, whereas with channel count above
1000 the bias is very small. Alternatively, the channel count alone can be estimated with a
simplified model from equilibrium data.
The algorithm directly estimates rate constants. Their identifiability is, however,
determined by the time constants of the multi-exponential relaxation. If a component of the
multi-exponential is very fast relative to the sampling time, it will likely not be detected.
Similarly, if a component is too slow relative to the data length, it will not be detected either.
Furthermore, the weight of a component must be significant. Further work is required to clarify
these matters.
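As a concrete check for the simulated model (assuming the eigenvalues quoted in Figure [6.3.11]), the relaxation time constants τ = −1/λ can be compared with the 20 µs sampling interval and the 80 ms record (4000 points at 50 kHz):

```python
# Nonzero eigenvalues of the simulated model's Q matrix, in s^-1.
eigenvalues = [-116.88, -722.85]
dt_ms = 1000.0 / 50000          # sampling interval: 0.02 ms (20 us)
record_ms = 4000 * dt_ms        # record length: 80 ms
for lam in eigenvalues:
    tau_ms = -1000.0 / lam      # relaxation time constant in ms
    # Report tau, tau in sampling intervals, and time constants per record.
    print(round(tau_ms, 2), round(tau_ms / dt_ms, 1), round(record_ms / tau_ms, 1))
```

Both components here are tens to hundreds of sampling intervals long and complete several time constants within each record, which is consistent with the accurate rate estimates reported above.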
I have successfully tested Mac with other models (results not shown). Also, the
optimization was initialized with parameters other than the true values (results not shown). The
algorithm is relatively robust to initial values, but less so than single channel algorithms. Further
work is necessary to establish the behavior of the algorithm with experimental data, which was
not available to me at this time.
References
Aitkin, M. and I. Aitkin. 1996. A hybrid EM/Gauss-Newton algorithm for maximum likelihood
in mixture distributions. Statistics and Computing. 6:127-130.
Albertsen, A. and U. P. Hansen. 1994. Estimation of kinetic rate constants from multi-channel
recordings by a direct fit of the time series. Biophys. J. 67:1393-1403.
Balakrishnan, A. V. 1984. Kalman Filtering Theory. University Series in Modern Engineering.
Optimization Software, New York.
Ball, F. G., and J. A. Rice. 1989. A note on single-channel autocorrelation functions. Math.
Biosci. 97:17-26.
Ball, F. G., and J. A. Rice. 1992. Stochastic models for ion channels: introduction and
bibliography. Math. Biosci. 112:189-206.
Ball, F. G., and M. S. P. Sansom. 1988. Aggregated Markov processes incorporating time interval
omission. Adv. Appl. Prob. 20:546 –572.
Ball, F. G., and M. S. P. Sansom. 1989. Ion-channel gating mechanisms: model identification and
parameter estimation from single channel recordings. Proc. R. Soc. Lond. B. 236:385– 416.
Ball, F. G., G. F. Yeo, R. K. Milne, R. O. Edeson, B. W. Madsen, M. S. P. Sansom. 1993. Single
ion channel models incorporating aggregation and time interval omission. Biophys. J. 64:357-374.
Ball, F. G., Y. Cai, J. B. Kadane and A. O’Hagan. 1999. Bayesian inference for ion channel
gating mechanisms directly from single channel recordings, using Markov chain Monte Carlo.
Proc. R. Soc. Lond. A. 455:2879-2932.
Baum, L. E., T. Petrie, G. Soules, and N. Weiss. 1970. A maximization technique occurring in the
statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41:164 –171.
Baumgartner, W., H. Schindler and C. Romanin. 1998. Removing non-random artifacts from
patch clamp traces. J. Neurosci. Meth. 82:175-186.
Bayes, T. 1763. An Essay Towards Solving a Problem in the Doctrine of Chances. Phil. Trans.
Roy. Soc. 53:370-418.
Berger, J. 1985. Statistical Decision Theory and Bayesian Analysis (2nd edition), New York:
Springer-Verlag.
Billio, M., A. Monfort and C. P. Robert. 1998. Bayesian estimation of switching ARMA models.
Technical report, CREST, INSEE, Paris.
Blatz, A. L. and K. L. Magleby. 1986. Correcting single channel data for missed events. Biophys.
J. 49:967-980.
Block, S. M., L. S. B. Goldstein and B. J. Schnapp. 1990. Bead movement by single kinesin
molecules studied with optical tweezers. Nature 348:348-352.
Box, G and G. Tiao. 1973. Bayesian Inference in Statistical Analysis, Reading: Addison-Wesley.
Boyen, X. and D. Koller. 1998. Tractable inference for complex stochastic processes. Proc. of the
14th Ann. Conf. on Uncertainty in AI (UAI), Madison, Wisconsin, pp33-42.
Carter, C. K. and R. Kohn. 1996. Markov chain Monte Carlo in conditionally Gaussian state
space models. Australian Graduate School of Management, University of New South Wales.
Chung, S. H., J. B. Moore, L. G. Xia, L. S. Premkumar, and P. W. Gage. 1990. Characterization
of single channel currents using digital signal processing techniques based on hidden Markov
models. Proc. R. Soc. Lond. B. 329:265–285.
Chung, S. H., V. Krishnamurthy, and J. B. Moore. 1991. Adaptive processing techniques based
on hidden Markov models for characterizing very small channel currents buried in noise and
deterministic interferences. Proc. R. Soc. Lond. B. 334:357–384.
Colquhoun, D., A. G. Hawkes and K. Skrodzinski. 1996. Joint distributions of apparent open and
shut times of single-ion channels and maximum likelihood fitting of mechanisms. Phil. Trans. R.
Soc. Lond. A. 354:2555-2590.
Colquhoun, D., and A. G. Hawkes. 1977. Relaxation and fluctuations of membrane currents that
flow through drug-operated channels. Proc. R. Soc. Lond. B.199:231-262.
Colquhoun, D., and A. G. Hawkes. 1981. On the stochastic properties of single ion channels.
Proc. R. Soc. Lond. B. 211:205–235.
Colquhoun, D., and A. G. Hawkes. 1987. A note on correlations in single channel records. Proc.
R. Soc. Lond. B. 230:15–52.
Colquhoun, D., and A. G. Hawkes. 1995. The principles of the stochastic interpretation of ion
channel mechanisms. In Single channel recording (2nd ed). B. Sakmann and E. Neher, editors.
Plenum Press, New York. ???-???.
Colquhoun, D., and F. J. Sigworth. 1995a. A Q-Matrix cookbook. In Single channel recording
(2nd ed). B. Sakmann and E. Neher, editors. Plenum Press, New York. 589–633.
Colquhoun, D., and F. J. Sigworth. 1995b. Fitting and statistical analysis of single-channel
records. In Single channel recording (2nd ed). B. Sakmann and E. Neher, editors. Plenum Press,
New York. 483-585.
Cox, D. R., and H. D. Miller. 1965. The Theory of Stochastic Processes. Methuen, London.
Crouzy, S. C. and F. J. Sigworth. 1990. Yet another approach to the dwell-time omission problem
of single-channel analysis. Biophys. J. 58:731-743.
De Gunst, M. C. M., H. R. Künsch and J. G. Schouten. 2001. Statistical analysis of ion channel
data using hidden Markov models with correlated state-dependent noise and filtering. J. Amer.
Stat. Assoc. 96:805-815.
Dempster, A. P., N. M. Laird and D. B. Rubin. 1977. Maximum likelihood from incomplete data
via the EM algorithm (with discussion). J. R. Stat. Soc. B. 39:1-38.
Doucet, A. and C. Andrieu. 1999. Iterative Algorithms for Optimal State Estimation of Jump
Markov Linear Systems. Proc. Conf. IEEE ICASSP.
Draber, S. and R. Schultze. 1994. Detection of jumps in single-channel data containing
subconductance levels. Biophys. J. 67:1404-1413.
Elliott, R. J., L. Aggoun and J. B. Moore. 1995. Hidden Markov models: Estimation and control.
New York: Springer-Verlag.
Fessler, J. A. and A. O. Hero. 1994. Space-Alternating Generalized Expectation-Maximization
Algorithm. IEEE Trans. Sign. Proc. 42(10):2664-2677.
Fletcher, R. 1981. Practical Methods of Optimization. John Wiley & Sons, Chichester.
Fletcher, R. and M. J. D. Powell. 1963. A rapidly convergent descent method for minimization.
Computing Journal. 2:163–168.
Forney, G. D. 1973. The Viterbi algorithm. Proc. IEEE. 61:268-278.
Fredkin, R. F. and J. A. Rice. 1986. On aggregated Markov processes. J. Appl. Prob. 23:208-214.
Frey, B. J. and D. J. MacKay. 1997. A revolution: Belief propagation in graphs with cycles. NIPS
10.
Gauvain, J.-L. and C.-H. Lee. 1992. MAP Estimation of Continuous Density HMM: Theory and
Applications. Proc. DARPA Speech & Nat. Lang., Morgan Kaufmann.
Gelman, A., J. B. Carlin, H. S. Stern and D. B. Rubin. 1995. Bayesian Data Analysis, London:
Chapman and Hall.
Ghahramani, Z. 2001. An Introduction to Hidden Markov Models and Bayesian Networks.
International Journal of Pattern Recognition and Artificial Intelligence. 15(1):9-42.
Ghahramani, Z. and G. Hinton. 1996. Switching state-space models. Technical Report CRG-TR-96-3, Dept. Comp. Sci., Univ. Toronto.
Goldstein, L. S. B. and A. V. Philp. 1999. The Road less traveled: Emerging principles of kinesin
motor utilization. Ann. Rev. Cell Dev. Biol. 15:141-183.
Hawkes, A. G., A. Jalali and D. Colquhoun. 1990. The distribution of the apparent open times
and shut times in a single channel record when brief events cannot be detected. Phil. Trans. R.
Soc. Lond. A. 332:511–538.
Hille, B. 1992. Ionic Channels of Excitable Membranes. Sinauer Associates, Inc., Sunderland, 2nd
edition.
Horn, R., and K. Lange. 1983. Estimating kinetic constants from single channel data. Biophys. J.
43:207–223.
Howard, J., A. J. Hudspeth and R. D. Vale. 1989. Movement of microtubules by single kinesin
molecules. Nature. 342:154-158.
Jackson, M. B. 1997. Inversion of Markov processes to determine rate constants from single-channel data. Biophys. J. 73:1382-1394.
Jamshidian, M. and R. I. Jennrich. 1993. Conjugate gradient acceleration of the EM algorithm. J.
Am. Stat. Assoc. 88:221-228.
Jeney, S., F. Sachs, A. Pralle, E. L. Florin and H. J. K. Horber. 2000. Statistical analysis of
kinesin kinetics by applying methods from single channel recordings. Biophys. J. 78:717Pos.
Jensen, F. V. 1996. Introduction to Bayesian Networks. Springer-Verlag, New York.
Jordan, M. I. (ed.) 1998. Learning in Graphical Models, Cambridge: MIT Press.
Juang, B. H. and L. R. Rabiner. 1991. Hidden Markov models for speech recognition.
Technometrics. 33:251-272.
Kalman, R. E. 1960. A new approach to linear filtering and prediction problems. Trans. Am. Soc.
Mech. Eng., Series D, Journal of Basic Engineering. 82:35–45.
Kalman, R. E. and R. S. Bucy. 1961. New results in linear filtering and prediction theory. Trans.
Am. Soc. Mech. Eng., Series D, Journal of Basic Engineering. 83:95–108.
Keleshian, A. M., G. F. Yeo, R. O. Edeson and B. W. Madsen. 1994. Superposition properties of
interacting ion channels. Biophy. J. 67:634-640.
Kemeny, J. G., J. L. Snell and A. W. Knapp. 1966. Denumerable Markov Chains. D. Van Nostrand
Company.
Kim, C.-J. 1994. Dynamic linear models with Markov-switching. J. of Econometrics. 60:1-22.
Klein, S., J. Timmer and J Honerkamp. 1997. Analysis of multichannel patch clamp recordings
by hidden Markov models. Biometrics. 53:870-884.
Koller, D., U. Lerner and D. Angelov. 1999. A general algorithm for approximate inference and
its application to hybrid Bayes nets. Uncertainty in AI. pp324-333.
Lange, K. 1995. A quasi-Newton acceleration of the EM algorithm. Statistica Sinica. 5:1-18.
Lauritzen, S. L. 1992. Propagation of probabilities, means and variances in mixed graphical
association models. J American Statistical Association. 87:1098–1108.
Lauritzen, S. L. 1996. Graphical models, London: Oxford University Press.
Little, R. J. A. and D. B. Rubin. 1987. Statistical analysis with missing data. NewYork: Wiley.
Liu, C. and D. B. Rubin. 1994. The ECME algorithm: A simple extension of EM and ECM with
faster monotone convergence. Biometrika. 81:633-648.
Liu, C., D. B. Rubin and Y. N. Wu. 1998. Parameter expansion to accelerate EM: The PX-EM
algorithm. Biometrika. 85:755-70.
Logothetis, A. and V. Krishnamurthy. 1999. Expectation Maximization Algorithms for MAP
Estimation of Jump Markov Linear Systems. IEEE Trans. Sign. Proc. 47(8):2139-2156.
Marschner, I. C. 2001. On stochastic versions of the EM algorithm. Biometrika. 88:281-286.
McLachlan, G. J. and T. Krishnan. 1997. The EM Algorithm and Extensions. Wiley series in
probability and statistics. John Wiley & Sons, Inc.
Meilijson, I. 1989. A fast improvement of the EM algorithm on its own terms. J. Royal Stat. Soc.
B. 51(1):127-138.
Michalek, S. and J. Timmer. 1999. Estimating rate constants in hidden Markov models by the
EM-algorithm. IEEE Trans. Sign. Proc. 47:226-228.
Minka, T. P. 2001. A family of algorithms for approximate Bayesian inference. PhD thesis.
Massachusetts Institute of Technology. www.media.mit.edu/~tpminka/.
Moler, C. B. and C.F. Van Loan. 1978. Nineteen dubious ways to compute the exponential of a
matrix. SIAM Rev. 20:801-836.
Murphy, K., Y. Weiss and M. Jordan. 1999. Loopy-belief propagation for approximate inference:
An empirical study. Uncertainty in AI.
Najfeld, I. and T. F. Havel. 1995. Derivatives of the matrix exponential and their computation.
Advances in Applied Mathematics. 16:321-375.
Okada, Y. and N. Hirokawa. 1999. A Processive Single-Headed Motor: Kinesin Superfamily
Protein KIF1A. Science. 283:1152-1157.
Opper, M. and O. Winther. 1999. A Bayesian approach to on-line learning. On-Line Learning in
Neural Networks. Cambridge University Press.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 1992. Numerical recipes in
C. Cambridge University Press, Cambridge.
Qin, F., A. Auerbach, and F. Sachs. 1996. Estimating single channel kinetic parameters from
idealized patch-clamp data containing missed events. Biophys. J. 70:264 –280.
Qin, F., A. Auerbach, and F. Sachs. 1997. Maximum likelihood estimation of aggregated Markov
processes. Proc. R. Soc. Lond. B. 264:375–383.
Qin, F., A. Auerbach, and F. Sachs. 2000a. A direct optimization approach to hidden Markov
modeling for signal channel kinetics. Biophys. J. 79:1915–1927.
Qin, F., A. Auerbach, and F. Sachs. 2000b. Hidden Markov Modeling for Single Channel
Kinetics with Filtering and Correlated Noise. Biophys. J. 79:1928–1944.
Rabiner, L. R. 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition. Proceedings of the IEEE. 77(2):257-286.
Rabiner, L. R. and B. H. Juang. 1986. An introduction to hidden Markov models. IEEE Ac. Sp.
Sign. Proc. January, 4-16.
Rabiner, L. R., B. H. Juang, S. E. Levinson and M. M. Sondhi. 1986. Recognition of isolated
digits using hidden Markov models with continuous mixture densities. AT&T Tech. J.
64(6):1211–1222.
Rauch, H. E. 1963. Solutions to the linear smoothing problem. IEEE Transactions on Automatic
Control. 8:371–372.
Rauch, H. E., F. Tung and C. T. Striebel. 1965. Maximum likelihood estimates of linear dynamic
systems. AIAA Journal. 3(8):1445–1450.
Rice, S., A. W. Lin, D. Safer, C. L. Hart, N. Naber, B. O. Carragher, S. M. Cain, E. Pechatnikova,
E. M. Wilson-Kubalek, M. Whittaker, E. Pate, R. Cooke, E. W. Taylor, R. A. Milligan and R. D.
Vale. 1999. A structural change in the kinesin motor protein that drives motility. Nature.
402(6763):778-784.
Rosales, R., J. A. Stark, W. J. Fitzgerald and S. B. Hladky. 2001. Bayesian Restoration of Ion
Channel Records using Hidden Markov Models. Biophys. J. 80:1088-1103.
Roux, B., and R. Sauve. 1985. A general solution to the time interval omission problem applied to
single channel analysis. Biophys. J. 48:149 –158.
Roweis, S and Z. Ghahramani. 1999. A Unifying Review of Linear Gaussian Models. Neural
Computation. 11:305–345.
Sachs, F., J. Neil and N. Barkakati. 1982. The automated analysis of data from single ionic
channels. Pflügers Arch. 395:331-340.
Sakmann, B. and E. Neher. 1992. The patch clamp technique. Scientific American, March.
Sakmann, B. and E. Neher. 1995. Single channel recording (2nd ed), Plenum Press, New York.
Shachter, R. 1990. A linear approximation method for probabilistic inference. Uncertainty in AI.
Shumway, R. and D. Stoffer. 1991. Dynamic linear models with switching. J. of the Am. Stat.
Assoc. 86:763-769.
Silberberg, S. D., A. Lagrutta, J. P. Adelman and K. L. Magleby. 1996. Wanderlust kinetics and
variable Ca2+ sensitivity of dSlo, a large conductance Ca2+-activated K+ channel, expressed in
oocytes. Biophys. J. 70:2640-2651.
Sorensen, H. W. (ed). 1985. Kalman filtering: theory and application. IEEE.
Svoboda, K., C. F. Schmidt, B. J. Schnapp and S. M. Block. 1993. Direct observation of kinesin
stepping by optical trapping interferometry. Nature 365:721-727.
Swope, S. L., S. I. Moss, L. A. Raymond and R. L. Huganir. 1999. Regulation of ligand-gated ion
channels by protein phosphorylation. Adv. in Second Messenger & Phosphoprotein Res. 33:49-78.
Tugnait, J. K. 1982. Adaptive estimation and identification for discrete systems with Markov
jump parameters. IEEE Trans. Automat. Contr. AC-27:1054–1065.
Vale, R. D. and R. A. Milligan. 2000. The Way Things Move: Looking Under the Hood of
Molecular Motor Proteins. Science. 288:88-95.
Venkataramanan, L., J. L. Walsh, R. Kuc, and F. J. Sigworth. 1998. Identification of hidden
Markov models for ion channel currents. I. Colored background noise. IEEE Trans. Signal
Processing. 46:1901–1915.
Venkataramanan, L., R. Kuc, and F. J. Sigworth. 1998. Identification of hidden Markov models
for ion channel currents. II. State-dependent excess noise. IEEE Trans. Signal Processing.
46:1916 –1929.
Venkataramanan, L., R. Kuc, and F. J. Sigworth. 2000. Identification of Hidden Markov Models
for Ion Channel Currents. III. Bandlimited, Sampled Data. IEEE Trans. Signal Processing.
49:376-385.
Visscher, K., M. J. Schnitzer and S. M. Block. 1999. Single kinesin molecules studied with a
molecular force clamp. Nature 400:184-189.
Viterbi, A. J. 1967. Error bounds for convolutional codes and an asymptotically optimal decoding
algorithm. IEEE Trans. Informat. Theory. IT-13:260-269.
Wagner, M., S. Michalek and J. Timmer. 1999. Estimating transition rates in aggregated Markov
models of ion-channel gating with loops and with nearly equal dwell times. Proc. R. Soc. Lond.
B. 266:1919-1926.
West, M. and J. Harrison. 1997. Bayesian Forecasting and Dynamic Models (2nd Edn), New York:
Springer-Verlag.
Whittaker, J. 1990. Graphical models in applied multivariate statistics. Chichester, England:
Wiley.
Wu, C. F. J. 1983. On the convergence properties of the EM algorithm. The Annals of Statistics.
11(1):95-103.
Yedidia, J. S., W. T. Freeman and Y. Weiss. 2000. Generalized belief propagation (Technical
Report TR2000-26). MERL. www.merl.com/reports/TR2000-26/index.html.
Yeo, G. F., R. K. Milne, R. O. Edeson and B. W. Madsen. 1988. Superposition properties of
independent ion channels. Proc. R. Soc. Lond. B. 238:255-270.