Applications of Hidden Markov Models
to
Single Molecule and Ensemble Data Analysis
by
Lorin S. Milescu
03/27/03
A thesis submitted to the
Faculty of the Graduate School of State
University of New York at Buffalo
in partial fulfillment of the requirements for the
degree of
Doctor of Philosophy
Department of Physiology and Biophysics
Copyright by
Lorin S. Milescu
2003
Acknowledgements
Table of Contents
1. Introduction .................................................................................................................................. 8
2. Background ................................................................................................................................ 15
2.1 Dynamic state space models ................................................................................................ 16
2.1.1 Hidden Markov model ................................................................................................. 18
2.1.2 Linear Gaussian model (LGM) .................................................................................... 29
2.2 Bayesian inference............................................................................................................... 31
2.2.1 Likelihood calculation .................................................................................................. 33
2.2.2 State inference .............................................................................................................. 36
2.2.3 Parameter estimation .................................................................................................... 36
2.3 Basic algorithms .................................................................................................................. 38
2.3.1. Likelihood calculation ................................................................................................. 39
2.3.1.1 Forward-Backward procedure for HMM .............................................................. 39
2.3.2. Most likely sequence of states ..................................................................................... 42
2.3.2.1 Viterbi algorithm for HMM .................................................................................. 42
2.3.2.2 Forward-Viterbi algorithm for aggregated HMM ................................................ 44
2.3.2.3 Kalman filter-smoother for LGM ......................................................................... 45
2.3.3 Parameter estimation .................................................................................................... 47
2.3.3.1 Expectation-Maximization .................................................................................... 49
2.3.3.2 Baum-Welch algorithm for HMM ........................................................................ 51
2.3.3.3 Segmental k-means algorithm for HMM .............................................................. 53
2.3.3.4 Expectation-maximization algorithm for linear Gaussian models ........................ 55
3. A maximum likelihood method for kinetic parameter estimation from single channel data ..... 56
3.1 Algorithm ............................................................................................................................ 60
3.1.1 Parameterization and constraints .................................................................................. 60
3.1.2 Likelihood function ...................................................................................................... 63
3.1.3 Computation and derivatives of the likelihood function .............................................. 66
3.1.4 Speeding up the computation ....................................................................................... 75
3.2 Results ................................................................................................................................. 77
3.3 Discussion............................................................................................................................ 78
4. Signal detection and baseline correction of single channel records ........................................... 79
4.1 Hidden Markov models with stochastic parameters or with mixed states ........................... 80
4.2 Baseline linear Gaussian model ........................................................................................... 81
4.3 Algorithms for baseline correction ...................................................................................... 88
4.3.1 Viterbi-Kalman algorithm ............................................................................................ 92
4.3.2. Forward-Backward-Kalman algorithm ....................................................................... 99
4.3.2.1 Initialization ........................................................................................................ 102
4.3.2.2 Expectation ......................................................................................................... 105
4.3.2.3 Maximization ...................................................................................................... 105
4.4 Results ............................................................................................................................... 106
4.4.1 Idealization of simulated data without baseline artifacts ........................................... 111
4.4.2 Idealization and baseline correction of simulated data with stochastic baseline
fluctuations .......................................................................................................................... 119
4.4.3 Idealization and baseline correction of simulated data contaminated with sinewave
artifacts ................................................................................................................................ 137
4.4.4 Idealization and baseline correction of experimental data ......................................... 158
4.5 Discussion.......................................................................................................................... 163
5. Infinite state-space hidden Markov models. Application to single molecule fluorescence data
from kinesin ................................................................................................................................. 166
5.1 Infinite state-space hidden Markov model ........................................................................ 170
5.2 Kinesin stepping model ..................................................................................................... 175
5.3 Algorithms for staircase data ............................................................................................. 175
5.3.1 Position estimation – data idealization ....................................................................... 176
5.3.1.1 Modified Forward-Backward-Kalman procedure for staircase data ................... 178
5.3.1.2 Re-estimation ...................................................................................................... 182
5.3.2 Kinetic parameters estimation .................................................................................... 185
5.3.2.1 Parameterization and constraints ........................................................................ 186
5.3.2.2 Likelihood function............................................................................................. 187
5.3.2.3 Computation and derivatives of the likelihood function ..................................... 189
5.3.2.4 Speeding up the computation .............................................................................. 190
5.4 Results ............................................................................................................................... 191
5.4.1 Kinetics estimation of simulated data ........................................................................ 192
5.4.1.1 Bias of the estimate ............................................................................................. 193
5.4.1.2 Precision of the estimate ..................................................................................... 195
5.4.1.3 Effect of jump truncation .................................................................................... 195
5.4.1.4 Effect of A matrix truncation order .................................................................... 197
5.4.1.5 Effect of backward rate ....................................................................................... 198
5.4.1.6 Log-likelihood surface ........................................................................................ 199
5.4.1.7 Reversible model ................................................................................................ 200
5.4.2 Data idealization......................................................................................................... 202
5.4.2.1 Idealization of simulated data ............................................................................. 203
5.4.2.2 Idealization of experimental data ........................................................................ 226
5.5 Discussion.......................................................................................................................... 245
6. Ensemble hidden Markov models. Applications to kinetic analysis of whole-cell recordings 247
6.1 Ensemble hidden Markov model ....................................................................................... 248
6.1.1 Conductance properties of an ensemble of ion channels............................................ 249
6.1.2 State fluctuations of an ensemble of ion channels ..................................................... 251
6.2 Algorithm for ensemble hidden Markov model ................................................................ 252
6.2.1 Likelihood function .................................................................................................... 253
6.2.2 Parameterization and constraints ................................................................................ 254
6.2.3 Computation and derivatives of the likelihood function ............................................ 255
6.3 Results ............................................................................................................................... 257
6.3.1 Bias of kinetics estimation ......................................................................................... 257
6.3.2 Simultaneous estimation of kinetics and channel count ............................................. 259
6.3.3 Effect of conductance parameters .............................................................................. 260
6.3.4 Effect of channel count .............................................................................................. 260
6.3.5 Conditioning-test experiments and estimation of starting probabilities ..................... 261
6.3.6 Estimating channel count ........................................................................................... 263
6.4 Discussion.......................................................................................................................... 278
References .................................................................................................................................... 279
1. Introduction
Recording data from individual molecules started in the late 1970s, with the advent of the
single channel patch clamp technique (Sakmann and Neher, 1992, 1995). Ionic fluxes through
individual ion channel molecules in a cellular membrane are recorded as currents in the pA range.
The development of the technique was accompanied by the development of stochastic models
(Colquhoun and Hawkes, 1977, 1981; Ball and Rice, 1992 and references therein; Colquhoun and
Hawkes, 1995 and references therein) and statistical methods of analysis (Horn and Lange, 1983;
Colquhoun and Sigworth, 1995b and references therein). Starting in the late 1980s, aggregated
hidden Markov models were applied to single channel data analysis (Chung et al, 1990, 1991). In
the following years, Markov model-based algorithms became the state-of-the-art analysis
tools of single ion channel research (Albertsen and Hansen, 1994; Ball et al, 1999; Colquhoun et
al, 1996; De Gunst et al, 2000; Jackson, 1997; Klein et al, 1997; Michalek and Timmer, 1999;
Qin et al, 1996, 1997, 2000a, b; Venkataramaran et al, 1998a, b, 2000; Wagner et al, 1999).
A Markov model (Elliott et al, 1995) describes a dynamic process characterized by a
finite number of discrete states. Only instantaneous transitions between states are possible, with
the probability of transition proportional to a rate constant. When the state of the process cannot
be observed or measured directly, but through another noisy variable, the Markov model is called
hidden (Rabiner and Juang, 1986; Juang and Rabiner, 1991). States with identical observation
properties are called aggregated. The lifetime of a state has an exponential probability
distribution, while the lifetime of a group of aggregated states has a multi-exponential distribution
(Ball and Rice, 1989; Fredkin and Rice, 1986; Colquhoun and Hawkes, 1987, 1995; Hawkes et
al, 1990).
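The exponential lifetime property can be made concrete with a minimal simulation sketch (not from the original text; the two-state model and its rates are hypothetical):

```python
import random

# Hypothetical two-state channel, closed <-> open, with exit rates in 1/s.
k_open, k_close = 100.0, 50.0   # closed->open and open->closed rate constants

random.seed(0)
# Dwell times in a state are exponentially distributed with the exit rate;
# here the open state has the single exit rate k_close.
open_dwells = [random.expovariate(k_close) for _ in range(100_000)]

mean_dwell = sum(open_dwells) / len(open_dwells)
print(f"mean open dwell: {mean_dwell * 1e3:.2f} ms "
      f"(theory: {1e3 / k_close:.2f} ms)")
```

The mean lifetime of a state equals the reciprocal of the sum of its exit rates; for a group of aggregated states, the simulated dwell-time histogram would instead show a mixture of exponentials.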
As a representation of the physical world, the Markov process is continuous in time but,
of necessity, observed in the discrete time domain. Typically, an analog-to-digital converter
samples a (noisy) observation of the Markov state at regular time intervals. What occurs between
two sampling points is unknown but can be statistically inferred. Due to the exponential nature of
the lifetime distribution, unobserved state transitions will occur between two sampling points.
The reconstitution of a continuous sequence of dwells from a discrete Markov signal is thus
plagued by the “missed events” problem (Ball and Sansom, 1988; Ball et al, 1993; Hawkes et al,
1990; Roux and Sauve, 1985).
A discrete time first-order Markov model is memory-less: the state at any time t is a
stochastic function of only the previous state. Due to the finite sampling interval, any state at any
time t can be followed by any other state at time t+1, where, for convenience, time is normalized
by the sampling interval Δt. Therefore, in the discrete time domain, transitions between any two
states are allowed within the sampling interval, while in continuous time only some transitions
are permitted. The continuous-time state transitions are quantified by the generator matrix,
denoted Q (Colquhoun and Hawkes, 1995), while the discrete-time state transitions are
quantified by the transition probability matrix, denoted A (Colquhoun and Sigworth, 1995a).
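The two matrices are related by the matrix exponential, A = exp(QΔt). The sketch below illustrates this for a hypothetical two-state model; the truncated Taylor series is an illustration only, and real analysis code would use a library routine such as scipy.linalg.expm:

```python
def expm_series(Q, dt, terms=40):
    """A = exp(Q*dt) via a truncated Taylor series; adequate for the
    small, well-scaled Q used in this illustration."""
    n = len(Q)
    M = [[Q[i][j] * dt for j in range(n)] for i in range(n)]
    A = [[float(i == j) for j in range(n)] for i in range(n)]   # identity
    term = [row[:] for row in A]
    for k in range(1, terms):
        # term <- term @ M / k accumulates the k-th Taylor term of exp(M)
        term = [[sum(term[i][m] * M[m][j] for m in range(n)) / k
                 for j in range(n)] for i in range(n)]
        A = [[A[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return A

# Hypothetical two-state generator matrix Q (rates in 1/s), 1 ms sampling.
k12, k21 = 100.0, 50.0
Q = [[-k12, k12], [k21, -k21]]
A = expm_series(Q, dt=1e-3)
print(A)   # each row sums to 1, as a transition probability matrix must
```

Note that even when Q forbids a direct transition (a zero off-diagonal rate), the corresponding entry of A is generally nonzero, because multiple transitions can occur within one sampling interval.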
It is easy to see how hidden, aggregated Markov models are an accurate description of
ion channel molecules. The continuum of conformational states the molecule may assume can be
reduced to a finite and usually small number of long-lived states. These states live long enough to
be detected with the available instrumentation, which is inherently limited in terms of bandwidth.
Historically, ion channels were modeled with two states, closed and open, but as instrumental
resolution was improved, more complex Markov models became necessary. For example,
modeling the acetylcholine receptor kinetics must account for closed and open states with zero,
one or two ligand molecules bound, and for desensitized states. All the closed states are
aggregated into a class of closed conductances and all the open states are aggregated into a class
of open conductances. The presence of visibly conducting substates required additional classes.
Many structure/function properties of the ion channel molecule can be inferred from the
single channel current by statistical analysis. Such analysis is generally conducted with the
aims of estimating conductance and kinetics. The conductance parameters describe the
characteristics of the current in each conductance class. Typically, the current is assumed to have
a Gaussian distribution with non-correlated noise, and a stationary mean and standard deviation.
The probability of an instantaneous transition between two kinetic states is quantified by a
macroscopic rate constant. The rate constant is a function of two intrinsic factors and of other
physical quantities (such as ligand concentration, transmembrane voltage, pressure, etc.) that act
as independent variables. The estimation of the intrinsic factors is the goal of kinetic analysis.
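One common functional form for such a rate constant (the constants below are hypothetical, chosen only to illustrate the separation between intrinsic factors and experimental variables):

```python
import math

def rate(k0, k1, ligand=1.0, voltage=0.0):
    """One common parameterization: k = k0 * [L] * exp(k1 * V), where the
    intrinsic factors k0 and k1 are estimated by kinetic analysis, and the
    ligand concentration [L] and voltage V are the independent variables."""
    return k0 * ligand * math.exp(k1 * voltage)

# Hypothetical ligand-binding rate: k0 = 1e6 /M/s, voltage-insensitive
# (k1 = 0), evaluated at 10 uM ligand concentration.
print(rate(1e6, 0.0, ligand=10e-6))   # roughly 10 events per second
```

The same intrinsic factors then predict the rate under any other experimental condition, which is what allows fitting across data sets recorded at different concentrations or voltages.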
In recent years, investigation of other single molecules has become possible. One example is
the class of motor proteins, including kinesin and myosin (Vale and Milligan, 2000). A
motor protein converts chemical energy into mechanical energy, by hydrolyzing ATP. The
mechanical energy is used for moving (usually in discrete steps) on a specific substrate, and
carrying an attached molecular load. Using fluorescence experiments, the position of individual
motor proteins can now be measured with accuracy, in the nanometer range (Selvin, 2002). In the
sense of allosteric behavior, a motor protein is no different from an ion channel, and can be
described with a hidden Markov model. Therefore, it seems natural to apply the same or
modified versions of the Markov model-based algorithms originally developed for ion channels
to the analysis of other single molecule experiments. With molecular motors, the recorded quantity is
either the position of the molecule or the force required to maintain its position. If no external
force is applied, each time the motor takes a step, the observed position changes. Thus, a
recording of molecular motor steps has a staircase appearance. In addition to conformation, the
state of the molecule must be augmented with the position on the substrate. The position takes
discrete values but the number of possible values can be large, sometimes essentially infinite if
the substrate is long. In principle, this requires a Markov model with a large number of states, but
since the chemistry of each step is the same, the model is reducible. Markov algorithms
developed for ion channel analysis can be modified to be applied to staircase data.
Often, the membrane patch in a single channel experiment contains several ion channels.
The superposition of an ensemble of ion channels can still be modeled with a Markov model
(Horn and Lange, 1983; Albertsen and Hansen, 1994; Qin et al, 1996; Klein et al, 1997;
Venkataramaran, 1998a). The superposition model has (many) more states, termed meta-states,
but the same number of conductance and kinetics parameters. While the number of free
parameters is the same, the degree of uncertainty in the data grows with the number of channels,
resulting in estimates of lower precision. Furthermore, the computation takes longer, depending
exponentially on the number of meta-states. In whole-cell experiments the number of channels
contributing to the measured signal is very large, on the order of thousands, tens of thousands, or
millions. Markov modeling by means of a superposition model is certainly possible, but
unrealistic. Thus, the traditional method of extracting kinetic parameters from whole-cell
recordings is the fitting of exponentials to mean currents.
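The growth of the meta-state count can be sketched with the standard multiset count for indistinguishable channels (the 3-state channel below is hypothetical):

```python
from math import comb

def meta_states(n_channels, n_states):
    """Number of distinguishable meta-states for n identical, independent
    channels, each with n_states Markov states: the multiset count
    C(n + s - 1, s - 1), i.e. the ways to distribute n channels among s states."""
    return comb(n_channels + n_states - 1, n_states - 1)

# A hypothetical 3-state channel:
for n in (1, 2, 5, 10, 100):
    print(n, "channels ->", meta_states(n, 3), "meta-states")
```

Even for a modest 3-state channel, 100 channels already give thousands of meta-states, which is why a literal superposition model becomes unrealistic at whole-cell channel counts.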
The present work extends the application of Markov models from single ion channels to
other single molecules and to ensembles of molecules. My perspective is essentially practical, and
all theoretical developments have been translated into algorithms and implemented in software.
Experimental data were not available for some approaches, and thus part of this work utilizes
computer simulated data. In the rest of this introductory chapter, I present my original
contributions to the field.
I have modified an existing method for maximum likelihood estimation of rate constants
from single channel data (Qin et al, 2000a). I call the new algorithm MIP (Maximum Idealized Point
likelihood). The modification extends the applicability of the method to non-stationary data. For
non-stationary (non-equilibrium) data, the starting probabilities are calculated from the current
estimate of the rate constants as the equilibrium probabilities under the conditioning stimulus. The
computation of the likelihood function is modified to accommodate as many sets of rate constants
as there are experimental conditions affecting the data. The derivatives of the likelihood function
are calculated analytically, as in the original version (Qin et al, 2000a). Other features of the
original method are preserved, such as the possibility of imposing linear constraints on the
estimation of rate constants. While the original algorithm works with raw data, MIP can also
analyze idealized data. By pre-calculation of certain quantities, the method is accelerated and
approaches the speed performance of the maximum interval likelihood method (Qin et al, 1996,
1997).
Data idealization is the procedure by which the underlying Markov signal is extracted
from the background noise. Along with the traditional half-amplitude threshold method, several
heuristic or hidden Markov algorithms exist (Chung et al, 1990, 1991; Schultze and Draber,
1994). I have combined the Forward-Backward procedure (Baum et al, 1970) and the Viterbi
algorithm (Viterbi, 1967; Forney, 1973) into a new method for idealization of data generated by
aggregated Markov models. The new procedure, termed Forward-Viterbi, finds the most likely
sequence of conductance classes. In comparison, Viterbi finds the most likely sequence of states.
Single channel records are often contaminated with baseline artifacts. Drift of the zero-level current due to cellular or instrumental causes, interference from the power line resulting in
60 Hz and harmonic contamination, or uncharacterized random fluctuations of the baseline level
may corrupt the experimental data. Data cannot be analyzed if the baseline uncertainty is too
large. While slow and constant drift can be corrected manually or with software (Sachs et al,
1982; Baumgartner et al, 1998; Chung et al, 1991), random fluctuations cannot. To improve
baseline tracking, I have modeled baseline fluctuations with a linear Gaussian model, also known
as the Kalman filter (Kalman, 1960). I have designed and implemented two maximum likelihood
methods that perform idealization with baseline correction for single channel data. They are based
on coupling the Kalman filter with either the Viterbi algorithm or the Forward-Backward
procedure. The resulting methods are optimal for genuinely stochastic baseline fluctuations but
also work well with deterministic drift or sinusoidal interference.
I have adapted hidden Markov model algorithms to staircase-style data. I replaced the
infinite state hidden Markov model with a truncated model reflecting the chemistry of each step.
The degree of truncation depends on the data. The discrete levels present are detected with a
modification of the baseline correction algorithm. The new idealization method deals with
wideband noise and baseline drift, and steps that need not have identical amplitude. A modified
version of the MIP algorithm estimates rate constants from the idealized data. Use of a truncated,
finite state model for kinetics estimation is complemented by a modification in the likelihood
function, by which the vector of state occupation probabilities can be reset after each data point.
All the features of the original algorithm are preserved, such as imposition of linear constraints on
rate constants, fitting across different experimental conditions, etc. To my knowledge, this is the
first time an infinite state-space Markov model with truncation has been applied to the study
of single molecules.
I have extended application of hidden Markov models to the kinetic analysis of ensemble
data, such as obtained in whole-cell recordings. Instead of fitting exponentials to extract time
constants, I develop a method for direct estimation of the rate constants themselves. More importantly, I use
a likelihood function that takes into account the sources of variance in data: fluctuations in the
number of channels occupying each state and the excess state noise. While the state noise is a
convolution of Gaussians, and therefore itself Gaussian, the channel fluctuations are described by
a multinomial probability distribution. Since an exact convolution of the two is not practically
feasible, I approximate the multinomial with a Gaussian, and obtain the convolution simply as
another Gaussian. The fluctuation (the variance) of the data around the conditional mean is a function of the
number of channels in the patch. The benefit of including the variance in the likelihood function
is that it makes possible a simultaneous estimation of rate constants and of the number of active
channels in a patch (or cell). This, to my knowledge, is a novel model-based procedure. The only
prerequisite for analysis remains, as expected, knowledge of the individual mean channel current
and its excess state noise, although detailed knowledge of the latter proves to be not critical. As a
matter of principle, to permit rate constants to be estimated, the data must be non-stationary. The
ensemble algorithm deals with non-stationarity in the same way as MIP. Importantly, since the
ensemble data has fewer details than single channel data, the number of free parameters in the
model must be reduced by constraints, and this has been implemented as with MIP and other
single-channel algorithms.
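A minimal sketch of the variance bookkeeping described above, for a hypothetical two-state channel (so the multinomial reduces to a binomial); the uniform per-channel excess-noise term is a simplification of my own, not a detail from the text:

```python
import math

def ensemble_moments(n_channels, p_open, i_unit, sigma_state=0.0):
    """Gaussian approximation of the ensemble current: the binomial count
    of open channels is replaced by a Gaussian with matched mean and
    variance, and a per-channel excess state noise (s.d. sigma_state,
    a simplifying assumption here) adds to the variance."""
    mean = n_channels * p_open * i_unit
    var = (n_channels * p_open * (1.0 - p_open) * i_unit ** 2
           + n_channels * sigma_state ** 2)
    return mean, var

# 10,000 hypothetical channels of 1 pA unit current, open probability 0.3:
mean, var = ensemble_moments(10_000, 0.3, 1.0e-12)
print(mean, math.sqrt(var))   # mean current (A) and its standard deviation
```

Because the variance term scales with the channel count, a likelihood built on these moments carries information about the number of active channels, which is what makes the simultaneous estimation possible.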
The following chapters are organized as follows: Chapter 2 gives a background on
dynamic state space models, Bayesian inference and standard algorithms; Chapter 3 describes the
MIP algorithm and introduces some concepts used later; Chapter 4 presents the procedure for
idealization and baseline correction of single channel data; Chapter 5 describes the new methods
for single molecule staircase data analysis; Chapter 6 presents the hidden Markov model
algorithm for kinetic modeling from ensemble data. The reader should take note that concepts
introduced in earlier chapters are not repeated. Therefore, chapters are best read in order.
Throughout this text, I use bold notation for vectors, matrices, sequences, sets and
symbols denoting other complex objects, such as parameters, optimized variables, etc. Matrices
and vectors are generally written in uppercase. Scalars and elements of vectors and matrices are
written in bold italic in text and plain italic in mathematical expressions, and generally in
lowercase.
2. Background
In this section I discuss the mathematical and algorithmic prerequisites. I start by
introducing dynamic state space models, of which the hidden Markov model (HMM) is an
important instance (Rabiner and Juang, 1986). I use the hidden Markov modeling framework to
derive new methods for direct estimation of rate constants from single channel data (Chapter 3),
single molecule fluorescence data (Chapter 5) and ensemble ion channel data (Chapter 6), and for
signal detection and baseline correction of single channel and single molecule recordings
(Chapters 4 and 5). I present the linear Gaussian model (LGM), also known as the Kalman filter
(Kalman, 1960) and apply it to single channel and single molecule data analysis. I combine the
linear Gaussian with the hidden Markov model into new algorithms for signal detection,
parameter re-estimation and baseline correction of single channel and single molecule records.
Next, I review briefly some concepts in Bayesian statistics (Bayes, 1763) pertinent to
dynamic state space models. The computation of the likelihood of a data set generated by a
dynamic state space model is presented in the light of Bayesian estimation (West and Harrison,
1997). I review two of the most important problems of dynamic state space models: state
inference or signal detection, and maximum likelihood and maximum a posteriori parameter
estimation. Finally, I present the dynamic programming algorithms used to solve these problems,
namely the Forward-Backward procedure (Baum et al, 1970), the Viterbi algorithm (Viterbi,
1967; Forney, 1973), the Kalman filter-smoother (Kalman, 1960; Rauch, 1963), and the
Expectation-Maximization algorithm (Dempster et al, 1977) and two of its instances, Baum-Welch (Baum et al, 1970) and Segmental k-means (Rabiner et al, 1986). I give a short description
of a new state inference algorithm for aggregated Markov models, which I call Forward-Viterbi.
2.1 Dynamic state space models
A dynamic state space model describes the time progression of a state variable X. The
state may not be directly observed, in which case it is termed hidden. Instead, a measurement
variable Y, which is a function of X, is observed. The state is hidden if a one-to-one mapping
between the state and observation does not exist. I will discuss the concept further in the next
section on Markov models. It suffices to say here that, when Y is a probabilistic function of X,
different states may give the same observation, with identical or different probability
distributions.
Models with both hidden and observed variables fall into the incomplete or missing data
category (Little and Rubin, 1987). X is the hidden or missing data and Y is the incomplete data.
Together, X and Y form the complete data. Parameter estimation from incomplete data could be
very difficult or even impossible but it is usually straightforward when the hidden state is known
or statistically inferred. X and Y take continuous or discrete values. In some cases, the variables
are mixed, with continuous and discrete values. The state and the measurement variables are
represented in continuous or discrete time. I focus the discussion on discrete time models.
A discrete time dynamic state space model is described by two equations. The process
equation gives the discrete time evolution of the state Xt at time t as a deterministic function F of
the previous states and previous measurements. A process random variable εt, characterized by a
probability density/mass function, adds stochastic innovation to the deterministic function F:

\[ X_t = F(X_0, X_1, \ldots, X_{t-1}, Y_0, Y_1, \ldots, Y_{t-1}) + \varepsilon_{t-1} \tag{2.1.1} \]
The measurement equation gives the measurement Yt at time t as a deterministic function
G of the current and previous states and of previous measurements. A measurement random
variable ηt, characterized by a probability density/mass function, adds noise to the measurement:

\[ Y_t = G(X_0, X_1, \ldots, X_t, Y_0, Y_1, \ldots, Y_{t-1}) + \eta_t \tag{2.1.2} \]
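As a minimal numerical instance of these two equations, consider a scalar, first-order case in which F and G depend only on the most recent state (a linear Gaussian example; the coefficients and noise levels below are hypothetical):

```python
import random

# A scalar, first-order instance of the process and measurement equations:
# X_t = a*X_{t-1} + eps_t, Y_t = c*X_t + eta_t, with Gaussian noises.
a, c = 0.99, 1.0              # process and measurement coefficients
sigma_e, sigma_n = 0.1, 0.5   # process and measurement noise s.d.

random.seed(1)
x = 0.0
states, observations = [], []
for t in range(200):
    x = a * x + random.gauss(0.0, sigma_e)    # process equation
    y = c * x + random.gauss(0.0, sigma_n)    # measurement equation
    states.append(x)
    observations.append(y)

print(len(observations), "observations generated")
```

Only the `observations` list would be available to an analysis algorithm; recovering `states` from it is exactly the state inference problem discussed below.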
Dynamic state space models are powerful representations of physical phenomena, and
enjoy a widespread application in both engineering and science. This is especially true for the
hidden Markov model (HMM) and the linear Gaussian model (LGM).
A useful representation of the conditional structure of dynamic state space models is
through graphical models (Jordan, 1998; Lauritzen, 1996; Whittaker, 1990). Models are
represented by acyclic graphs (Figure 2.1.1): nodes indicate state or measurement variables and
arrows indicate conditional dependencies. The variables could be discrete or continuous, and the
conditional dependencies could be probabilistic or deterministic.
[Figure 2.1.1: two chain graphs, x(t-1) → x(t) → x(t+1) with emissions y(t-1), y(t), y(t+1), one labeled HMM and one labeled LGM.]
Figure [2.1.1]. Graphical representation of two types of dynamic state space models: hidden Markov
model (HMM) and linear Gaussian model (LGM). Circles indicate continuous and squares indicate
discrete states or measurements. Clear symbols represent hidden and shaded symbols represent
observed variables. The observation at time t, yt, is a probabilistic function of the state xt at time t.
The basic difference between the two models is the nature of the states: discrete for HMM and
continuous for the LGM.
2.1.1 Hidden Markov model
A discrete state, discrete time, first order hidden Markov model describes a stochastic,
memory-less process. The state variable X takes any of NS discrete values, say the integers from 1
to NS. NS is the total number of states of the model. The state X is hidden and the process is
observed through an observation variable Y. Y is a continuous or discrete probabilistic function
of X. I consider here the case of a scalar Y. Although the process could be continuous in nature, it
is observed (sampled) at discrete times only. Usually, an analog-to-digital converter records Y,
with a sampling time Δt. The discrete process is represented as a temporal chain of states. The
state at time t+Δt, X_{t+1}, depends solely on the state at time t, X_t.
X is conventionally represented as a vector of dimension NS. When applied to ion
channels (or other single molecules), each conformational state of the channel is mapped into one
of the NS discrete values of X. Y is the measured current, generally assumed to have a Gaussian
probability distribution, with mean and variance dependent on the state X.
The state vector can store the absolute knowledge about the state. In this case, each
element of X is zero, except the element corresponding to the known state, which is one. X can
also represent the belief, or relative knowledge, about each state. Then element i of X is
equal to the probability that the model is in state i. In both situations, the sum of all elements is 1.
The discrete time HMM is parameterized by three parameters (Rabiner, 1989): the initial
state probabilities, the state transition probabilities and the emission probabilities. I denote by Θ
the complete set of parameters:

$$\Theta = \{\boldsymbol{\pi}, \mathbf{A}, \mathbf{B}\} \quad [2.1.1.1]$$
π is the initial state probability vector, of dimension N_S. Each element of π, π_i, is the a
priori probability that the Markov chain starts in state i. With the obvious constraints that each
element is between zero and one and that the sum of all elements equals 1, π is a column vector¹:

$$\pi_i = P(x_0 = i) \quad [2.1.1.2]$$
$$0 \le \pi_i \le 1 \quad [2.1.1.3]$$
$$\sum_{i=1}^{N_S} \pi_i = 1 \quad [2.1.1.4]$$
There might be a priori reasons to assign certain values to each of the initial state probabilities.
For example, in the case of ion channel pulse-protocol experiments, one typically expects the
channel to start in a particular state. Thus, one assigns probability one to that state and zero to
others. In contrast, for equilibrium experiments, the only reasonable assumption is that the initial
probabilities are equal to the equilibrium occupancy probabilities.
A is the transition probability matrix, of dimension N_S × N_S. Each element of A, a_ij, is the
conditional probability that the model is in state j at time t+Δt, if it was in state i at time t. Each
element takes values between zero and one and the elements of each row sum to one:

$$a_{ij} = P(x_{t+1} = j \mid x_t = i) \quad [2.1.1.5]$$
$$0 \le a_{ij} \le 1 \quad [2.1.1.6]$$
$$\sum_{j=1}^{N_S} a_{ij} = 1 \quad [2.1.1.7]$$
¹ A probability vector is termed stochastic. Likewise, a matrix with each row a probability vector
is termed row-stochastic.
A first order HMM has no memory of previous states. Thus, the state at time t+1 depends
solely on the state at time t (and not on previous states). The memory-less property is expressed
as:
Pxt 1 | x0 , x1 ,..., xt   Pxt 1 | xt 
[2.1.1.8]
Note that a higher order Markov model (of order k) does have memory of previous (k) states but
can still be reduced to a first order model by expanding the state space.
B is the vector of emission probability distributions, of dimension NS. Each element of B,
bi, represents the probability density/mass function of observing data y, conditional on state x=i.
For ion channels, the common assumption is that the emission probabilities are Gaussian and that
the noise is non-correlated (white). Non-correlated noise implies that the conditional probability
density of observing data at time t depends on the current state only and not on previous
measurements (nor on previous states):
p yt | x0 , x1 ,..., xt , y0 , y1 ,..., yt 1   p yt | xt 


p yt | xt  i   N i , i2 | yt
[2.1.1.9]
[2.1.1.10]
Noise correlation can be eliminated by expanding the Markov states (Qin et al, 2000b).
The hidden character of a Markov model is conferred first of all by the indirect
observation of state X through the measurement Y. This indirect observation is a probabilistic,
many-to-one mapping. Distinct states may probabilistically give the same observation.
Furthermore, several states may share the same observation probability density, for example
Gaussian distributions with identical mean and variance. These states are termed aggregated. An
aggregated Markov model is hidden even in absence of noise, because distinct states in the same
aggregation class give measurements with identical characteristics. Thus aggregation adds further
uncertainty to the problem of state estimation. For ion channels, the NS conformational states are
partitioned in NC conductance classes, based on recorded current mean and variance. The states
in the same conductance class cannot generally be distinguished, although states with the same
mean amplitude can be distinguished if their variances differ.
Although the state transition probabilities are governed by the A matrix, the primary
quantities of physical interest are the rate constants. These are conventionally expressed as a
N_S × N_S rate matrix Q (also termed the generator matrix). Q has the properties that each
off-diagonal element, q_ij, is the macroscopic rate constant of the direct transition between state i and
state j, and that each diagonal element, q_ii, is the negative sum of the off-diagonal elements of row
i, so that the sum of each row is zero. When there is no direct transition between i and j, q_ij equals
zero:

$$0 \le q_{ij} \quad (i \ne j) \quad [2.1.1.11]$$
$$q_{ii} = -\sum_{j \ne i} q_{ij} \quad [2.1.1.12]$$
I illustrate the significance of Q and A and their relationship for a two state Markov model,
applied to a two state ion channel protein, as follows:
$$C \; \underset{q_{21}}{\overset{q_{12}}{\rightleftharpoons}} \; O \quad [2.1.1.13]$$
The closed and the open conformations of the ion channel are denoted C and O, respectively. The
channel opening and closing rates are q12 and q21, respectively. I use a basic result in probability
theory (see, for example, Cox and Miller, 1965): if the probability of a discrete event A depends
on a second discrete event B, then P(A) is obtained by marginalization with respect to B:
P A   P A, B 
[2.1.1.14]
B
P A   P A | B   PB 
[2.1.1.15]
B
Suppose B and A are the state probabilities at time t and t+Δt, respectively, each with two possible
values: closed and open. Then A is clearly a function of B. Thus the probabilities associated with
each possible state at time t+Δt are:
PCt t   PCt t | Ct   PCt   PCt t | Ot   POt 
[2.1.1.16]
POt t   POt t | Ct   PCt   POt t | Ot   POt 
[2.1.1.17]
Since at any time, the channel is either in C or in O, the following relations are obvious:
PCt   POt   1
[2.1.1.18]
PCt t   POt t   1
[2.1.1.19]
PCt  t | Ct   1  POt  t | Ct 
[2.1.1.20]
POt  t | Ot   1  PCt  t | Ot 
[2.1.1.21]
22
The expected number of transitions from state C into state O, and from state O into state
C, in a unit of time, is equal to q12 and q21, respectively. Hence, the expected number of
transitions within a time interval Δt is equal to q12·Δt and q21·Δt, respectively. q12 and q21 have
dimension of frequency (s⁻¹), while q12·Δt and q21·Δt are dimensionless. Δt can be chosen small
enough as to make q12·Δt and q21·Δt smaller than one. In the limiting case, when Δt→0,
q12·Δt and q21·Δt are, respectively, the conditional probabilities that the channel makes a transition
from C to O within Δt, given that it was in C at time t, and likewise from O to C, given that it
was in O:
POt t | Ct   q12  t
[2.1.1.22]
PCt t | Ot   q21  t
[2.1.1.23]
After substitution in [2.1.1.16] and [2.1.1.17], the state probabilities at t+t are given by:
PCt  t   PCt   q12  t  PCt   q21  t  POt 
[2.1.1.24]
POt  t   POt   q12  t  PCt   q21  t  POt 
[2.1.1.25]
The terms in [2.1.1.24] and [2.1.1.25] can be rearranged and divided by Δt:

$$\frac{P(C_{t+\Delta t}) - P(C_t)}{\Delta t} = -q_{12} \cdot P(C_t) + q_{21} \cdot P(O_t) \quad [2.1.1.26]$$
$$\frac{P(O_{t+\Delta t}) - P(O_t)}{\Delta t} = q_{12} \cdot P(C_t) - q_{21} \cdot P(O_t) \quad [2.1.1.27]$$
As t0, the following system of differential equations results:
23
dPC
 q12  PC  q21  PO
dt
[2.1.1.28]
dPO
 q12  PC  q21  PO
dt
[2.1.1.29]
The system can be written in a more compact, matrix form, using the Q matrix as defined
previously:

$$\frac{d}{dt}\begin{bmatrix} P_C & P_O \end{bmatrix} = \begin{bmatrix} P_C & P_O \end{bmatrix} \cdot \begin{bmatrix} -q_{12} & q_{12} \\ q_{21} & -q_{21} \end{bmatrix} \quad [2.1.1.30]$$

$$\frac{d\mathbf{P}^T}{dt} = \mathbf{P}^T \cdot \mathbf{Q} \quad [2.1.1.31]$$

The superscript T denotes vector transposition (the state vector is in column format). The above
expression is known as the Kolmogorov differential equation, with the Chapman-Kolmogorov
solution:

$$\mathbf{P}_t^T = \mathbf{P}_0^T \cdot \exp(\mathbf{Q} \cdot t) \quad [2.1.1.32]$$
P0 is the initial probabilities vector. The derivation can be extended easily to models with any
number of states. Note that the equations governing the time evolution of a single molecule are
identical to the equations for a large number of molecules, as known from chemical kinetics.
For a discrete time hidden Markov model, the sampling time Δt is substituted for t in
[2.1.1.32], giving the transition probability matrix A as follows:

$$\mathbf{A} = \exp(\mathbf{Q} \cdot \Delta t) \quad [2.1.1.33]$$
The state at any time t+1 can be predicted from the state at time t. More generally, if the state at
any time t0 is known, then the state at any time t>t0 can be predicted:

$$\mathbf{X}_{t+1}^T = \mathbf{X}_t^T \cdot \mathbf{A} \quad [2.1.1.34]$$
$$\mathbf{X}_{t+k}^T = \mathbf{X}_t^T \cdot \mathbf{A}^k \quad [2.1.1.35]$$
$$\mathbf{X}_t^T = \mathbf{X}_{t_0}^T \cdot \exp(\mathbf{Q} \cdot (t - t_0)) \quad [2.1.1.36]$$
The exponential of a matrix M is formally defined with the following power series:

$$\exp(\mathbf{M}) = \sum_{i=0}^{\infty} \frac{\mathbf{M}^i}{i!} \quad [2.1.1.37]$$
If the eigenvalues of the Q matrix are all distinct (most ion channel models have distinct
eigenvalues, one zero and the others negative), the matrix exponential A can be efficiently
obtained from Q by spectral decomposition (Moler and Van Loan, 1978):

$$\mathbf{A} = \sum_{i=1}^{N_S} \mathbf{A}_i \cdot \exp(\lambda_i \cdot \Delta t) \quad [2.1.1.38]$$

The A_i's are the spectral matrices, obtained from eigenvectors, and the λ_i's are the eigenvalues of Q.
Note that, in general, no element of A is zero. Even when a direct transition between
states i and j is not allowed (q_ij = 0), there is a finite probability that within Δt such a transition
occurs (a_ij > 0), using alternate pathways. As Δt approaches zero, the diagonal elements of A
approach 1, whereas the off-diagonal elements approach zero (the exponential of a zero matrix is
the identity matrix):

$$\lim_{\Delta t \to 0} a_{ii} = 1 \quad [2.1.1.39]$$
$$\lim_{\Delta t \to 0} a_{ij} = 0 \quad [2.1.1.40]$$
In contrast, when t approaches infinity, the elements of A approach the equilibrium occupancy
probabilities:
lim aij  Pj
[2.1.1.41]
t 
Process and measurement equations for a hidden Markov model can be written as:

process equation: $$\mathbf{X}_t \sim \mathrm{Multinomial}(\mathbf{X}_{t-1}^T \cdot \mathbf{A}) \quad [2.1.1.42]$$

measurement equation: $$Y_t \sim \mathrm{Gaussian}(\mathbf{X}_t^T \cdot \boldsymbol{\mu}, \; \mathbf{X}_t^T \cdot \mathbf{V}) \quad [2.1.1.43]$$
For a model with two states, the probability distribution of X_t is binomial, with mean equal to
X_{t-1}^T·A. When flipping a coin, the binomial probability of an outcome of heads or tails is constant
and does not depend on the previous flip. In comparison, the probability of state occurrence for a
Markov model does depend on the previous state. For a model with more than two states, the
state probability is described by a multivariate binomial, i.e. the multinomial distribution. μ and V
are vectors of dimension N_S. Each element of μ and V, μ_i and V_i, is equal to the mean and the
variance of the Gaussian emission probability distribution for state i, respectively.
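Equations [2.1.1.42] and [2.1.1.43] translate directly into a simulation loop. A minimal sketch follows; the function name and the example parameters are illustrative, not taken from the thesis data:

```python
import numpy as np

def simulate_hmm(pi, A, mu, sigma, T, rng):
    """Draw a state path from the process equation [2.1.1.42] and
    Gaussian observations from the measurement equation [2.1.1.43]."""
    x = np.empty(T, dtype=int)
    x[0] = rng.choice(len(pi), p=pi)
    for t in range(1, T):
        # multinomial draw with parameter given by row x[t-1] of A
        x[t] = rng.choice(A.shape[0], p=A[x[t - 1]])
    y = rng.normal(mu[x], sigma[x])  # state-dependent mean and standard deviation
    return x, y
```

For a two-state channel one would pass, for example, pi = (1, 0) to start in the closed state, a row-stochastic A obtained from Q, and per-class current means mu = (0, 1).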
The reader is reminded here of the significance of the multinomial distribution for discrete
state models. The multinomial is parameterized by a probability vector Φ. Each element of Φ, Φ_i, is
equal to the probability of occurrence for discrete outcome i, out of N_S possible outcomes. For
example, the probabilities that a two state ion channel, at equilibrium, is closed (C) or open (O)
are equal to:

$$P(C_{eq} \mid \mathbf{P}_{eq}) = P_{eq,C} \quad [2.1.1.44]$$
$$P(O_{eq} \mid \mathbf{P}_{eq}) = P_{eq,O} = 1 - P_{eq,C} \quad [2.1.1.45]$$
The equilibrium state occupancy vector, P_eq, plays the role of Φ. The multinomial distribution can
be applied to ensemble models too. For example, for a patch with N_C two state, independent
channels at equilibrium, the probability that n_C channels are closed and n_O channels are open is
equal to:

$$P(n_{C,eq}, n_{O,eq} \mid \mathbf{P}_{eq}) = \frac{N_C!}{n_C! \, n_O!} \, P_{eq,C}^{\,n_C} \, P_{eq,O}^{\,n_O} \quad [2.1.1.46]$$
For the general case, when N_S discrete outcomes are possible, the multinomial
distribution gives the occurrence probability of a set of counts (n_1, n_2, …, n_{N_S}), in N_C independent
trials, as follows:

$$P(n_1, n_2, \ldots, n_{N_S} \mid \boldsymbol{\Phi}) = \frac{N_C!}{\prod_{i=1}^{N_S} n_i!} \prod_{i=1}^{N_S} \Phi_i^{\,n_i} \quad [2.1.1.47]$$

The fraction in the right-hand side of the expression above is known as the multinomial
coefficient.
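Expression [2.1.1.47] can be checked numerically; a minimal sketch (the function name is illustrative):

```python
from math import factorial, prod

def multinomial_pmf(counts, phi):
    """Probability of a set of counts in sum(counts) independent trials
    (eq. 2.1.1.47): the multinomial coefficient times prod(phi_i ** n_i)."""
    coef = factorial(sum(counts)) // prod(factorial(n) for n in counts)
    return coef * prod(p ** n for n, p in zip(counts, phi))
```

For the patch example [2.1.1.46] with N_C = 3 independent two-state channels and P_eq = (0.5, 0.5), the probability of 2 closed and 1 open channels is 3 · 0.5³ = 0.375.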
In the examples above, P_eq can be replaced with a conditional state probabilities vector.
Thus, given the state probabilities vector at time t, P_t, the state at t+1 is described by a
multinomial with parameter P_t^T·A. Thus, the conditional probability that the channel is in state i at
time t+1, given P_t, is equal to:

$$P(x_{t+1} = i \mid \mathbf{P}_t) = \left(\mathbf{P}_t^T \cdot \mathbf{A}\right)_i \quad [2.1.1.48]$$
The univariate Gaussian probability distribution is parameterized by the mean μ and the
standard deviation σ (the square root of the variance), as follows:

$$N(\mu, \sigma^2 \mid x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right) \quad [2.1.1.49]$$
The multivariate Gaussian probability distribution is parameterized by the mean vector μ and the
covariance matrix V, as follows:

$$N(\boldsymbol{\mu}, \mathbf{V} \mid \mathbf{x}) = \frac{1}{\sqrt{\left|2\pi\mathbf{V}\right|}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \cdot \mathbf{V}^{-1} \cdot (\mathbf{x}-\boldsymbol{\mu})\right) \quad [2.1.1.50]$$

The diagonal elements of V, V_ii, represent the variance of each dimension of the Gaussian. The
off-diagonal elements of V, V_ij, represent the covariance between dimensions i and j of the
Gaussian.
2.1.2 Linear Gaussian model (LGM)
The state X of a discrete time linear Gaussian model (LGM) is a vector of dimension d_x,
indirectly observed through the noisy observation vector Y, of dimension d_y. X and Y take
continuous values. A linear Gaussian model is completely specified by the set of parameters Θ
and by the process and measurement equations (Roweis and Ghahramani, 1999):

$$\Theta = \{\mathbf{x}_0, \mathbf{V}_0, \mathbf{A}, \mathbf{C}, \mathbf{Q}, \mathbf{R}\} \quad [2.1.2.1]$$

process equation: $$\mathbf{X}_t = \mathbf{A} \cdot \mathbf{X}_{t-1} + \boldsymbol{\varepsilon}_{t-1} \quad [2.1.2.2]$$

measurement equation: $$\mathbf{Y}_t = \mathbf{C} \cdot \mathbf{X}_t + \boldsymbol{\eta}_t \quad [2.1.2.3]$$
x_0 and V_0 represent the a priori knowledge about the initial state of the process. The
initial state X_0 is generally not known exactly and thus it is specified as a Gaussian probability
density function, with mean vector x_0 and covariance matrix V_0:

$$\mathbf{X}_0 \sim N(\mathbf{x}_0, \mathbf{V}_0) \quad [2.1.2.4]$$

The more precisely the initial state X_0 is known, the lower the entries in the covariance matrix V_0,
and vice versa. A is a d_x × d_x state transition matrix, which is not necessarily row stochastic. C is
the observation matrix, of dimension d_y × d_x. ε is the process random Gaussian variable, with zero
mean and covariance matrix Q, of dimension d_x × d_x. η is the measurement random Gaussian
variable, with zero mean and covariance matrix R, of dimension d_y × d_y.

The LGM is a memory-less process, in the sense that the state at time t+1 is a function of
the current state only (similar to the HMM):

$$p(\mathbf{X}_{t+1} \mid \mathbf{X}_0, \mathbf{X}_1, \ldots, \mathbf{X}_t, \mathbf{Y}_0, \mathbf{Y}_1, \ldots, \mathbf{Y}_{t-1}) = p(\mathbf{X}_{t+1} \mid \mathbf{X}_t) \quad [2.1.2.5]$$
From the process equation [2.1.2.2] it is obvious that the conditional probability density of X_t
given X_{t-1} is a multivariate Gaussian:

$$p(\mathbf{X}_t \mid \mathbf{X}_{t-1}) = N(\mathbf{A} \cdot \mathbf{X}_{t-1}, \mathbf{Q}) \quad [2.1.2.6]$$

The conditional probability density of X_t given a previous state specified with a probability
distribution of mean x_{t-1} and variance V_{t-1} is another multivariate Gaussian:

$$p(\mathbf{X}_t \mid \mathbf{x}_{t-1}, \mathbf{V}_{t-1}) = N\left(\mathbf{A} \cdot \mathbf{x}_{t-1}, \; \mathbf{Q} + \mathbf{A} \cdot \mathbf{V}_{t-1} \cdot \mathbf{A}^T\right) \quad [2.1.2.7]$$
The observation noise is non-correlated (white), thus the measurement at time t, Y_t, depends only
on the state at time t:

$$p(\mathbf{Y}_t \mid \mathbf{X}_0, \mathbf{X}_1, \ldots, \mathbf{X}_t, \mathbf{Y}_0, \mathbf{Y}_1, \ldots, \mathbf{Y}_{t-1}) = p(\mathbf{Y}_t \mid \mathbf{X}_t) \quad [2.1.2.8]$$

The conditional probability density of observing Y_t given X_t is a multivariate Gaussian:

$$p(\mathbf{Y}_t \mid \mathbf{X}_t) = N(\mathbf{C} \cdot \mathbf{X}_t, \mathbf{R}) \quad [2.1.2.9]$$

The conditional probability density of observing Y_t given a state specified as a probability
distribution with mean x_t and variance V_t is another multivariate Gaussian:

$$p(\mathbf{Y}_t \mid \mathbf{x}_t, \mathbf{V}_t) = N\left(\mathbf{C} \cdot \mathbf{x}_t, \; \mathbf{R} + \mathbf{C} \cdot \mathbf{V}_t \cdot \mathbf{C}^T\right) \quad [2.1.2.10]$$
2.2 Bayesian inference
Essentially, Bayesian theory provides a way to refine a prior belief or
knowledge about a quantity (for example, a priori parameter estimates) by taking into account
additional information (for example, experimental data). Most algorithms for dynamic state space
models rely on Bayesian formalism (Baum et al, 1970; Berger, 1985; West and Harrison, 1997;
Ghahramani, 2001). The fundamental result of Bayesian inference is written as (Bayes, 1783;
Box and Tiao, 1973):
$$\mathrm{Posterior} = \frac{\mathrm{Likelihood} \times \mathrm{Prior}}{\mathrm{Evidence}} \quad [2.2.1]$$
The Bayesian theorem can be applied to parameter estimation (Gelman et al, 1995;
Jensen, 1996; West and Harrison, 1997):
pParameters | Data  
pData | Parameters  pParameters
pData 
[2.2.2]
The prior, p(Parameters), represents the a priori knowledge about Parameters, before
observing data. The likelihood, p(Data|Parameters), represents the conditional probability of
having observed the data, given the parameters. The posterior, p(Parameters|Data), represents
the a posteriori knowledge about parameters, after the data was observed. The evidence, p(Data),
is the probability of having observed the Data, regardless of parameters. The evidence is
computed from the conditional probability of observing data given parameters, integrated over all
possible values of the parameters:

$$p(\mathrm{Data}) = \int p(\mathrm{Data} \mid \mathrm{Parameters}) \cdot p(\mathrm{Parameters}) \, d\mathrm{Parameters} \quad [2.2.3]$$
It should be noted that the posterior, prior, likelihood and evidence are, in general, probability
distributions.
Expression [2.2.2] can be understood in terms of joint and conditional probabilities. If A
and B are two probabilistic events, then the following relations hold (Cox and Miller, 1965):

$$A = f(B): \quad P(A, B) = P(A \mid B) \cdot P(B) \quad [2.2.4]$$
$$P(A) = \sum_B P(A, B) \quad [2.2.5]$$
$$P(A, B) = P(B, A) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A) \quad [2.2.6]$$
$$A = f(B): \quad P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \quad [2.2.7]$$
$$A \ne f(B): \quad P(A, B) = P(A) \cdot P(B) \quad [2.2.8]$$

P(A,B) is the joint probability of A and B, and P(A|B) is the conditional probability of A, given B.
The expression of the posterior [2.2.2] can be derived along the same lines, by replacing A with
Parameters and B with Data, in expression [2.2.7]. The evidence [2.2.3] is obtained from
[2.2.5], with the same replacements. If the parameter space is continuous, the sum is replaced
with an integral.
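For a discrete parameter space, [2.2.2] and [2.2.3] reduce to a normalized product of likelihood and prior. A toy numeric sketch, with likelihood values assumed purely for illustration:

```python
import numpy as np

# Two candidate parameter values with a flat prior
prior = np.array([0.5, 0.5])
# p(Data | theta_k): assumed numbers, standing in for an actual likelihood calculation
likelihood = np.array([0.8, 0.2])

evidence = np.sum(likelihood * prior)       # eq. 2.2.3, summed over the parameter space
posterior = likelihood * prior / evidence   # eq. 2.2.2
```

The posterior here is (0.8, 0.2): with a flat prior, the posterior simply renormalizes the likelihood, which is why maximum a posteriori and maximum likelihood estimation coincide in that case (section 2.2.3).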
2.2.1 Likelihood calculation
I apply now Bayesian inference to hidden Markov and linear Gaussian models. I use a
parametric model M (either HMM or LGM), a parameter estimate θ, a sequence of observations
Y = {y0,…,yt,…,yT} and a sequence of hidden states X = {x0,…,xt,…,xT}. In view of parameter
estimation, I write Bayes' theorem as:

$$p(\theta \mid \mathbf{Y}, M) = \frac{p(\mathbf{Y} \mid \theta, M) \cdot p(\theta \mid M)}{p(\mathbf{Y} \mid M)} \quad [2.2.1.1]$$
The terms in the right-hand side of equation [2.2.1.1] are not single numbers but
probability distributions. Therefore, the posterior, p(θ|Y,M), must be interpreted as a probability
distribution too. For clarity of reading, I eliminate the model M from further expressions.
Nevertheless, M should be considered an implicit factor.
Both the likelihood, p(Y|θ), and the evidence, p(Y), are functions of the hidden state X.
Since X is not known, due to measurement uncertainty (noise) and possibly aggregation, p(Y|θ) is
termed the incomplete data likelihood:

$$l_{IC} = p(\mathbf{Y} \mid \theta) \quad [2.2.1.2]$$

Computation of the incomplete data likelihood requires marginalization with respect to
X, by which all possible sequences X and their intrinsic probability of occurrence are accounted
for. Then, for a continuous process, the incomplete data likelihood is:

$$l_{IC} = \int_X p(\mathbf{Y}, \mathbf{X} \mid \theta) \, d\mathbf{X} \quad [2.2.1.3]$$
lIC   pY | X ,θ  pX | θdX
[2.2.1.4]
X
For hidden Markov models, where the state takes discrete values, the integral over the
state-space X is replaced with a sum, as follows:
lIC   pY | X ,θ pX | θ
[2.2.1.5]
X
The complete data likelihood is defined as the conditional joint probability of observing
data Y and a state sequence X, given parameters θ:

$$l_{C,X} = p(\mathbf{Y}, \mathbf{X} \mid \theta) \quad [2.2.1.6]$$
$$l_{C,X} = p(\mathbf{Y} \mid \mathbf{X}, \theta) \cdot p(\mathbf{X} \mid \theta) \quad [2.2.1.7]$$
For a memory-less hidden Markov or linear Gaussian model, with non-correlated noise,
the two terms in the right-hand side of equation [2.2.1.7] are expanded into:

$$p(\mathbf{Y} \mid \mathbf{X}, \theta) = \prod_t p(y_t \mid y_0, \ldots, y_{t-1}, y_{t+1}, \ldots, y_T, x_0, \ldots, x_T, \theta) \quad [2.2.1.8]$$
$$p(\mathbf{X} \mid \theta) = \prod_t p(x_t \mid y_0, \ldots, y_T, x_0, \ldots, x_{t-1}, x_{t+1}, \ldots, x_T, \theta) \quad [2.2.1.9]$$

Due to the memory-less character and to non-correlated noise, the following simplifications
apply:

$$p(y_t \mid y_0, \ldots, y_{t-1}, y_{t+1}, \ldots, y_T, x_0, \ldots, x_T, \theta) = p(y_t \mid x_t, \theta) \quad [2.2.1.10]$$
$$t > 0: \quad p(x_t \mid y_0, \ldots, y_T, x_0, \ldots, x_{t-1}, x_{t+1}, \ldots, x_T, \theta) = p(x_t \mid x_{t-1}, \theta) \quad [2.2.1.11]$$
$$t = 0: \quad p(x_t \mid y_0, \ldots, y_T, x_0, \ldots, x_{t-1}, x_{t+1}, \ldots, x_T, \theta) = p(x_0 \mid \theta) \quad [2.2.1.12]$$
The complete data likelihood takes the form:

$$l_{C,X} = p(y_0 \mid x_0, \theta) \cdot p(x_0 \mid \theta) \cdot \prod_{t=1}^{T} p(y_t \mid x_t, \theta) \cdot p(x_t \mid x_{t-1}, \theta) \quad [2.2.1.13]$$

Expression [2.2.1.13] has specific forms for HMM and for LGM:

HMM: $$l_{C,X} = b_{x_0}(y_0) \cdot \pi_{x_0} \cdot \prod_{t=1}^{T} b_{x_t}(y_t) \cdot a_{x_{t-1} x_t} \quad [2.2.1.14]$$

LGM: $$l_{C,X} = N(\mathbf{C} \cdot \mathbf{x}_0, \mathbf{R} \mid \mathbf{y}_0) \cdot N(\mathbf{x}_0, \mathbf{V}_0 \mid \mathbf{x}_0) \cdot \prod_{t=1}^{T} N(\mathbf{C} \cdot \mathbf{x}_t, \mathbf{R} \mid \mathbf{y}_t) \cdot N(\mathbf{A} \cdot \mathbf{x}_{t-1}, \mathbf{Q} \mid \mathbf{x}_t) \quad [2.2.1.15]$$
The incomplete data likelihood for hidden Markov models is calculated differently and
will be presented in section 2.3.1.1.
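For a known state path, the HMM form [2.2.1.14] is a direct product of initial, transition, and emission terms, most safely evaluated in log space. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def hmm_complete_loglik(x, y, pi, A, mu, sigma):
    """log l_C,X per eq. 2.2.1.14:
    log pi_{x0} + log b_{x0}(y0) + sum over t of
    [log a_{x(t-1),x(t)} + log b_{x(t)}(y_t)]."""
    def log_b(i, obs):
        # log Gaussian emission density (eq. 2.1.1.10)
        return (-0.5 * np.log(2 * np.pi * sigma[i] ** 2)
                - 0.5 * ((obs - mu[i]) / sigma[i]) ** 2)
    ll = np.log(pi[x[0]]) + log_b(x[0], y[0])
    for t in range(1, len(x)):
        ll += np.log(A[x[t - 1], x[t]]) + log_b(x[t], y[t])
    return ll
```

Working in logs turns the long product into a sum, which anticipates the numerical underflow issue addressed for the Forward-Backward procedure in Chapter 3.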
In order to compute the evidence, p(Y) must be marginalized with respect to both state X
and parameters θ. This is accomplished by integrating the incomplete data likelihood with respect
to θ. Expression [2.2.1.1] changes to:

$$p(\theta \mid \mathbf{Y}) = \frac{\left(\int_X p(\mathbf{Y} \mid \mathbf{X}, \theta) \cdot p(\mathbf{X} \mid \theta) \, d\mathbf{X}\right) \cdot p(\theta)}{\int_\theta \left(\int_X p(\mathbf{Y} \mid \mathbf{X}, \theta) \cdot p(\mathbf{X} \mid \theta) \, d\mathbf{X}\right) \cdot p(\theta) \, d\theta} \quad [2.2.1.16]$$
2.2.2 State inference
State inference is the procedure by which probabilities (for discrete states) or probability
densities (for continuous states) are assigned to all possible states. The most likely sequence of
states, X^ML, is of foremost importance. X^ML is defined as the argument which maximizes the
complete data likelihood for a given parameter estimate θ:

$$\mathbf{X}^{ML} = \arg\max_X \, p(\mathbf{Y} \mid \mathbf{X}, \theta) \cdot p(\mathbf{X} \mid \theta) \quad [2.2.2.1]$$
The most likely sequence of states has slightly different meanings for hidden Markov and
for linear Gaussian models. The X^ML of an HMM is simply a sequence of numbers, i.e. the state
index at each time. LGMs have continuous states and thus, by necessity, X^ML is a sequence of
probability densities. The increase in complexity is, however, minimal, since the probability
densities are Gaussian and can be specified, regardless of sequence length, with only two
parameters, namely the mean, x_t, and the variance, V_t. This is a direct consequence of the
properties of random Gaussian variables (Cox and Miller, 1965): (i) a linear transformation of a
Gaussian variable is another Gaussian, and (ii) the sum of two Gaussian variables is another
Gaussian (i.e. the convolution of two Gaussian densities is a Gaussian density).
2.2.3 Parameter estimation
The maximum a posteriori (MAP) parameter estimate, θ^MAP, is, by definition, the
argument θ that maximizes the posterior in expression [2.2.1.1]:

$$\theta^{MAP} = \arg\max_\theta \, p(\theta \mid \mathbf{Y}) \quad [2.2.3.1]$$

Since the denominator in expression [2.2.1.1] is not a function of θ, it can be dropped from
maximization:

$$\theta^{MAP} = \arg\max_\theta \, p(\mathbf{Y} \mid \theta) \cdot p(\theta) \quad [2.2.3.2]$$
$$\theta^{MAP} = \arg\max_\theta \, l_{IC} \cdot p(\theta) \quad [2.2.3.3]$$

The expressions above are obviously functions of the unknown state sequence X. When the state
sequence is known, a MAP estimate is obtained using the complete data likelihood, l_C,X:

$$\theta^{MAP} = \arg\max_\theta \, l_{C,X} \cdot p(\theta) \quad [2.2.3.4]$$
Sometimes, there is no a priori knowledge about parameters, before data is available.
Equivalently, the parameter prior, p(θ), can be considered to be a uniform probability distribution,
which means that all possible values are equally likely. In this case, the parameter prior can be
dropped out of maximization. The maximum a posteriori becomes the maximum likelihood (ML)
estimate, θ^ML. θ^ML maximizes only the likelihood term (either the incomplete or the complete
data likelihood):

$$\theta^{ML} = \arg\max_\theta \, l_{IC} \quad [2.2.3.5]$$
$$\theta^{ML} = \arg\max_\theta \, l_{C,X} \quad [2.2.3.6]$$
Clearly, the likelihood term in expression [2.2.3.2] depends on the amount of data. When a large
amount of data is available then the likelihood term dominates the maximization and the
parameter prior becomes irrelevant. When data is scarce, however, the prior might determine a
smoother, better estimate.
In the field of single channel analysis, the ML parameter estimate is almost always
sought, due to the lack of a parameter prior. Recent work, however, shows that MAP estimation is
possible, although at great computational cost (Gauvain and Lee, 1992; Ball et al, 1999; De Gunst
et al, 2001; Rosales et al, 2001).
2.3 Basic algorithms
In this section I review some of the standard algorithms for likelihood calculation,
parameter estimation and sequence estimation. Likelihood calculation is a necessary prerequisite
for parameter estimation and optimal sequence finding.
For hidden Markov models, the Forward-Backward algorithm calculates the incomplete
data likelihood and Viterbi calculates the complete data likelihood. Implicitly, both Forward-Backward
and Viterbi find the most likely sequence of states, although differently defined. For
LGMs, the Kalman filter-smoother calculates the complete data likelihood and finds the most
likely sequence of states (probability densities).
The Expectation-Maximization (EM) algorithm finds a maximum a posteriori or maximum
likelihood parameter estimate from incomplete data. The Baum-Welch algorithm is a particular
form of EM for maximum likelihood parameter estimation in HMMs. There are other variants of
EM, such as the generalized and the stochastic EM, as well as the Segmental k-means algorithm
for HMMs. General numerical optimization methods are an alternative to EM.
In the next section I discuss only hidden Markov models with continuous emission
probabilities, as they alone are relevant for the present work.
2.3.1. Likelihood calculation
Since hidden Markov models have discrete states, both the incomplete and the complete
data likelihood can be calculated. For linear Gaussian models, with continuous states, only
calculation of the complete data likelihood is meaningful.
2.3.1.1 Forward-Backward procedure for HMM
In order to calculate the incomplete data likelihood, l_IC, the complete data likelihood
must be computed for every possible state sequence:

$$l_{IC} = \sum_X p(\mathbf{Y} \mid \mathbf{X}, \theta) \cdot p(\mathbf{X} \mid \theta) \quad [2.3.1.1.1]$$
The total number of possible sequences grows exponentially with the sequence length.
For a model with N_S states and a sequence of length T, the number of combinations is N_S^T.
Calculating the likelihood by iterating all possible sequences is generally impossible. Fortunately,
the Forward-Backward procedure (Baum et al, 1970), a dynamic programming algorithm,
calculates the likelihood with a computational cost that grows linearly with the sequence length.
Strictly speaking, only the forward part is necessary for likelihood calculation. I will also describe
the backward part since it is used in Baum-Welch (Baum et al, 1970) for parameter estimation.
For a given model M and parameter estimate θ, the Forward-Backward procedure
involves the computation of two recursive variables, termed α and β, at each time t, for each state
i:

$$\alpha_t^i = p(y_0, y_1, \ldots, y_t, x_t = i \mid \theta) \quad [2.3.1.1.2]$$
$$\beta_t^i = p(y_{t+1}, y_{t+2}, \ldots, y_T \mid x_t = i, \theta) \quad [2.3.1.1.3]$$

α_t^i is the conditional joint probability of observing the data sequence up to time t and of the
process being in state i at time t, given parameters θ. β_t^i is the conditional probability of observing
the data sequence from time t+1 to T, given the process is in state i at time t and the parameters θ.
The recursion, forward for α and backward for β, is written and initialized as:

$$\alpha_{t+1}^j = \left(\sum_{i=1}^{N_S} \alpha_t^i \cdot a_{ij}\right) \cdot b_j(y_{t+1}) \quad [2.3.1.1.4]$$
$$\beta_t^i = \sum_{j=1}^{N_S} a_{ij} \cdot b_j(y_{t+1}) \cdot \beta_{t+1}^j \quad [2.3.1.1.5]$$
$$\alpha_0^i = \pi_i \cdot b_i(y_0) \quad [2.3.1.1.6]$$
$$\beta_T^i = 1 \quad [2.3.1.1.7]$$
The reader is reminded that a_ij is an element of the transition probability matrix A; b_j(·) is
an element of the emission probability vector B; π_i is an element of the initial probability vector
π.
The incomplete data likelihood can be obtained from α, from β, or from both:

$$l_{IC} = \sum_{i=1}^{N_S} \alpha_T^i \quad [2.3.1.1.8]$$
$$l_{IC} = \sum_{i=1}^{N_S} \pi_i \cdot b_i(y_0) \cdot \beta_0^i \quad [2.3.1.1.9]$$
$$l_{IC} = \sum_{i=1}^{N_S} \alpha_t^i \cdot \beta_t^i \quad [2.3.1.1.10]$$
Other quantities calculated by the Forward-Backward procedure are:

$$\gamma_t^i = p(x_t = i \mid y_0, y_1, \ldots, y_T, \theta) \quad [2.3.1.1.11]$$
$$\xi_t^{ij} = p(x_t = i, x_{t+1} = j \mid y_0, y_1, \ldots, y_T, \theta) \quad [2.3.1.1.12]$$

γ_t^i is the conditional probability that the process is in state i at time t, given the whole observation
sequence Y and parameters θ. ξ_t^{ij} is the conditional joint probability that the process is in state i at
time t, and in state j at time t+1, given the whole observation sequence Y and the parameters θ. γ
and ξ are calculated as:

$$\gamma_t^i = \frac{\alpha_t^i \cdot \beta_t^i}{\sum_{j=1}^{N_S} \alpha_t^j \cdot \beta_t^j} \quad [2.3.1.1.13]$$
$$\xi_t^{ij} = \frac{\alpha_t^i \cdot a_{ij} \cdot b_j(y_{t+1}) \cdot \beta_{t+1}^j}{\sum_{k=1}^{N_S} \sum_{l=1}^{N_S} \alpha_t^k \cdot a_{kl} \cdot b_l(y_{t+1}) \cdot \beta_{t+1}^l} \quad [2.3.1.1.14]$$
The Forward-Backward procedure is, in fact, an optimal filter-smoother algorithm for
discrete time dynamic models with discrete states. The Forward part obtains a filtered state
estimate, while the Backward part refines it into a smoothed state estimate, given the noisy data
and the model parameters.
A way of calculating α and β in matrix form is given in Chapter 3, along with
computational details and ways to prevent numerical over- or underflow. The reason behind the
choice of initializing β will also become apparent in Chapter 3.
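The recursions [2.3.1.1.4]-[2.3.1.1.7] and the smoothed probabilities [2.3.1.1.13] can be sketched with per-step scaling, one common way to avoid the underflow mentioned above; the function name and parameterization are illustrative (the matrix formulation used in this work appears in Chapter 3):

```python
import numpy as np

def forward_backward(y, pi, A, mu, sigma):
    """Scaled Forward-Backward pass. Returns the smoothed state
    probabilities gamma (eq. 2.3.1.1.13) and the incomplete data
    log-likelihood, log l_IC."""
    T, n = len(y), len(pi)
    # Gaussian emission probabilities b_i(y_t), eq. 2.1.1.10
    b = np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    alpha = np.empty((T, n)); beta = np.empty((T, n)); c = np.empty(T)
    alpha[0] = pi * b[0]                      # eq. 2.3.1.1.6
    c[0] = alpha[0].sum(); alpha[0] /= c[0]   # rescale to prevent underflow
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]  # eq. 2.3.1.1.4
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0                            # eq. 2.3.1.1.7
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (b[t + 1] * beta[t + 1])) / c[t + 1]  # eq. 2.3.1.1.5
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True) # eq. 2.3.1.1.13
    return gamma, np.sum(np.log(c))           # log of eq. 2.3.1.1.8
```

Because each α_t is renormalized by its sum c_t, the incomplete data likelihood is recovered as the product of the scale factors, i.e. log l_IC = Σ_t log c_t.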
2.3.2. Most likely sequence of states
For a hidden Markov model, the most likely sequence of states can be obtained in two
ways: with the Forward-Backward procedure or with the Viterbi algorithm. Forward-Backward
gives a most likely filtered or smoothed state at time t, as, respectively:

$$x_t^{ML} = \arg\max_i \, \alpha_t^i \quad [2.3.2.1]$$
$$x_t^{ML} = \arg\max_i \, \gamma_t^i \quad [2.3.2.2]$$
For aggregated HMMs, states in the same aggregation class have the same emission
probabilities. It is therefore realistic to find the most likely sequence of classes. It is interesting to
observe that, for ion channels, the optimal sequence is not necessarily limited to kinetic states or
to conductance classes. For example, in data from acetylcholine receptors, one might want to find
an optimal sequence of clusters of activity, separated by desensitized intervals.
2.3.2.1 Viterbi algorithm for HMM
One way to find the most likely state sequence is to calculate the complete data
likelihood for each possible sequence and to choose the maximum:
XML  arg max  pY | X, θ   pX | θ 
[2.3.2.1.1]
X
As indicated previously, this approach is unrealistic. Fortunately, the Viterbi dynamic
programming algorithm (Viterbi, 1967; Forney, 1973) finds X^ML in a much more efficient way,
with the computational cost a linear function of the sequence length. Like the Forward part of the
Forward-Backward procedure, Viterbi involves a recursive calculation of the likelihood. The
difference is that, whereas Forward calculates α_{t+1}^j as a sum of previous α's multiplied by the
corresponding transition and emission probabilities, Viterbi replaces the sum with a maximum:

$$\alpha_{t+1}^j = \max_i \left(\alpha_t^i \cdot a_{ij}\right) \cdot b_j(y_{t+1}) \quad [2.3.2.1.2]$$
It can be shown that δ_{t+1}^j is equal to the complete data likelihood for the state sequence that
terminates in state j at t+1 (Viterbi, 1967; Forney, 1973). Furthermore, this sequence has the
highest complete data likelihood of all possible sequences terminating in state j at t+1. To store
this optimal sequence, for each t and each j, an additional variable ψ records the most likely
previous state:
 t j 1  arg max  ti  aij   b j  yt 1 
i


The recursion is initiated at time zero as:
43
[2.3.2.1.3]
 oj   j  b j  y0 
[2.3.2.1.4]
From time T, XML can be reconstructed backwards. The optimal state for the last data
point is chosen as that which terminates the sequence with the highest score. The second to the
last optimal state is retrieved from the  variable, and so on, down to the first data point:
x_T = arg max_j δ_T^j    [2.3.2.1.5]

x_t = ψ_{t+1}^{x_{t+1}}    [2.3.2.1.6]
It is important to recognize that Viterbi is, in a sense, a filter. The back-tracing step must
not be mistaken for a second, smoothing pass. The back trace makes no statistical decision.
Viterbi is equivalent to the Forward calculation in that the inference for the state at time t uses
only the data points up to and including time t. In comparison, Forward-Backward infers the state
at time t from all the data available, albeit in two passes. Naturally, the state inference of a
smoothing algorithm is more accurate, as will become obvious from the results section in
Chapter 4. A filter has its advantages, though: it is a single pass and thus approximately twice as
fast, it is simpler to code, and it is suited for online applications.
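The recursion [2.3.2.1.2]-[2.3.2.1.6] can be sketched as follows, in log space to avoid numerical underflow. The Gaussian emission model matches the rest of this work; the NumPy formulation and the example parameters are merely one illustrative implementation, not the thesis code:

```python
import numpy as np

def viterbi(y, pi, A, mu, sigma):
    """Most likely state sequence for an HMM with Gaussian emissions.

    Implements eqs. [2.3.2.1.2]-[2.3.2.1.6] in log space to avoid underflow.
    y: observations (T,); pi: initial probs (N,); A: transition probs (N, N);
    mu, sigma: per-state emission mean and standard deviation (N,).
    """
    T, N = len(y), len(pi)
    # log emission densities b_j(y_t), evaluated for every t and j
    logb = (-0.5 * ((y[:, None] - mu) / sigma) ** 2
            - np.log(sigma * np.sqrt(2 * np.pi)))
    logA = np.log(A)
    delta = np.empty((T, N))           # log delta_t^j
    psi = np.zeros((T, N), dtype=int)  # backpointers psi_t^j
    delta[0] = np.log(pi) + logb[0]                        # [2.3.2.1.4]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA              # delta_t^i * a_ij
        psi[t] = np.argmax(scores, axis=0)                 # [2.3.2.1.3]
        delta[t] = scores[psi[t], np.arange(N)] + logb[t]  # [2.3.2.1.2]
    x = np.empty(T, dtype=int)
    x[-1] = np.argmax(delta[-1])                           # [2.3.2.1.5]
    for t in range(T - 2, -1, -1):     # back trace: no statistical decision,
        x[t] = psi[t + 1, x[t + 1]]    # just bookkeeping   [2.3.2.1.6]
    return x
```

With a well-separated two-state channel, the back trace simply follows each point to the nearest conductance level, modulo the small transition penalty.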
2.3.2.2 Forward-Viterbi algorithm for aggregated HMM
Viterbi is a suboptimal algorithm for aggregated models, because states in the same
aggregation class cannot be distinguished based on observation probabilities. I very briefly
present here a new method, which I call Forward-Viterbi, for estimation of the most likely
sequence of classes. The algorithm is a simple combination of the Forward procedure with the
Viterbi algorithm. A detailed description is therefore not necessary. The Viterbi part finds a most
likely sequence of classes but does not choose a particular state within the class. Instead, the state
probabilities inside a class are advanced forward in time by the Forward part. The obtained
sequence is marginalized with respect to states but not with respect to classes. The
class sequence maximizes the class-complete, state-incomplete data likelihood.
For ion channels, the aggregation could be based on conductance classes but any other
arbitrary state partitioning may be used. The method could be applied, for example, to
identification of clusters of activity in acetylcholine receptor data.
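As an illustrative sketch of this combination (one plausible reading of the scheme, with a hypothetical Gaussian-emission model, not a reference implementation): between classes the recursion takes a maximum, as in Viterbi, while within each candidate class the state probabilities are propagated by summation, as in Forward:

```python
import numpy as np

def forward_viterbi(y, pi, A, mu, sigma, cls):
    """Sketch: most likely class sequence for an aggregated HMM.

    Between classes the recursion takes a maximum (Viterbi); within a
    candidate class the state probabilities are propagated by summation
    (Forward). cls maps each state to its aggregation class.
    """
    T = len(y)
    classes = list(np.unique(cls))
    b = (np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2)
         / (sigma * np.sqrt(2 * np.pi)))
    score, alpha, psi = {}, {}, []
    for c in classes:                        # initialization at t = 0
        v = np.where(cls == c, pi * b[0], 0.0)
        s = v.sum()
        score[c], alpha[c] = np.log(s), v / s
    for t in range(1, T):
        new_score, new_alpha, back = {}, {}, {}
        for c in classes:                    # Viterbi maximum over classes
            best, best_v, best_c = -np.inf, None, None
            for cp in classes:               # Forward sum within class c
                v = np.where(cls == c, alpha[cp] @ A, 0.0) * b[t]
                s = v.sum()
                cand = score[cp] + np.log(s)
                if cand > best:
                    best, best_v, best_c = cand, v / s, cp
            new_score[c], new_alpha[c], back[c] = best, best_v, best_c
        score, alpha = new_score, new_alpha
        psi.append(back)
    seq = [max(classes, key=lambda c: score[c])]
    for back in reversed(psi):               # back trace over classes
        seq.append(back[seq[-1]])
    return seq[::-1]
```

When every class contains exactly one state, the within-class sums are trivial and the procedure reduces to the standard Viterbi algorithm.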
2.3.2.3 Kalman filter-smoother for LGM
The complete data likelihood of a linear Gaussian model is:
l(C, X) = N(C·x_0, R)|_{y_0} · N(x_0, V_0)|_{x_0} · ∏_{t=1}^{T} N(C·x_t, R)|_{y_t} · N(A·x_{t-1}, Q)|_{x_t}    [2.3.2.3.1]
The Kalman filter-smoother (Kalman, 1960; Kalman and Bucy, 1961; Rauch, 1963;
Rauch et al, 1965) finds the sequence of states which maximizes [2.3.2.3.1]. Since the states of an
LGM take continuous values, the optimal sequence is by necessity obtained as a sequence of
Gaussian probability densities, each of mean xt and variance Vt. In other words, the state at any
time t is not known exactly but it is believed to be located at xt. The stronger the belief, the lower
the variance Vt and vice versa. Therefore, the optimal sequence maximizes the belief in the state
values, or, equivalently, minimizes the mean square error. Expression [2.3.2.3.1] can be written
as:
l(C, X) = [∏_{t=0}^{T} p(y_t | x_t)] · [p(x_0) · ∏_{t=1}^{T} p(x_t | x_{t-1})]    [2.3.2.3.2]

l(C, X) = p(Y | X) · p(X)    [2.3.2.3.3]
p(Y | X) alone in expression [2.3.2.3.3] is maximized by a state sequence where y_t = C·x_t, for any t.
In contrast, p(X) alone is maximized by a state sequence where x_t = x_0, for any t. Obviously, the
optimal sequence must balance the two terms, p(Y | X) and p(X).
The Kalman filter-smoother is similar to the Forward-Backward procedure. The filter
step calculates the probability distribution of a state given past data – the equivalent of the α's in
Forward-Backward. The smoother step calculates the probability distribution of a state given the
whole data – the equivalent of the γ's. The result is a minimum mean square error (MMSE) estimate
of the state sequence.
The filter step calculates in the forward direction the following recursive quantities
(Kalman, 1960; Kalman and Bucy, 1961):
x̃_t = A · x̂_{t-1}    (t ≥ 1)    [2.3.2.3.4]

Ṽ_t = A · V̂_{t-1} · A^T + Q    (t ≥ 1)    [2.3.2.3.5]

K_t = Ṽ_t · C^T · (C · Ṽ_t · C^T + R)^{-1}    [2.3.2.3.6]

x̂_t = x̃_t + K_t · (y_t − C · x̃_t)    [2.3.2.3.7]

V̂_t = Ṽ_t − K_t · C · Ṽ_t    [2.3.2.3.8]
K_t is the so-called Kalman gain matrix. x̂_t and V̂_t are the filtered estimates of the state's mean
and covariance.
The smoother step calculates in the backward direction the following recursive quantities
(Rauch, 1963; Rauch et al, 1965):
J_{t-1} = V̂_{t-1} · A^T · Ṽ_t^{-1}    [2.3.2.3.9]

x_{t-1} = x̂_{t-1} + J_{t-1} · (x_t − A · x̂_{t-1})    [2.3.2.3.10]

V_{t-1} = V̂_{t-1} + J_{t-1} · (V_t − Ṽ_t) · J_{t-1}^T    [2.3.2.3.11]

x_T = x̂_T    [2.3.2.3.12]

V_T = V̂_T    [2.3.2.3.13]
xt and Vt are the smoothed estimates of the state’s mean and covariance.
The application of the Kalman algorithm requires strict conformity with the
linear Gaussian model. Any violation of the LGM assumptions (e.g. non-Gaussian transition or
emission probabilities, non-linear process or measurement equations) requires either an exact
algorithm for an approximate and tractable model (for example, the extended Kalman filter
(Balakrishnan, 1984; Sorensen, 1985)) or an approximate and tractable algorithm for an exact
model (for example, Monte Carlo filters (Carter and Kohn, 1996)).
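As a minimal sketch, the filter-smoother equations [2.3.2.3.4]-[2.3.2.3.13] translate to NumPy as follows; the convention of taking the prior (x_0, V_0) as the prediction for the first point is an assumption of this sketch:

```python
import numpy as np

def kalman_smoother(y, A, C, Q, R, x0, V0):
    """Kalman filter-smoother for the LGM, eqs. [2.3.2.3.4]-[2.3.2.3.13].

    y: (T, p) observations; A, Q: (n, n); C: (p, n); R: (p, p);
    x0, V0: prior mean and covariance of the initial state.
    Returns the smoothed (MMSE) state means and covariances.
    """
    T, n = len(y), len(x0)
    xp, Vp = np.zeros((T, n)), np.zeros((T, n, n))   # predicted (x~, V~)
    xf, Vf = np.zeros((T, n)), np.zeros((T, n, n))   # filtered  (x^, V^)
    for t in range(T):
        if t == 0:
            xp[0], Vp[0] = x0, V0                    # prior on the first state
        else:
            xp[t] = A @ xf[t - 1]                    # [2.3.2.3.4]
            Vp[t] = A @ Vf[t - 1] @ A.T + Q          # [2.3.2.3.5]
        K = Vp[t] @ C.T @ np.linalg.inv(C @ Vp[t] @ C.T + R)  # [2.3.2.3.6]
        xf[t] = xp[t] + K @ (y[t] - C @ xp[t])       # [2.3.2.3.7]
        Vf[t] = Vp[t] - K @ C @ Vp[t]                # [2.3.2.3.8]
    xs, Vs = xf.copy(), Vf.copy()                    # [2.3.2.3.12]-[2.3.2.3.13]
    for t in range(T - 1, 0, -1):                    # backward smoother pass
        J = Vf[t - 1] @ A.T @ np.linalg.inv(Vp[t])   # [2.3.2.3.9]
        xs[t - 1] = xf[t - 1] + J @ (xs[t] - A @ xf[t - 1])   # [2.3.2.3.10]
        Vs[t - 1] = Vf[t - 1] + J @ (Vs[t] - Vp[t]) @ J.T     # [2.3.2.3.11]
    return xs, Vs
```

For a scalar random walk observed in noise, the smoothed means lie between the prior mean and the observations, with smaller variance than the filtered estimates.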
2.3.3 Parameter estimation
In this section I review the most common methods of parameter estimation in discrete
time dynamic stochastic models, specifically for HMMs. As I have mentioned earlier, the state
sequence X is not directly observed – it is hidden. Instead, a many-to-one, non-invertible
transformation of X is measured, i.e. the observation sequence Y. Naturally, not knowing the
state makes parameter estimation more difficult. If X were known, a maximum likelihood (ML)
or maximum a posteriori (MAP) parameter estimate of the complete data (Y and X) would be obtained
by maximizing one of the following expressions:
θ MAP  arg max  pY | X , θ   pX | θ   pθ 
[2.3.3.1]
θ ML  arg max  pY | X , θ   pX | θ 
[2.3.3.2]


For HMMs, the ML estimate, at least, has analytical solutions for the set of parameters
Θ = {π, A, B}. The analytical solvability of the MAP estimate depends on the particular prior
used. It is clear, though, that only the transition probabilities, and not the transition rates, can be
obtained this way.
The state sequence cannot be known exactly, because it is usually buried in noise and/or because
the model has aggregated states. Thus, even in the absence of noise, only the aggregation class
sequence can be known exactly; the state sequence itself must be inferred statistically.
Consequently, parameter estimates must be obtained by maximizing the incomplete data
likelihood:
θ^MAP = arg max_θ { [Σ_X p(Y | X, θ) · p(X | θ)] · p(θ) }    [2.3.3.3]

θ^ML = arg max_θ [Σ_X p(Y | X, θ) · p(X | θ)]    [2.3.3.4]
Notice the vicious circle: in order to infer the state sequence X, the parameters θ must be
known. Vice versa, in order to find the parameters θ, the state sequence X must be known. This
circular reasoning can be resolved with an iterative procedure which alternates two steps. First, an
initial set of parameters θ0 is chosen. Then a state inference step finds the most likely state
sequence Xi corresponding to the current parameters θi. A maximization step follows, which re-estimates
the parameters θi+1 from the current state sequence Xi. The iteration of inference/re-estimation
steps terminates when some convergence criterion is reached (although convergence is
not necessarily guaranteed). A simple pseudo-code description is:
initialize:     θ0, i := 0
repeat
    inference:      Xi ← θi
    re-estimation:  θi+1 ← Xi
    i := i + 1
until convergence test
In the next sections I present the most common implementations and the theoretical
ground of this iterative scheme.
2.3.3.1 Expectation-Maximization
The problem of parameter estimation from incomplete data was formalized by Dempster
et al (1977). The solution is the general likelihood maximization method known as Expectation-Maximization²
(Dempster et al, 1977; Little and Rubin, 1987; McLachlan and Krishnan, 1997).
Thus, parameter estimates can be obtained by maximizing the conditional expectation of the log-likelihood
of the complete data, given the incomplete data and a current parameter estimate θ'.
This expectation is termed Q(θ, θ'), and is defined as:

² Also known as Estimate-Maximize
Q(θ, θ') = E[ln p(X, Y | θ) | Y, θ'] = ∫_X ln p(x, Y | θ) · p(x | Y, θ') dx    [2.3.3.1.1]
The ML and MAP parameter estimates can be calculated as:
θ ML  arg max Q θ , θ' 
[2.3.3.1.2]
θ MAP  arg max Q θ , θ'   ln pθ 
[2.3.3.1.3]


EM consists of an iterative sequence of two steps: i) calculate the expectation Q(θ, θ') for
a given parameter estimate and ii) find a new parameter estimate θ which maximizes the
expectation (or its sum with the logarithm of the parameter prior). A brief description of the
algorithm is:
initialize:     θ0, i := 0
repeat
    expectation:    calculate Q(θ, θi)
    maximization:   θi+1 := arg max_θ [Q(θ, θi)]
    i := i + 1
until convergence test
EM should be applied to those estimation problems in which the likelihood function is
difficult or impossible to maximize or to differentiate. The observed data in these problems is
considered a many-to-one function of a random unobserved variable, the knowledge of which
would make finding maximum likelihood or maximum a posteriori parameter estimates
straightforward.
The algorithm is proven to converge to a local maximum of the likelihood function
(Dempster et al, 1977; Wu, 1983). The main drawback of this powerful, robust algorithm is its
slow rate near the convergence point, although several ways of improving convergence speed
have been proposed (Meilijson, 1989; Aitkin and Aitkin, 1996; Jamshidian and Jennrich, 1993;
Liu and Rubin, 1994; Lange, 1995; Liu et al, 1998). Additionally, sometimes the maximization of
Q(θ, θ') has no analytical solution or is difficult to compute. In this case, any increase in Q(θ, θ'),
instead of its maximization, still guarantees convergence to a solution, even though not as
efficiently. This procedure is known as the Generalized Expectation-Maximization algorithm
(Dempster et al 1977; Fessler and Hero 1994). When the expectation Q(θ, θ') cannot be
calculated analytically, it is replaced with a stochastic, approximate solution. This method is
known as the Stochastic Expectation-Maximization algorithm (see Marschner, 2001, and
references therein).
2.3.3.2 Baum-Welch algorithm for HMM
The Baum-Welch algorithm (Baum et al, 1970) performs maximum likelihood parameter
estimation in HMMs (maximum a posteriori variants are also known (Gauvain and Lee, 1992)).
Baum-Welch is a particular instance of EM, although it was developed before EM. The
expectation step of Baum-Welch uses the Forward-Backward procedure for state inference. The
re-estimation formulae obtain a new parameter estimate in the maximization step.
The Q function for HMMs replaces the integral over X with a sum:
Qθ , θ'    ln px , Y | θ  px | Y , θ' 
X
51
[2.3.3.2.1]
The joint probability of the data and the state sequence, p(x, Y | θ), is a product of three terms (see [2.2.1.14]):
p(x, Y | θ) = π_{x_0} · [∏_{t=1}^{T} a_{x_{t-1} x_t}] · [∏_{t=0}^{T} b_{x_t}(y_t)]    [2.3.3.2.2]
Thus Q is a sum of three independent terms:
Q(θ, θ') = Σ_x [ln π_{x_0}] · p(x | Y, θ')
         + Σ_x [Σ_{t=1}^{T} ln a_{x_{t-1} x_t}] · p(x | Y, θ')
         + Σ_x [Σ_{t=0}^{T} ln b_{x_t}(y_t)] · p(x | Y, θ')    [2.3.3.2.3]
The Baum-Welch re-estimation formulae find a new set of parameters in terms of the
variables calculated by the Forward-Backward procedure (for derivation, see Baum et al, 1970;
Rabiner et al, 1986). Each of the three terms in expression [2.3.3.2.3] is maximized separately, by
setting its derivative with respect to π, A and B, respectively, equal to zero. p(x | Y, θ') is not a
function of θ and can be dropped from the maximization. The constraints in equations [2.1.1.3],
[2.1.1.4], [2.1.1.6] and [2.1.1.7] are imposed with Lagrange multipliers. The results are:
 i   0i
[2.3.3.2.4]
52
T
aij 

ij
t

i
t
t 1
T
t 1
[2.3.3.2.5]
T
i 
y
t 0
t
 t
[2.3.3.2.6]
T

t 0
t
T
 i2 
   y
t 0
t
 i 
2
t
[2.3.3.2.7]
T

t 0
t
Baum-Welch estimates transition probabilities (the A matrix) rather than transition rates
(the Q matrix). Recently, the EM algorithm was modified to give maximum likelihood estimates
of the Q matrix directly (Michalek and Timmer 1999). Despite this limitation, Baum-Welch
remains the best choice for standard HMM parameter estimation and for state inference.
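A minimal sketch of one Baum-Welch iteration, with a scaled Forward-Backward pass for the expectation step and the re-estimation formulae [2.3.3.2.4]-[2.3.3.2.7] for the maximization step. The Gaussian emission model follows the rest of the chapter; the particular scaling convention is one common choice, not the thesis code:

```python
import numpy as np

def baum_welch_step(y, pi, A, mu, sigma):
    """One Baum-Welch iteration for a Gaussian-emission HMM.

    Returns updated (pi, A, mu, sigma) and the log-likelihood
    of the data under the *current* (input) parameters.
    """
    T, N = len(y), len(pi)
    b = (np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2)
         / (sigma * np.sqrt(2 * np.pi)))
    alpha, beta = np.zeros((T, N)), np.zeros((T, N))
    s = np.zeros(T)                        # per-point scaling factors
    alpha[0] = pi * b[0]
    s[0] = alpha[0].sum(); alpha[0] /= s[0]
    for t in range(1, T):                  # scaled forward pass
        alpha[t] = (alpha[t - 1] @ A) * b[t]
        s[t] = alpha[t].sum(); alpha[t] /= s[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):         # scaled backward pass
        beta[t] = (A * b[t + 1]) @ beta[t + 1] / s[t + 1]
    gamma = alpha * beta                   # state posteriors gamma_t^i
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi_t^{ij}: joint posterior of (x_{t} = i, x_{t+1} = j)
    xi = (alpha[:-1, :, None] * A[None] * (b[1:] * beta[1:])[:, None, :]
          / s[1:, None, None])
    new_pi = gamma[0]                                          # [2.3.3.2.4]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # [2.3.3.2.5]
    w = gamma.sum(axis=0)
    new_mu = gamma.T @ y / w                                   # [2.3.3.2.6]
    new_sigma = np.sqrt(
        (gamma * (y[:, None] - new_mu) ** 2).sum(axis=0) / w)  # [2.3.3.2.7]
    return new_pi, new_A, new_mu, new_sigma, np.log(s).sum()
```

Iterating this step illustrates the EM monotonicity property: the log-likelihood is non-decreasing from one iteration to the next.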
2.3.3.3 Segmental k-means algorithm for HMM
The segmental k-means algorithm (Rabiner et al, 1986) for HMMs finds the complete
data maximum likelihood, or maximum a posteriori parameter estimates:
θ MAP  arg max max  pY | x , θ   px | θ   pθ 
[2.3.3.3.1]
θ ML  arg max max  pY | x , θ   px | θ 
[2.3.3.3.2]


x
x
53
Not unlike EM, SKM has an expectation step which uses Viterbi to find the most likely
sequence of states given the data and a parameter estimate. A maximization step re-estimates the
parameters from the data and the sequence found by Viterbi. The re-estimation formulae are
similar to the ones used in Baum-Welch:
 i  I  x0  i 
[2.3.3.3.3]
T
aij 
 I x
t 1
 i   I  xt  j 
t 1
T
 I x
t 1
T
i 
 y  I x
t
t 0
T
 I x
t 0
T
 i2 
 I x
t 0
t
t
t 1
[2.3.3.3.4]
 i
 i
[2.3.3.3.5]
 i
 i    yt   i 
2
t
T
 I x
t 0
t
 i
[2.3.3.3.6]
I(·) is the indicator function, which returns one if the argument is true and zero if the
argument is false. Of course, the re-estimation formulae above can be implemented much more
efficiently in practice: instead of the indicator function, one can simply sum over the appropriate
terms only. Like Baum-Welch, SKM estimates the transition probabilities but not the transition
rates. Somewhat simpler than Baum-Welch, SKM should be used mainly in situations where the
complete data likelihood is strongly peaked for one sequence of states, or aggregation classes,
relative to other sequences. Such a situation may arise, for example, when the signal-to-noise ratio
is good, resulting in a predominant role for the emission probabilities. When accurate estimates
are desired or when the signal-to-noise ratio is low, Baum-Welch is the best choice.
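The maximization step of SKM reduces to counting. A sketch of the re-estimation formulae [2.3.3.3.3]-[2.3.3.3.6], summing over the appropriate terms instead of evaluating the indicator function, and assuming every state is visited at least once in the idealized sequence x:

```python
import numpy as np

def skm_reestimate(y, x, N):
    """SKM maximization step: re-estimate HMM parameters from data y and
    an idealized (Viterbi) state sequence x over N states.
    Implements eqs. [2.3.3.3.3]-[2.3.3.3.6] by direct counting."""
    pi = np.zeros(N); pi[x[0]] = 1.0                     # [2.3.3.3.3]
    counts = np.zeros((N, N))
    np.add.at(counts, (x[:-1], x[1:]), 1)                # transition counts
    A = counts / counts.sum(axis=1, keepdims=True)       # [2.3.3.3.4]
    mu = np.array([y[x == i].mean() for i in range(N)])  # [2.3.3.3.5]
    sigma = np.array([np.sqrt(((y[x == i] - mu[i]) ** 2).mean())
                      for i in range(N)])                # [2.3.3.3.6]
    return pi, A, mu, sigma
```

Each quantity is a simple sample statistic restricted to the points assigned to state i, which is why SKM is fast when the idealization is reliable.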
2.3.3.4 Expectation-maximization algorithm for linear Gaussian models
An Expectation-Maximization algorithm for linear Gaussian models exists. Since I do not
use it in this work, I refer the interested reader to (Roweis and Ghahramani, 1999, and references
therein). It suffices to say that the maximum likelihood parameters of an LGM can be estimated
very conveniently in a fashion similar to the way Baum-Welch algorithm does it for the hidden
Markov model.
3. A maximum likelihood method for kinetic parameter estimation from
single channel data
Kinetic parameters can be estimated from single channel data with Markov model-based
algorithms, using continuous time or discrete time hidden Markov models. While the physical
process, i.e. the ion channel molecule, evolves continuously in time, only a discrete temporal
representation of it is available. The experimental signal is sampled at regular intervals, digitized
by an analog-to-digital converter and eventually processed with a computer. Hence, in order to
use a continuous time Markov model, the continuous signal must be somehow reconstructed from
discrete and noisy data. In comparison, discrete time Markov models work directly with the raw
data.
Continuous time methods have two significant advantages over discrete time methods.
First, transition rates (and not transition probabilities) are estimated directly from a
sequence of kinetic state or conductance class visits (dwells). Second, continuous time
methods are very fast, because their computational complexity scales linearly with
the number of dwells, which is generally much smaller than the number of data points for a given
data set.
Despite these advantages, the use of continuous time methods is complicated by several
problems. First, the algorithms presented in Chapter 2 reconstruct the Markov signal from
noise, but only at the sampling points. Due to the exponential distribution of the state lifetime,
multiple transitions may occur, probabilistically, between two sampling points but remain
unobserved (Colquhoun and Hawkes, 1995). The reconstitution of a continuous dwell sequence
from a discrete time Markov signal is thus affected by the missed events problem. Clearly,
whatever happens between two sampling points is unknown but nevertheless can be inferred
statistically. Thus, several methods of correction for missed events exist (Roux and Sauve, 1985;
Blatz and Magleby, 1986; Ball and Sansom, 1988; Crouzy and Sigworth, 1990; Ball et al, 1993;
Colquhoun et al, 1996; Qin et al, 1996). The exact correction (Colquhoun et al, 1996) is
computationally expensive, whereas the first-order correction (Qin et al, 1996) is efficient and
appears adequate for analysis of even fast kinetics.
Second, continuous time methods require the discrete time Markov signal to be
extracted from noisy data, which is not always possible. In addition to the simple amplitude
threshold method (Sachs et al, 1982), more sophisticated procedures for signal extraction are
available, such as the Hinkley detector (Draber and Schultze, 1994) and the hidden Markov
model algorithms described in Chapter 2. Essentially, each data point must be classified into one
kinetic state, or, if the model has aggregated states (with identical measurement characteristics),
into one conductance class. Obviously, in the total absence of noise, any data point can be
assigned with probability one to the correct conductance class. More realistically, in the presence
of moderate noise, most data points can still be classified correctly, with high probability. This
implies that the incomplete data likelihood function [2.2.1.7] has a (much) higher value for the
correct sequence of conductance classes, C*, mostly due to the measurement term p(Y | X, θ):
pY | X, θ  PX | θ |XC*  pY | X , θ  PX | θ |XC*
[3.1]
pY | X, θ |XC*  pY | X, θ |XC*  0
[3.2]
Therefore, maximizing the incomplete data likelihood [2.2.1.5] can be limited to those state
sequences in the class sequence C*, since all others have a negligible contribution, regardless of the
value of the parameter θ. Moreover, only the P(X | θ) term needs to be maximized.
In contrast, when the signal-to-noise ratio is low, relations [3.1] and [3.2] no longer hold,
and the incomplete data likelihood is flat with respect to the class sequence. Hence, in order to get
unbiased kinetic estimates, all possible sequences must be considered. Therefore, continuous time
methods cannot be used, since, by definition, they work with a single conductance class sequence.
Hence, discrete time methods are the only possible choice.
One benefit of a discrete time method is that missed event corrections are not necessary,
since no assumption is made about what happens between recorded samples. The Baum-Welch
algorithm (Baum et al, 1970), which is a particular instance of the more general Expectation-Maximization
(Dempster et al, 1977), obtains maximum likelihood parameters from raw data,
using hidden Markov models. While Baum-Welch estimates transition probabilities, it can be
modified to obtain transition rates directly (Michalek and Timmer, 1999). EM-based methods are
accurate and robust but have a slow convergence rate (Dempster et al, 1977; Wu, 1983). The
most effective means of acceleration are gradient-based (Jamshidian and Jennrich, 1993; Aitkin
and Aitkin, 1996; Lange, 1995), but they lose some of the valuable EM properties, such as
simplicity and robustness.
MPL (Maximum Point Likelihood) is a fast maximum likelihood algorithm (Qin et al,
2000a), which estimates transition rates directly from raw data, using discrete time hidden
Markov models. The incomplete data likelihood is calculated with the Forward-Backward
procedure, and it is maximized numerically with a quasi-Newton method, which provides faster,
quadratic convergence. As in the MIL (Maximum Idealized Likelihood) algorithm (Qin et al,
1996), the likelihood function and its derivatives are calculated analytically, thus considerably
increasing the computation speed. Of critical importance are the analytical derivatives of the
matrix exponential with respect to transition rates (Ball and Sansom, 1989; Najfeld and Havel,
1994).
Obviously, a point-by-point algorithm is much slower than a dwell-based method.
Hence, in spite of its mathematical sophistication, MPL is computationally slower than MIL
and is therefore limited in practice to short, noisy data sets with fast kinetics. For any other data,
MIL is preferred, since it is equally accurate yet faster. Using model expansion (meta-models),
MPL can be applied to data with more than one channel, or with correlated noise and/or limited
bandwidth (Qin et al, 2000b).
The method I propose and describe in this chapter is broadly based on MPL, in the sense
that it estimates directly the rate constants of a discrete time Markov model. Unlike MPL,
however, it derives the estimates from a single conductance class sequence (idealized data). I
therefore refer to this method as MIP (Maximum Idealized Point likelihood).
MIP does not need missed event correction, since it operates with discrete Markov
models. Hence, it should perform better than MIL for fast kinetics, when the first order
approximation may not be adequate. However, it cannot compensate for signal extraction errors
caused by correlated noise or overfiltering, although the problem can be alleviated by using state
inference algorithms adapted to non-ideal data (Qin et al 2000b; Venkataramaran et al 1998a, b,
2000). In comparison, the missed event correction in MIL does accommodate, to some extent,
these idealization artifacts.
MIP gains much in speed over MPL, achieving a rate comparable to MIL's. Like MIL, it
must be preceded by signal extraction (MPL requires conductance parameter estimation, which is
computationally equivalent). Moreover, it is in principle possible to design an adaptive algorithm,
which uses raw data for sections with high noise or fast kinetics (such as bursts of activity) and
idealized data for the rest. As with MIL and MPL, linear constraints can be imposed on rate
constant estimation.
Although not conceptually new, MIP is supported by several arguments: it is an
independent way of checking the results obtained with other methods; it is relatively fast; it is
simple to implement; and it is easy to apply to non-stationary data, such as the response of the ion
channel to an experimental protocol which modifies the kinetics. Furthermore, it is easy to
modify for other single-molecule applications, as indicated in Chapter 5. In the rest of this chapter
I present the theory behind incomplete data likelihood maximization and its implementation in
MIP, and I extend the use of MIP to non-stationary data.
3.1 Algorithm
MIP estimates rate constants directly from idealized data, by maximizing the incomplete
data likelihood, computed by the dynamic programming Forward-Backward procedure, described
in Chapter 2. The numerical optimization is performed by the dfpmin algorithm (Fletcher and
Powell, 1963; Fletcher, 1981), as implemented in Press et al, 1992. Naturally, other numerical
optimization methods can be used, as long as they work with analytical derivatives of the cost
function, which is critical for a fast and stable optimization. The necessary mathematical
derivations and the computation procedure are described in the next sections.
3.1.1 Parameterization and constraints
Transitions between states are governed by macroscopic rate constants. Let q_ij be the
macroscopic rate connecting states i and j. In the general case, q_ij is a function of a pre-exponential
factor k_ij^0, a ligand concentration C_ij, an exponential factor k_ij^1 and an exponential
quantity V_ij (for example voltage, pressure, etc.) (Hille, 1992):

q_ij = k_ij^0 · C_ij · exp(k_ij^1 · V_ij)    [3.1.1.1]
For convenience, when there is no ligand binding, Cij is made equal to one. Likewise, when the
voltage dependency is of no interest, or when the voltage across the data set is constant, kij1 is
made equal to zero, in which case the exponential term is lumped into the pre-exponential.
The parameters of interest for estimation are the pre-exponential and exponential constant factors
k_ij^0 and k_ij^1. While there is no mathematical constraint on the exponential factor k_ij^1, the
pre-exponential factor k_ij^0 must be positive. To constrain the estimate of k_ij^0 to positive values, the two
constants can be re-parameterized as follows:
v_ij = ln(k_ij^0)    [3.1.1.2]

w_ij = k_ij^1    [3.1.1.3]

ln q_ij = v_ij + ln C_ij + w_ij · V_ij    [3.1.1.4]
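A small numerical check of [3.1.1.1]-[3.1.1.4], with illustrative values (assuming NumPy):

```python
import numpy as np

def rate(k0, k1, C=1.0, V=0.0):
    """Macroscopic rate q_ij = k0 * C * exp(k1 * V)  (eq. [3.1.1.1])."""
    return k0 * C * np.exp(k1 * V)

def rate_from_vw(v, w, C=1.0, V=0.0):
    """Same rate from the unconstrained variables v = ln(k0), w = k1
    (eqs. [3.1.1.2]-[3.1.1.4]); k0 = exp(v) is positive for any real v."""
    return np.exp(v + np.log(C) + w * V)
```

Optimizing over (v, w) instead of (k0, k1) thus enforces the positivity of k0 automatically, with no explicit bound in the optimizer.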
Often, the physical nature of the ion channel demands that some constraints are imposed
on the rate constants. For example, if the two binding sites of the acetylcholine receptor are
presumed identical, then the rates of the two consecutive binding steps are constrained in the
following way:
k L 
L 
C 2
CL k
CL2
0
0
[3.1.1.5]
When both binding sites are free (state C), the probability that the ligand binds to either site is
double the binding probability when only one site is free (state CL). Hence, the C→CL
transition rate is equal to twice the CL→CL2 rate. Clearly, without a constraining mechanism,
the estimates of the two rates would not be in the correct 2:1 ratio.
Constraints are extremely useful for simplifying the kinetics of complicated models and
for incorporating a priori knowledge. We always seek the minimal number of free
parameters to optimize. Whenever possible, constraints should be used, as each reduces the
number of free parameters by one. Typical constraints are: i) fix a pre-exponential or exponential factor;
ii) scale one pre-exponential or exponential factor to another; iii) enforce microscopic reversibility of
loops. These constraints can be formulated as follows:
Fix k0:       k_ij^0 = ct    ⇔    ln(k_ij^0) = ln(ct)    ⇔    v_ij = ln(ct)    [3.1.1.6]

Fix k1:       k_ij^1 = ct    ⇔    w_ij = ct    [3.1.1.7]

Scale k0's:   k_ij^0 = ct · k_kl^0    ⇔    ln(k_ij^0) = ln(k_kl^0) + ln(ct)    ⇔    v_ij − v_kl = ln(ct)    [3.1.1.8]

Scale k1's:   k_ij^1 = ct · k_kl^1    ⇔    w_ij − ct · w_kl = 0    [3.1.1.9]

Loop:         ∏ q_ij = ∏ q_ji    ⇔    Σ ln q_ij = Σ ln q_ji
              ⇔    Σ (v_ij + ln C_ij + w_ij · V_ij) = Σ (v_ji + ln C_ji + w_ji · V_ji)
              ⇔    Σ v_ij + Σ w_ij · V_ij − Σ v_ji − Σ w_ji · V_ji = Σ ln C_ji − Σ ln C_ij    [3.1.1.10]
ct denotes the constrained value. I denote by r the set (vector) of constrained variables, composed
of both vij and wij:
r  vij , wij 
[3.1.1.11]
The constraint expressions above can be written in matrix form as a set of linear equations:
cM · r = cV    [3.1.1.12]
cM is the coefficient matrix and cV is the constant vector. The singular value decomposition
(SVD) technique can be used to express the constrained variables r as a function of a set of
unconstrained, independent variables x (Qin et al, 2000a):
r  cA  x  cB
[3.1.1.13]
r
 cA
x
[3.1.1.14]
The matrix cA and the vector cB are determined from cM and cV, using SVD. x is the set of free
parameters to be optimized. The parameters of interest, i.e. the pre- and the exponential factors k0
and k1 are obtained from x.
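A sketch of this SVD step in NumPy: cB is one particular solution of [3.1.1.12] (the minimum-norm one), and the columns of cA are an orthonormal basis of the null space of cM, so that r = cA·x + cB satisfies the constraints for any free x. The example constraints below are hypothetical:

```python
import numpy as np

def constraint_to_free(cM, cV, rtol=1e-10):
    """Turn linear constraints cM . r = cV into r = cA . x + cB  [3.1.1.13].

    cB is the minimum-norm particular solution; the columns of cA span the
    null space of cM (the free directions), obtained from the SVD of cM.
    """
    U, sv, Vt = np.linalg.svd(cM)
    rank = int((sv > rtol * sv[0]).sum())
    cB = np.linalg.pinv(cM) @ cV   # particular solution of cM . r = cV
    cA = Vt[rank:].T               # null-space basis of cM
    return cA, cB

# Hypothetical example: r = (v1, v2, v3) with two constraints
cM = np.array([[1.0, 0.0, 0.0],    # v1 = 2.0       (fix k0)
               [0.0, 1.0, -1.0]])  # v2 - v3 = 0.7  (scale k0's)
cV = np.array([2.0, 0.7])
cA, cB = constraint_to_free(cM, cV)
x = np.array([1.3])                # one free parameter remains
r = cA @ x + cB                    # satisfies cM . r = cV exactly
```

With three constrained variables and two independent constraints, exactly one free parameter survives, as expected.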
3.1.2 Likelihood function
The incomplete data likelihood of a sequence of data points, as calculated by the Forward-Backward
procedure, can be written in matrix form as:

L = P_0^T · B_0 · (A_0 · B_1) · (A_1 · B_2) ··· (A_{t-1} · B_t) ··· (A_{T-1} · B_T) · 1    [3.1.2.1]
P0 is the state initial probabilities (column) vector, with properties identical to those of π,
in expressions [2.1.1.2] to [2.1.1.4]. Bt is a diagonal matrix, whose diagonal entries are the
Gaussian conditional emission probability densities, evaluated at data point t:
B_t(i, i) = p(y_t | x_t = i)    [3.1.2.2]

B_t(i, j) = 0,  i ≠ j    [3.1.2.3]

p(y_t | x_t = i) = N(μ_i, σ_i² | y_t)    [3.1.2.4]
At is the transition probability matrix at time t. The time index is necessary for non-stationary
data, when the data is a function of a time-changing stimulus or when the data is gathered under
different conditions.
As it stands in expression [3.1.2.1], computing the likelihood function is equivalent to
propagating the knowledge about state probabilities along the data sequence, in the forward
direction, without normalization. The computation starts with the a priori knowledge of the initial
state probabilities, given by the (transposed) vector P0. The prior is updated into a posterior, using
the evidence, i.e. the first data point y0. The posterior of the current point is projected ahead, as
the prior of the next point, which is again updated using the evidence. The value of the likelihood
function is obtained by summing the state probabilities at the last point, which is achieved by
multiplication with the column vector of ones.
The equivalent prediction-correction formulae are:

initialize:   P̃_0^T = P_0^T    [3.1.2.5]
              P̂_0^T = P̃_0^T · B_0    [3.1.2.6]
predict:      P̃_t^T = P̂_{t-1}^T · A_{t-1}    [3.1.2.7]
correct:      P̂_t^T = P̃_t^T · B_t    [3.1.2.8]
terminate:    L = P̂_T^T · 1    [3.1.2.9]
The predicted state probabilities are corrected by the evidence, i.e. by post-multiplication
with the B matrix. Clearly, the emission probability in [3.1.2.4] gives higher values for those
states that explain better the observation. In other words, data has evidence that supports certain
states more than others. Since B is diagonal, each element of P, Pi, is multiplied only with the
corresponding diagonal element of B, Bii. Thus, some state probabilities are increased relative to
others. Obviously, the more the prediction matches the evidence, point by point, the higher the
value of the likelihood.
Generally, post-multiplication by the B matrix does not preserve the column stochastic
character of the P vector. Therefore, in the computation of the likelihood function, the state
probabilities may decrease or increase rapidly, leading to numerical under- or overflow. This is
prevented by re-normalizing the state vector at each point and using the normalized vector for
computation of the next point. It can be shown that the likelihood is equal to the product of the
norms at each point. Naturally, this quantity can be extremely small or large too, thus the
logarithm of the likelihood should be calculated instead. The log-likelihood is equal to the sum of
the norms’ logarithms. The normalization of the corrected P is:
normalize:    P_t = P̂_t / Σ_{i=1}^{NS} P̂_t(i)    [3.1.2.10]
The calculation of the normalized state vector can be regarded as an application of
Bayesian theory. The posterior state probability distribution is calculated as:
pxt 1  i | yt 1  
p yt 1 | xt 1  i   Pxt 1  i 
p yt 1 
65
[3.1.2.11]
The prior, the likelihood, the evidence and the posterior are:

prior:        P(x_{t+1} = i) = (P_t^T · A_t)(i)    [3.1.2.12]
likelihood:   p(y_{t+1} | x_{t+1} = i) = B_{t+1}(i, i)    [3.1.2.13]
evidence:     p(y_{t+1}) = Σ_{i=1}^{NS} (P_t^T · A_t · B_{t+1})(i)    [3.1.2.14]
posterior:    p(x_{t+1} = i | y_{t+1}) = (P_t^T · A_t · B_{t+1})(i) / Σ_{i=1}^{NS} (P_t^T · A_t · B_{t+1})(i)    [3.1.2.15]
Notice that P(xt+1=i|yt+1) is a simple number, representing the probability distribution of
state x given observation y, evaluated for state i.
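The normalized recursion [3.1.2.5]-[3.1.2.10] can be sketched as follows; log L is accumulated as the sum of the log norms, and a time-indexed A_t is allowed for non-stationary data. The Gaussian emission model follows [3.1.2.4]; the NumPy formulation is one possible implementation:

```python
import numpy as np

def forward_loglik(y, P0, A, mu, sigma):
    """Log-likelihood via the normalized prediction-correction recursion
    [3.1.2.5]-[3.1.2.10]; log L is the sum of the logs of the norms.
    A may be (N, N) for stationary data, or (T-1, N, N) for non-stationary
    data (a time-indexed A_t)."""
    T = len(y)
    # Gaussian emission densities b_t(i) = N(mu_i, sigma_i^2 | y_t)  [3.1.2.4]
    b = (np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2)
         / (sigma * np.sqrt(2 * np.pi)))
    stationary = (np.ndim(A) == 2)
    P = P0 * b[0]              # initialize and correct  [3.1.2.5]-[3.1.2.6]
    norm = P.sum()
    loglik = np.log(norm)
    P = P / norm               # normalize  [3.1.2.10]
    for t in range(1, T):
        At = A if stationary else A[t - 1]
        P = (P @ At) * b[t]    # predict [3.1.2.7], correct [3.1.2.8]
        norm = P.sum()
        loglik += np.log(norm)
        P = P / norm
    return loglik
```

Because the state vector is renormalized at every point, the recursion never under- or overflows, and the log-likelihood is recovered exactly as the sum of the log norms.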
3.1.3 Computation and derivatives of the likelihood function
Estimating maximum likelihood parameters involves maximizing the likelihood function. To do so efficiently, and to obtain standard errors of the estimates, I use the Davidon-Fletcher-Powell algorithm (Fletcher and Powell, 1963), a quasi-Newton method with approximate line search. The same method is used by MIL and MPL (Qin et al, 1996, 2000a). The algorithm is known in the literature as dfpmin; I used the implementation in Press et al, 1992.
Two variables, α and β, are defined, each a column vector of dimension NS. They are initialized and calculated recursively, α in the forward and β in the backward direction, as follows:

α_0^T = P_0^T · B_0    [3.1.3.1]

α_t^T = P_0^T · B_0 · (A_0·B_1) · (A_1·B_2) · (A_2·B_3) ··· (A_{t-1}·B_t)    [3.1.3.2]

α_{t+1}^T = α_t^T · (A_t · B_{t+1})    [3.1.3.3]

β_T = 1    [3.1.3.4]

β_t = (A_t·B_{t+1}) ··· (A_{T-1}·B_T) · 1    [3.1.3.5]

β_t = (A_t·B_{t+1}) · β_{t+1}    [3.1.3.6]

The likelihood L can be computed from α and/or β, in any of the following ways:

L = α_T^T · 1    [3.1.3.7]

L = P_0^T · B_0 · β_0    [3.1.3.8]

L = α_t^T · β_t    [3.1.3.9]
As mentioned previously, the probabilities are normalized to prevent numerical under- or overflow:

α̂_0^T = P_0^T · B_0    [3.1.3.10]

s_0 = Σ_i α̂_0(i)    [3.1.3.11]

α̃_0 = α̂_0 · (1/s_0)    [3.1.3.12]

α̂_{t+1}^T = α̃_t^T · (A_t · B_{t+1})    [3.1.3.13]

s_{t+1} = Σ_i α̂_{t+1}(i)    [3.1.3.14]

α̃_{t+1} = α̂_{t+1} · (1/s_{t+1})    [3.1.3.15]

The same scaling factors, s_t, calculated for the forward variable, are used for the backward variable. β̃_T is not scaled, and the rest are scaled as:

β̃_T = β_T    [3.1.3.16]

β̂_t = (A_t · B_{t+1}) · β̃_{t+1}    [3.1.3.17]

β̃_t = β̂_t · (1/s_{t+1})    [3.1.3.18]

Using the same scaling factors for α and β has the following consequences:

α̃_t^T · β̃_t = 1    [3.1.3.19]

L = Π_{t=0}^{T} s_t    [3.1.3.20]
From expressions [3.1.3.9] and [3.1.3.19], the following relation can be inferred:

α_t^T · β_t = L · α̃_t^T · β̃_t    [3.1.3.21]

And thus:

α_t^T · (A_t·B_{t+1}) · β_{t+1} = L · α̃_t^T · (A_t·B_{t+1}) · β̃_{t+1}    [3.1.3.22]

α_t^T · [∂(A_t·B_{t+1})/∂x] · β_{t+1} = L · α̃_t^T · [∂(A_t·B_{t+1})/∂x] · β̃_{t+1}    [3.1.3.23]

The derivative of the log-likelihood function can be obtained from the derivative of the likelihood function, as follows:

∂log L/∂x = (1/L) · ∂L/∂x    [3.1.3.24]
In the following derivation, I apply the chain rule of matrix differentiation:

∂(A·B)/∂x = (∂A/∂x) · B + A · (∂B/∂x)    [3.1.3.25]

∂(A_1·A_2 ··· A_k)/∂x = Σ_{i=1}^{k} A_1 ··· A_{i-1} · (∂A_i/∂x) · A_{i+1} ··· A_k    [3.1.3.26]

The derivatives of the likelihood function and of its logarithm, with respect to a variable x, can be expressed in terms of α and β as:

∂L/∂x = [∂(P_0^T·B_0)/∂x] · β_0 + Σ_{t=0}^{T-1} α_t^T · [∂(A_t·B_{t+1})/∂x] · β_{t+1}    [3.1.3.27]

∂log L/∂x = (1/L) · [∂(P_0^T·B_0)/∂x] · β_0 + (1/L) · Σ_{t=0}^{T-1} α_t^T · [∂(A_t·B_{t+1})/∂x] · β_{t+1}    [3.1.3.28]

The un-normalized α's and β's can be replaced with their normalized counterparts, using formula [3.1.3.23]. After simplification with L, this gives:

∂log L/∂x = [∂(P_0^T·B_0)/∂x] · β̃_0 + Σ_{t=0}^{T-1} α̃_t^T · [∂(A_t·B_{t+1})/∂x] · β̃_{t+1}    [3.1.3.29]

If the B matrix is not a function of any of the optimized parameters, it can be taken out of the differentiation sign:

∂log L/∂x = (∂P_0^T/∂x) · B_0 · β̃_0 + Σ_{t=0}^{T-1} α̃_t^T · (∂A_t/∂x) · B_{t+1} · β̃_{t+1}    [3.1.3.30]
As indicated in Chapter 2, the A matrix can be calculated from the Q matrix using the spectral decomposition technique (Moler and Van Loan, 1978):

A = exp(Q·Δt) = Σ_i A_i · exp(λ_i·Δt)    [3.1.3.31]

The A_i's are the spectral matrices and the λ_i's are the eigenvalues of Q. The derivative of the A matrix with respect to a parameter x has the form (Najfeld and Havel, 1994):

∂A/∂x = ∂exp(Q·Δt)/∂x = Σ_k Σ_l A_k · (∂Q/∂x) · A_l · f_kl(Δt)    [3.1.3.32]

f_kl(Δt) = Δt · e^{λ_k·Δt}                           if λ_k = λ_l
f_kl(Δt) = (e^{λ_k·Δt} − e^{λ_l·Δt}) / (λ_k − λ_l)   if λ_k ≠ λ_l    [3.1.3.33]
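A minimal sketch of the spectral computation of A and of its derivative, assuming a hypothetical two-state Q matrix (diagonalizable, with distinct eigenvalues) and taking the rate q01 as the parameter x:

```python
import numpy as np

# Spectral form of A = exp(Q*dt) and dA/dx, eqs. [3.1.3.31]-[3.1.3.33].
q01, q10, dt = 100.0, 50.0, 1e-3
Q = np.array([[-q01,  q01],
              [ q10, -q10]])
dQ = np.array([[-1.0, 1.0],
               [ 0.0, 0.0]])              # dQ/dq01, cf. eq. [3.1.3.48]

lam, V = np.linalg.eig(Q)                 # eigenvalues, right eigenvectors
W = np.linalg.inv(V)                      # rows of V^-1 are left eigenvectors
Ak = [np.outer(V[:, k], W[k, :]) for k in range(2)]   # spectral matrices A_k

A = sum(Ak[k] * np.exp(lam[k] * dt) for k in range(2))       # eq. [3.1.3.31]

def f(k, l):
    """f_kl(dt) of eq. [3.1.3.33]."""
    if np.isclose(lam[k], lam[l]):
        return dt * np.exp(lam[k] * dt)
    return (np.exp(lam[k] * dt) - np.exp(lam[l] * dt)) / (lam[k] - lam[l])

dA = sum(Ak[k] @ dQ @ Ak[l] * f(k, l)
         for k in range(2) for l in range(2))                # eq. [3.1.3.32]
```

A finite-difference check (perturbing q01 and recomputing A) should reproduce dA to numerical precision, and the rows of A sum to one, as required of a transition matrix.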
In order to calculate the log-likelihood function and its derivatives, the following quantities are required: the normalized forward and backward variables (α̃ and β̃), the normalizing coefficients (s_t), the eigenvalues (λ_i) and the spectral matrices (A_i) of the Q matrix, the derivatives of the Q matrix with respect to the non-zero transition rates, and the vector of initial probabilities (P_0) and its derivatives with respect to the non-zero rate constants. In the following, I present the required calculations.

The initial probabilities vector P_0 can be specified in several ways. One is to let the user assign values, in which case the derivative of P_0 w.r.t. a parameter x is the zero vector:

∂P_0/∂x = 0    [3.1.3.34]
Alternatively, P_0 can be the equilibrium distribution, satisfying the conditions:

∂P_0/∂t = P_0 · Q_eq = 0    [3.1.3.35]

0 ≤ P_0(i) ≤ 1    [3.1.3.36]

Σ_i P_0(i) = 1    [3.1.3.37]

Q_eq is the matrix for which the equilibrium conditions hold. It could be the same as Q, or different. For example, one could set up a conditioning-test protocol, in which the starting probabilities for the test pulse represent the equilibrium determined by the conditioning pulse experimental conditions. It is then necessary to express P_0 as a function of Q_eq. Let R be a matrix of dimension NS x (NS+1), formed by adding a column of ones to the right-hand side of Q_eq. P_0 is then (Colquhoun and Sigworth, 1995a):

P_0 = u^T · (R · R^T)^{-1}    [3.1.3.38]
u is a column vector of ones, of dimension NS. The derivative of P_0 w.r.t. a variable x can be obtained in the following way:

(R·R^T) · (R·R^T)^{-1} = I    [3.1.3.39]

∂/∂x [(R·R^T) · (R·R^T)^{-1}] = ∂I/∂x = 0    [3.1.3.40]

[∂(R·R^T)/∂x] · (R·R^T)^{-1} + (R·R^T) · [∂(R·R^T)^{-1}/∂x] = 0    [3.1.3.41]

∂(R·R^T)^{-1}/∂x = −(R·R^T)^{-1} · [∂(R·R^T)/∂x] · (R·R^T)^{-1}    [3.1.3.42]

∂(R·R^T)/∂x = (∂R/∂x) · R^T + R · (∂R^T/∂x)    [3.1.3.43]

∂P_0/∂x = u^T · [∂(R·R^T)^{-1}/∂x]    [3.1.3.44]

∂P_0/∂x = −u^T · (R·R^T)^{-1} · [∂(R·R^T)/∂x] · (R·R^T)^{-1}    [3.1.3.45]

∂P_0/∂x = −u^T · (R·R^T)^{-1} · [(∂R/∂x)·R^T + R·(∂R^T/∂x)] · (R·R^T)^{-1}    [3.1.3.46]
The expression above may appear complex, but its computational cost is small, since the few matrix multiplications and additions involved are required only once per iteration.
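The equilibrium computation of expression [3.1.3.38] can be sketched as follows; the three-state generator and its rates are hypothetical:

```python
import numpy as np

# Equilibrium occupancy via eq. [3.1.3.38]: P0 = u^T (R R^T)^{-1}, R = [Q_eq | u].
Q = np.array([[-300.,  300.,    0.],
              [ 100., -150.,   50.],
              [   0.,  200., -200.]])
NS = Q.shape[0]
u = np.ones(NS)

R = np.hstack([Q, u[:, None]])     # NS x (NS+1): column of ones appended to Q_eq
P0 = u @ np.linalg.inv(R @ R.T)    # eq. [3.1.3.38]

print(P0, P0 @ Q)                  # P0 sums to 1 and satisfies P0 . Q = 0
```

Appending the column of ones makes R full rank even though Q itself is singular, so the inverse exists and the normalization constraint is built into the solution.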
The log-likelihood is a function of the Q matrix, or of several Q matrices, one for each set of experimental conditions that affect the ion channel response. For example, one may want to apply a sequence of voltage or concentration pulses, to make the channel visit the otherwise short-lived kinetic states more often. While each different concentration or voltage gives a different Q matrix, the set of free parameters remains the same, as it does not depend on the experimental conditions. Thus each Q matrix involved in the likelihood computation is a function of the same free parameters.
In the general case, the derivative of the log-likelihood function LL, with respect to a free parameter x_k, has the form:

∂LL/∂x_k = Σ_c Σ_{i,j} (∂LL/∂q_ij^c) · [(∂q_ij^c/∂v_ij)·(∂v_ij/∂x_k) + (∂q_ij^c/∂w_ij)·(∂w_ij/∂x_k)]    [3.1.3.47]

c is an index over all experimental conditions. i and j are indices over the entries in the Q matrix corresponding to non-zero rates in the kinetic model.
In order to calculate expression [3.1.3.47], the derivative of Q w.r.t. q_ij is necessary. Let the matrix Q'_ij be this derivative. Q'_ij has the properties:

(Q'_ij)_kl = 1     if k = i and l = j
(Q'_ij)_kl = −1    if k = i and l = i
(Q'_ij)_kl = 0     otherwise    [3.1.3.48]
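A sketch of the Q'_ij construction; the state count and the indices are arbitrary:

```python
import numpy as np

# Derivative of Q with respect to a single rate q_ij, eq. [3.1.3.48]:
# +1 at (i,j), -1 at (i,i) (the diagonal compensates, so rows still sum to 0),
# and 0 elsewhere.
def dQ_dqij(NS, i, j):
    D = np.zeros((NS, NS))
    D[i, j] = 1.0
    D[i, i] = -1.0
    return D

D = dQ_dqij(4, 1, 2)
print(D.sum(axis=1))   # every row sums to zero, as for Q itself
```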
The derivative of the log-likelihood function with respect to a macroscopic rate q_ij^c is:

∂LL/∂q_ij^c = (∂P_0^T/∂q_ij^c) · B_0 · β̃_0 + Σ_{t=0}^{T-1} α̃_t^T · (∂A_t/∂q_ij^c) · B_{t+1} · β̃_{t+1}    [3.1.3.49]
The first term in the right-hand side is non-zero only for the index c corresponding to the
experimental conditions that give the starting probabilities. If P0 is fixed (user-input), the term is
zero for any c. The term inside the summation is non-zero only if the experimental conditions at
time t are the same as those indexed by c. The consequence is that the computational cost for
expression [3.1.3.49] remains constant regardless of the number of different experimental
conditions, since the same number of non-zero terms (T+1) are to be calculated.
The derivatives of the macroscopic rate constant q_ij w.r.t. the constrained parameters are:

∂q_ij/∂v_ij = q_ij    [3.1.3.50]

∂q_ij/∂w_ij = q_ij · V_ij    [3.1.3.51]

The derivatives of the constrained parameters w.r.t. a free parameter x_k are:

∂v_ij/∂x_k = cA^v_{ij,k}    [3.1.3.52]

∂w_ij/∂x_k = cA^w_{ij,k}    [3.1.3.53]
As mentioned, the dfpmin optimizer provides variance estimates for the optimized free parameters x. Variance estimates of the pre-exponential and exponential factors, k0 and k1, can be derived as follows (Qin et al, 2000a):

Var(k_ij^0) = Σ_k Var(x_k) · (dk_ij^0/dv_ij · dv_ij/dx_k)^2    [3.1.3.54]

Var(k_ij^1) = Σ_k Var(x_k) · (dk_ij^1/dw_ij · dw_ij/dx_k)^2    [3.1.3.55]

Var(k_ij^0) = Σ_k Var(x_k) · (k_ij^0 · cA^v_{ij,k})^2    [3.1.3.56]

Var(k_ij^1) = Σ_k Var(x_k) · (cA^w_{ij,k})^2    [3.1.3.57]
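A sketch of the error propagation in expressions [3.1.3.56] and [3.1.3.57], with hypothetical variances and constraint coefficients:

```python
import numpy as np

# First-order propagation of the optimizer's parameter variances to the
# pre-exponential (k0) and exponential (k1) factors. All numbers are illustrative.
var_x = np.array([0.04, 0.01])   # Var(x_k) from the optimizer (e.g. dfpmin)
cAv = np.array([1.0, 0.5])       # dv_ij/dx_k, cf. eq. [3.1.3.52]
cAw = np.array([0.0, 2.0])       # dw_ij/dx_k, cf. eq. [3.1.3.53]
k0 = 150.0                       # estimated pre-exponential factor k0_ij

var_k0 = np.sum(var_x * (k0 * cAv) ** 2)   # eq. [3.1.3.56]
var_k1 = np.sum(var_x * cAw ** 2)          # eq. [3.1.3.57]
print(var_k0, var_k1)
```

The factor k0 appears inside the square in eq. [3.1.3.56] because k0 = exp(v), so dk0/dv = k0, whereas k1 = w enters linearly.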
3.1.4 Speeding up the computation
The algorithm as presented could be applied to raw data, in which case it becomes quasi-identical to MPL (Qin et al, 2000a). Alternatively, it can be applied to idealized data, by replacing the emission probabilities in the B_t matrix with 0's and 1's: 1 when the class of the idealized data point at time t is the same as the class of the entry in the matrix, and 0 otherwise. The 0 and 1 entries in the B_t matrix have the effect of selecting a single class sequence out of all possible sequences. For this situation, I propose a way of speeding up the calculation, by pre-computing the multiplications in the likelihood function. As long as the class of the data points remains unchanged, the likelihood function and its derivative contain a product of identical terms. For example, K consecutive data points of the same class have identical B and A matrices:
L = ··· (A_{t-1}·B_t) · (A_t·B_{t+1}) · (A_{t+1}·B_{t+2}) ··· (A_{t+K-2}·B_{t+K-1}) ···    [3.1.4.1]

This is the same as multiplying with one term raised to the Kth power:

L = ··· (A_{t-1}·B_t)^K ···    [3.1.4.2]

The Kth power can be written as a product of terms raised to smaller powers, with the condition that these powers sum up to K:

(A_{t-1}·B_t)^K = (A_{t-1}·B_t)^{k_1} · (A_{t-1}·B_t)^{k_2} ··· (A_{t-1}·B_t)^{k_M}    [3.1.4.3]

Σ_{l=1}^{M} k_l = K    [3.1.4.4]

The powers of the individual terms can be chosen to be themselves powers of two, e.g. 1, 2, 4, 8, etc. This choice has the consequence that a term raised to any even power is equal to the product of two terms raised to half that power:

(A_{t-1}·B_t)^{2k} = (A_{t-1}·B_t)^k · (A_{t-1}·B_t)^k    [3.1.4.5]

It is therefore enough to calculate the term (A_{t-1}·B_t) and, via expression [3.1.4.5], its 2nd, 4th, 8th… powers, up to a given maximum. Obviously, the calculation should be done for each conductance class. Suppose, for example, that there is a dwell 73 points long (strictly speaking, not a dwell, but 73 consecutive points of the same class). Instead of 73 multiplications, 3 are enough:

(A_{t-1}·B_t)^{73} = (A_{t-1}·B_t)^{64} · (A_{t-1}·B_t)^{8} · (A_{t-1}·B_t)    [3.1.4.6]
The three powers on the right-hand side of the above equation are pre-calculated. This results in a considerable speed improvement. Care must be taken, though, to avoid numerical under- or overflow, which may occur if too many terms are multiplied before normalization. The speed-up treatment applies to the log-likelihood function as well as to its derivatives. Obviously, for the pre-multiplications to be valid, the A matrix must be constant in all terms involved.
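The binary-power dwell trick can be sketched as follows, for a hypothetical two-state A and B and a 73-point dwell:

```python
import numpy as np

# Dwell speed-up of eqs. [3.1.4.2]-[3.1.4.6]: (A·B)^K via pre-computed
# powers of two, decomposing K into its binary digits (73 = 64 + 8 + 1).
A = np.array([[0.99, 0.01], [0.05, 0.95]])
B = np.diag([1.2, 0.001])      # emission matrix for the current class
M = A @ B

# pre-compute M^1, M^2, M^4, ..., M^64 by repeated squaring, eq. [3.1.4.5]
powers = [M]
for _ in range(6):
    powers.append(powers[-1] @ powers[-1])

def dwell_product(K):
    """Product of K identical (A·B) factors, using the binary decomposition."""
    out = np.eye(2)
    for bit, P in enumerate(powers):
        if K & (1 << bit):
            out = out @ P
    return out

print(np.allclose(dwell_product(73), np.linalg.matrix_power(M, 73)))
```

Since every factor is a power of the same matrix M, the factors commute and the order of multiplication does not matter.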
I mentioned in the introduction to this chapter that, with very noisy data and/or very fast kinetics, the likelihood function does not peak strongly for a particular class sequence. Hence, choosing a single sequence over the others results in significant errors. A threshold on state assignment probabilities could be defined, say 0.9. If any one state i has a probability above the threshold at a given point, the algorithm could use the idealization there; otherwise it uses the raw data. This mix of algorithms provides a useful compromise between computation speed and accuracy. No results of this kind of optimization are yet available.
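The proposed mix could take a form such as the following per-point sketch; the function name, threshold and numbers are hypothetical:

```python
import numpy as np

# Hybrid rule: use the idealized (0/1) emission row when the posterior state
# probability exceeds a threshold, else fall back to the raw-data emissions.
THRESHOLD = 0.9

def emission_row(posterior, raw_b_row, n_states):
    """Return one row of B_t: hard 0/1 selection, or raw emission densities."""
    best = np.argmax(posterior)
    if posterior[best] > THRESHOLD:
        row = np.zeros(n_states)
        row[best] = 1.0          # confident point: select a single class
        return row
    return raw_b_row             # ambiguous point: keep the raw densities

row = emission_row(np.array([0.97, 0.03]), np.array([0.4, 0.2]), 2)
print(row)   # -> [1. 0.], since 0.97 exceeds the threshold
```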
3.2 Results
Since MIP is a close kin of the MPL algorithm, an exhaustive evaluation of its
performance and characteristics is not necessary. Instead, in Chapter 4 I use MIP with baseline
correction, side by side with MIL and MPL. In Chapter 5, on applications of infinite state-space
Markov models to single molecule data, I modify MIP in order to apply it to single molecule
fluorescence data from kinesin. In Chapter 6, the same parameterization of the transition rates,
linear constraint formalism, form of the log-likelihood function derivatives for stationary and
non-stationary data and other computational aspects are used for analysis of ensemble Markov
models applied to macroscopic patch-clamp recordings.
3.3 Discussion
The results in Chapter 4, and further results that are not shown here, indicate that the method I proposed is valid. Not surprisingly, its estimates are nearly identical to those obtained
with MPL, when the data has a good signal-to-noise ratio and the likelihood function is peaked
for one class sequence. For higher noise, the performance of the idealization procedure itself
degrades. At SNR ≤ 1 the degradation is so extreme that only MPL, i.e. raw data, can be used.
The missed event correction in MIL can be pushed further than the minimum of one
sampling interval, in which case the missed events caused by over-filtering are, to some extent,
recovered. In practice though, the choice of the dead time parameter for missed events caused by
over-filtering is subject more to the user’s experience than to a rigorous formulation, as it is
nonlinear and depends on unknown factors, such as the frequency response of the patch clamp
amplifier. Limited as it is, the missed event correction is remarkably good for most practical
situations.
When the data has a very low signal to noise ratio, the re-estimation/idealization
algorithm (whether Segmental k-means or Baum-Welch) makes errors of a different kind. When
the Gaussian conditional emission probabilities start to overlap, inevitably both false positive and
false negative classification errors occur. This is different from the over-filtering case, which
results only in false negative errors. MIP does not have, and does not require, a missed event correction. When over-filtering is suspected of causing idealization errors with standard algorithms, re-estimation/idealization methods suitable for correlated noise can be used (Qin et al, 2000b) or, more simply, the data can be decimated to the Nyquist limit. When high noise is the source of idealization errors, MIP performs similarly to MIL. Very fast kinetics push MIL to the limits of its first-order missed-event correction, but they do not seem to affect either MIP or idealization with Baum-Welch (results not shown).
4. Signal detection and baseline correction of single channel records
Both discrete and continuous time Markov models can be used for kinetic analysis of
single channel records, as explained in Chapter 3. In addition to kinetics, the current amplitude
and the state noise characteristics of each conductance class are of interest. Conductance
parameters can be estimated with discrete time hidden Markov models. Baum-Welch (Baum et
al, 1970) and Segmental k-means (Rabiner et al, 1986) algorithms estimate simultaneously the
maximum likelihood kinetics and conductance parameters. Both algorithms include a state
inference step, performed respectively, by Forward-Backward (Baum et al, 1970) and Viterbi
(Viterbi, 1967; Forney, 1973). An optimal sequence of conductance classes is obtained as a byproduct of the state inference step. The reconstruction of a continuous sequence of dwells from
digitized, noisy data is a procedure known as data idealization. From the idealized data,
algorithms based on continuous time Markov models estimate maximum likelihood kinetic
parameters (Colquhoun et al, 1996; Qin et al, 1996).
Data idealization and conductance parameter estimation generally rely on ideal data
properties: Gaussian and non-correlated noise, stationary conductance parameters and a steady
baseline. Existing methods can deal with correlated noise and filtering effects (De Gunst et al,
2001; Qin et al, 2000b; Venkataramanan et al, 1998a, b, 2000), but the presence of baseline
artifacts is a more problematic issue. Examples of baseline artifacts are: slow drift caused by
changes in electrode or membrane potential, changes in patch or seal properties, sinewave
interference from power lines, or other stochastic fluctuations of unknown nature. Non-stationary
baseline affects the accuracy of estimates and, in extreme cases, may prevent any kind of
analysis.
The problem of deterministic baseline artifacts in single channel records has been
considered before, and correction methods proposed (Sachs et al, 1982; Chung et al, 1991;
Baumgartner et al, 1998). In contrast, to my knowledge stochastic baseline fluctuations have
never been addressed. In fact, deterministic artifacts can be regarded as a limit case of stochastic
fluctuations. It is easy to see how all baseline artifacts may be interpreted as non-stationary
conductance parameters, i.e. changes in the mean current. I model baseline fluctuations, of any
nature, with a discrete time, continuous state, dynamic stochastic process, namely the linear
Gaussian model (LGM; see Chapter 2). The stochastic baseline process affects the emission
probabilities of the hidden Markov model which describes the ion channel signal.
In the following section, I discuss the relevance of hidden Markov models with stochastic
parameters, or with mixed states, to the baseline correction problem. Next, I describe the linear
Gaussian model of the stochastic baseline process. I introduce two novel methods for baseline
correction, each based on coupling algorithms for hidden Markov and for linear Gaussian models.
I test the resulting baseline correction methods with simulated and experimental single channel
data, contaminated with different types of artifacts. I determine the performance of the algorithms
in removing stochastic or deterministic (sinusoidal interference) artifacts and establish their
operational range, with regard to state noise and the magnitude of the ion channel signal relative
to baseline fluctuations. The performance is evaluated by comparing the kinetics and conductance
parameters estimated from non-contaminated data with those estimated from baseline-corrected
data.
4.1 Hidden Markov models with stochastic parameters or with mixed states
Hidden Markov models with stochastic parameters are interesting in two situations: i).
the physical process modeled has genuinely stochastic, time-variable parameters; ii). there is
another stochastic process which contaminates the experimental measurement. Ion channels with
modal kinetics are an example of the first class; these channels switch between different kinetic
schemes. The kinetic parameters change between different discrete values, for example as a result
of molecular interactions, such as phosphorylation (Swope et al, 1999). The changes in
parameters may be continuous as well, as indicated by ion channels with the so-called
“wanderlust” kinetics (Silberberg et al, 1996). Baseline fluctuations are an example of the second
class, in which the measurement of a hidden Markov model is contaminated by another stochastic
process. Although the measurement characteristics do not change in time as a stochastic process,
they can be interpreted as such. The result is a Markov model with stochastic emission
parameters.
When the stochastic parameters take discrete values, the overall process can be modeled
with another, larger Markov meta-model. The meta-model has discrete meta-states, which are
combinations of process and parameter states. In comparison, when the stochastic parameters take
continuous values, the overall process can be modeled either as a Markov model with stochastic
parameters or as a mixed-state (continuous and discrete) dynamic model. The choice of modeling
with stochastic parameters or with mixed states is determined, in practice, by which interpretation
has the most efficient problem-solving algorithm.
The economics and engineering literature abounds with examples of mixed-state models (Shumway and Stoffer, 1991; Kim, 1994; Ghahramani and Hinton, 1996). While meta-models are
approached with the same Markov-based algorithms, models with mixed states are much less
amenable to analysis. Existing algorithms are Monte Carlo (Carter and Kohn, 1996), Bayesian
(Billio et al, 1998) or EM based (Logothetis and Krishnamurthy, 1999; Doucet and Andrieu,
1999), and are all very recent developments.
4.2 Baseline linear Gaussian model
For ion channels, the emission probabilities of the hidden Markov model are, ideally, Gaussian distributions, with constant mean μ_i and variance σ_i^2, conditional on the state i:

p(y_t | x_t = i) = N(μ_i, σ_i^2 | y_t)    [4.2.1]

I modify this representation by replacing the constant conditional mean μ_i with a time-variable conditional mean μ_{i,t}:

μ_{i,t} = μ_i + b_t    [4.2.2]

b_t is the value of the baseline offset at time t. The new conditional emission probabilities are:

p(y_t | x_t = i) = N(b_t + μ_i, σ_i^2 | y_t)    [4.2.3]

By definition, all states in the closed conductance class have mean equal to zero (i.e. they do not conduct current):

μ_i = 0,  for all i in the closed class    [4.2.4]
The baseline stochastic process b is described by a linear Gaussian model, with process equation:

b_t = A_B · b_{t-1} + ε_{B,t-1}    [4.2.5]

Since the state vector b has a dimension of one (i.e. it is a scalar), b, the state transition matrix A_B, and the process noise covariance Q_B are all scalars. Thus, A_B is denoted by a_B and Q_B by q_B. By definition of a baseline, a slow deterministic drift present in single channel records is negligible on a short time scale. Thus it can be safely assumed that the state transition parameter a_B is equal to one, which corresponds to a slope of zero. I also assume that the process variance q_B is constant in time. Equation [4.2.5] simplifies to:

b_t = b_{t-1} + ε_B    [4.2.6]
Equation [4.2.6] represents an unbiased, unidimensional random walk, with expected mean b0 and
variance qB. The connection with the experimental physical phenomenon is evident, although not
perfect. In [4.2.6], bt is free to take any value and to drift probabilistically up or down (although
its expectation remains always equal to b0). In reality, if slow drift is discounted, the baseline is
constrained to stay around the same approximate value in spite of local fluctuations, stochastic or deterministic (such as sinusoidal). This behavior is quite different from that of an unbiased random walk, which is under no such constraint. The model baseline is memory-less: the next baseline offset is a function of the current offset only. In comparison, the real baseline appears to be memory-less too, but perhaps not to the same degree. An auto-regressive model of some (small) order k would be a more accurate representation, since the next state of an auto-regressive model is a function of the previous k states. Certainly, a more complex, higher-order model would better capture the characteristics of the real phenomenon, but the mathematical complexity and the computational costs may not be justified by the results. As will be demonstrated in the results section, the simple model in equation [4.2.6] is capable of tracking large fluctuations, stochastic or sinusoidal (and implicitly slow drift).
Since the emission probabilities are modified by the stochastic baseline process, the Markov model may be regarded as having stochastic parameters, with continuous states. Alternatively, the ion channel signal, contaminated with baseline fluctuations, may be described by a stochastic model with mixed states: discrete and continuous. A graphical depiction of this construct is shown in Figure [4.2.1]. The ion channel's Markov discrete state x_t depends probabilistically only on the previous state x_{t-1}. The continuous baseline state b_t depends probabilistically only on the previous state b_{t-1}. The measurement y_t is a probabilistic function of the state x_t and a deterministic function of the baseline b_t, as in equation [4.2.3]. The measurement equation of the mixed state process can be written as:

y_t = b_t + ε_{x_t}    [4.2.7]

ε_{x_t} is a random Gaussian variable, with mean and variance conditional on the Markov state x_t, and is the source of measurement errors. Clearly, as indicated by expression [4.2.7], y_t depends deterministically on b_t and probabilistically on x_t.
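A simulation sketch of the mixed-state signal of equations [4.2.6] and [4.2.7], with a hypothetical two-state channel; all parameter values are illustrative:

```python
import numpy as np

# Mixed-state simulation: a two-state channel (HMM) plus a random-walk baseline,
# combined by the measurement equation y_t = b_t + eps_{x_t}.
rng = np.random.default_rng(1)
T = 10_000
A = np.array([[0.99, 0.01], [0.02, 0.98]])   # closed <-> open transitions
mu = np.array([0.0, 1.0])                    # class means (closed class is 0)
sigma = np.array([0.1, 0.12])                # class standard deviations
q_B = 1e-4                                   # baseline process variance

x = np.zeros(T, dtype=int)
for t in range(1, T):
    x[t] = rng.choice(2, p=A[x[t - 1]])      # Markov channel state

b = np.cumsum(rng.normal(0.0, np.sqrt(q_B), T))   # b_t = b_{t-1} + eps_B
y = b + rng.normal(mu[x], sigma[x])               # y_t = b_t + eps_{x_t}
```

Plotting y against time shows channel dwells riding on a slowly wandering baseline, which is the situation the correction algorithms of section 4.3 address.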
[Figure 4.2.1 appears here: a lattice of nodes x_{t-1} → x_t → x_{t+1} (top row), y_{t-1}, y_t, y_{t+1} (middle row) and b_{t-1} → b_t → b_{t+1} (bottom row), with each y_t receiving arrows from x_t and b_t.]

Figure [4.2.1]. Mixed-states stochastic model of the ion channel signal contaminated with stochastic baseline fluctuations. Squares represent discrete states and circles represent continuous states. Dotted arrows denote stochastic dependencies and continuous arrows denote deterministic dependencies. Shaded symbols represent observed variables and clear symbols represent unobserved (hidden) variables. x_t is the hidden Markov model state (ion channel conformation) and b_t is the linear Gaussian model state (baseline offset).
The ion channel process and the baseline process are uncoupled, evolving independently
of each other as represented in Figure [4.2.1]. The state of one does not interfere in any way with
the state of the other. The measurement, however, is a function of both states, discrete and
continuous. The measurement couples the discrete (HMM) and the continuous (LGM) models for
the purpose of state inference, and thus for parameter estimation. If the Markov state sequence X
were known, the HMM signal could be subtracted from the observation sequence Y, and the
remainder signal (the baseline) could be processed with linear Gaussian algorithms. The converse
is also true: if the baseline were known, it could be subtracted from the observation sequence and
the remainder signal (the ion channel) could be analyzed with hidden Markov methods.
Figure [4.2.2] shows an example of baseline fluctuations generated with equation [4.2.6],
with process variance qB = 0.0001. bt is drawn as a function of time. It is interesting to notice the
quasi-identical appearance of data at different time scales, a characteristic of diffusional
processes.
Figure [4.2.3] presents power spectra of data generated with a Markov model, a linear
Gaussian model and non-correlated (white) Gaussian noise. The Markov spectrum is a sum of
Lorentzians, parameterized by the eigenvalues of the Q matrix. The LGM spectrum is of 1/f type,
which explains the time domain invariance with respect to time scale. The white Gaussian
spectrum is uniform at all frequencies.
[Figure 4.2.2 appears here. Panel A scale bars: 1 pA, 50 ms; panel B scale bars: 0.1 pA, 2 ms.]

Figure [4.2.2]. Data simulated with a linear Gaussian model, with process variance q_B = 0.0001, and no measurement error. The inset in A is expanded in B.
[Figure 4.2.3 appears here. Panels A (MM), B (Gaussian state noise) and C (LGM) plot log10 power (pA^2/Hz) against log10 frequency (Hz); panel D diagrams a three-state Markov scheme with rate labels 100.0/50.0 and 5000.0/2000.0.]

Figure [4.2.3]. Power spectra of dynamic models and of Gaussian white noise. A Spectrum of Markov model (no measurement noise), with open class amplitude 1.0. B Spectrum of white Gaussian noise, with standard deviation 0.1. C Spectrum of linear Gaussian model, with process standard deviation 0.01 (q_B = 0.0001). D Markov model used for simulation of data analyzed in A. All data have one million points and are sampled at 100 kHz.
4.3 Algorithms for baseline correction
The signal and the noise components commonly present in a patch clamp record are
summarized in Figure [4.3.1]. The three components, the ion channel signal, the background
noise and the baseline fluctuations are considered independent and hence additive in an RMS
sense. By definition, the states in the closed conductance class have no intrinsic noise, while the
open states should have a certain amount of excess noise. Various noise sources in the recording
chain, typically arising from the membrane patch and the amplifier, provide the wideband
background noise. This noise is Gaussian, and approximately white, although having a high
frequency boost followed by a bandlimiting filter roll-off (Venkataramanan et al, 1998a, b, 2000).
The baseline fluctuations are either stochastic or deterministic (e.g. slow drift with sinusoidal
interference).
The goal of baseline correction is to eliminate the baseline fluctuations and preserve the
other components. A perfectly deterministic corrupting signal can, in principle, be removed
without any error. However, when separation of two stochastic signals is attempted, false positive
and false negative classification errors occur. Therefore, removal of stochastic baseline artifacts
from another stochastic signal (the noisy hidden Markov model of the ion channel) can not be
perfect. The baseline removal procedure is subject to statistical errors and it will affect the
estimated Markov signal, the state excess noise and the background noise. For example, an
abrupt, upward change in baseline can be mistaken for a channel opening, thus modifying the
apparent kinetics. The variance observed in the data (after the Markov signal is subtracted) must
be partitioned between state excess noise and background noise, on one hand, and baseline
fluctuations (stochastic or not) on the other hand.
[Figure 4.3.1 appears here. Block diagram: the ion channel signal (hidden Markov model, with state excess noise), the background noise (additive white Gaussian noise) and the baseline fluctuations (linear Gaussian model) sum to give the patch clamp record.]

Figure [4.3.1]. Signal and noise components in a patch clamp record.
Even when there are no baseline fluctuations, a correction algorithm a fortiori
underestimates the variance of the state noise and the background noise, and overestimates the
variance of the baseline. The problem is complicated even more by the non-stationary character
of the artifacts model, the parameters of which are, at best, only approximations. Thus, no
baseline correction algorithm should be expected to perform flawlessly but, within a certain
signal-to-noise range, it should be robust and not introduce large errors. If the amplitude of
wideband background fluctuations remains well within the state amplitudes throughout the
record, standard state inference procedures, like Forward-Backward or Viterbi work acceptably
well. However, the variance estimation will be unreliable for the reasons discussed above. When
the signal to noise is low, standard algorithms fail and the kinetics cannot be solved.
The ion channel signal contaminated with baseline fluctuations can be modeled with a
dynamic model with mixed states (discrete and continuous). Although it is certainly possible to
compute the mixed states using the Bayesian theory, it is an intractable, combinatorial
optimization problem (Tugnait, 1982). Suppose the discrete state sequence (of the ion channel) is
known. Then the continuous state sequence (of the baseline) could be estimated optimally with a
Kalman filter-smoother. In contrast, if the discrete states are not known, the continuous sequence has to be estimated for each of the NS^T possible discrete sequences (where NS is the number of discrete states and T is the length of the data). Likewise, if the continuous state sequence were known, the discrete sequence could be optimally estimated with the hidden Markov filter-smoother (Forward-Backward). In contrast, if the continuous states are unknown, the discrete sequence has to be estimated by integration over the continuous state space, with one integral sign for each data point.
The intractability can be illustrated in a different way. Suppose that at time 0, the Markov
and the baseline states are, respectively, X0 and b0, where X0 is a state probability vector and b0 is
a Gaussian probability distribution with mean b0 and variance V0 (see Chapter 2, sections 2.1.1
and 2.1.2). The mixed state M0 is specified as the discrete-continuous pair:
M 0  X0 , b 0 
[4.3.1]
Bayesian estimation of the mixed state (MAP state estimation) involves the propagation of a
probability density. The discrete and continuous components of the mixed state can be predicted
independently from the previous state, for any t, as follows:
predicted discrete state:

\tilde{X}_{t+1}^T = X_t^T \cdot A    [4.3.2]

predicted continuous state:

\tilde{b}_{t+1} \sim N(b_t, V_t + q_B)    [4.3.3]
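The two prediction equations above can be sketched in a few lines of code. The transition matrix A, the state probability vector and the value of q_B below are illustrative, not values from the thesis:

```python
import numpy as np

# Hypothetical two-state transition matrix (rows sum to 1) and baseline
# process variance; both are illustrative, not values from the thesis.
A = np.array([[0.99, 0.01],
              [0.02, 0.98]])
q_B = 1e-4

def predict_mixed_state(X_t, b_t, V_t):
    """One-step prediction of the mixed state, per [4.3.2] and [4.3.3]."""
    X_pred = A.T @ X_t   # discrete part: X~_{t+1}^T = X_t^T . A
    b_pred = b_t         # continuous part: the random-walk mean carries over
    V_pred = V_t + q_B   # ...while its variance grows by q_B
    return X_pred, b_pred, V_pred
```

Note that the prediction step leaves the baseline mean unchanged and only inflates its variance, which is why the measurement update later in this section is needed to pull the baseline back toward the data.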
To simplify notation, I write xt=i as xti, for any t. The marginal probability distribution of
b0, conditional on the measurement y0, is obtained by marginalization with respect to each
possible discrete state i:
p(b_0 | y_0) = \sum_{i=1}^{N_S} p(b_0, x_0^i | y_0)    [4.3.4]

p(b_0, x_0^i | y_0) = \frac{p(y_0 | b_0, x_0^i) \cdot p(b_0) \cdot P(x_0^i)}{p(y_0)}    [4.3.5]
All the terms on the right-hand side of expression [4.3.5] are Gaussian, except P(x_0^i), which is a
simple number. Thus, the joint probability p(b_0, x_0^i | y_0) is also Gaussian. Then p(b_0 | y_0) is a sum of
N_S Gaussians, each multiplied by a weighting factor P(x_0^i).
The marginal probability distribution of b_1, conditional on the measurement y_1, is
obtained in the same way as for b_0, as follows:

p(b_1 | y_1) = \sum_{i=1}^{N_S} p(b_1, x_1^i | y_1)    [4.3.6]

p(b_1, x_1^i | y_1) = \frac{p(y_1 | b_1, x_1^i) \cdot p(b_1) \cdot P(x_1^i)}{p(y_1)}    [4.3.7]
But the prior p(b_1) is a projection of p(b_0 | y_0), which is a sum of N_S Gaussians. Hence,
p(b_1) is also a sum of Gaussians, one for each possible state of X_1. Thus p(b_1, x_1^i | y_1) is also a sum
of N_S Gaussians and, consequently, p(b_1 | y_1) is a sum of N_S × N_S Gaussians. By induction, the
marginal probability distribution of b_t, conditional on the measurement sequence {y_0, ..., y_t}, is a
sum of N_S^t Gaussians. Clearly, the conditional probability space of b grows exponentially with
the sequence length. In contrast, for models with either discrete or continuous states only, the
conditional space growth does not occur. Although a tractable solution to the state inference
problem for mixed states models is not possible, approximate methods exist (Boyen and Koller,
1998; Frey and MacKay, 1997; Koller et al, 1999; Lauritzen, 1992; Minka, 2001; Murphy et al,
1999; Opper and Winther, 1999; Shachter, 1990; Yedidia et al, 2000).
I present here two new algorithms, which I refer to as Viterbi-Kalman and
Forward-Backward-Kalman, respectively. The first finds the mixed state sequence which maximizes the
complete data likelihood of the discrete state sequence X, the continuous state sequence b and the
data sequence Y, using the Viterbi algorithm and the Kalman filter. The second is an
Expectation-Maximization algorithm which finds a maximum likelihood baseline sequence, interpreted as the
continuous, stochastic parameters of a hidden Markov model. The expectation step is performed
by a modified Forward-Backward procedure, and the maximization step is performed by a
modified Kalman filter-smoother. The algorithm has an additional step for parameter
initialization, an issue not addressed in a similar algorithm (Logothetis and
Krishnamurthy, 1999).
The two methods work well even for large amplitude fluctuations, and can correct
stochastic as well as deterministic artifacts (slow drift and sinusoidal interference). The first
method computes a filtered state estimate, while the second computes a smoothed estimate. Thus,
not surprisingly, the second method works better under high noise or heavy baseline
fluctuations, but it is slower than the first. Both methods give reliable state variance estimates.
4.3.1 Viterbi-Kalman algorithm
For a hidden Markov model, the Viterbi algorithm finds the most likely sequence of
states, by maximizing the complete data likelihood:
X^{ML} = \arg\max_{X} \, p(Y, X)    [4.3.1.1]
For a mixed state process, one would like to find the most likely sequence of mixed
states, by maximizing the complete data likelihood of the discrete state sequence X (ion channel),
the continuous state sequence b (baseline) and the data sequence Y:
X, bML  arg max  pY, X, b
[4.3.1.2]
X ,b
Since X and b evolve independently, the complete data likelihood expression can be written as:
pY, X, b  pY | X, b  PX  pb
[4.3.1.3]
T
pY, X, b   p y0 | x0 , b0   Px0   pb0    p yt | xt , bt   Pxt | xt 1   pbt | bt 1 
[4.3.1.4]
t 1
The quantities on the right-hand side of expression [4.3.1.4] are as described previously, except
p(y | x, b), which is the conditional Gaussian probability density of observing data point y, given
that the channel is in state x and the baseline offset is b:

p(y | x, b) = N(\mu_x + b, \sigma_x^2 \,|\, y)    [4.3.1.5]
The previous section has demonstrated that Bayesian mixed state inference is intractable,
because forward propagation of the continuous state probability density needs N_S^t Gaussians at
time t, for N_S discrete states. However, I show in the following that obtaining the most likely
sequence of discrete and continuous states, by maximizing the complete data likelihood, is a
tractable problem, by way of a certain approximation.
Suppose that at time 0, the mixed state, M0, is specified as discussed, with a discrete state
probability vector X0, and a continuous state Gaussian probability density b0, with mean b0 and
variance V0:
M 0  X0 , b 0 
[4.3.1.6]
Expression [4.3.1.6] gives the prior distribution of X_0 and b_0, i.e. in the absence of the
measurement y_0. To simplify notation, I write x_t = i as x_t^i, for any t. Given y_0 and x_0 = i, the
conditional posterior of b_0 can be calculated as follows:

p(b_0 | y_0, x_0^i) = \frac{p(b_0, y_0, x_0^i)}{p(y_0, x_0^i)}    [4.3.1.7]

p(b_0 | y_0, x_0^i) = \frac{p(y_0 | b_0, x_0^i) \cdot p(b_0, x_0^i)}{p(y_0, x_0^i)}    [4.3.1.8]

p(b_0 | y_0, x_0^i) = \frac{p(y_0 | b_0, x_0^i) \cdot p(b_0) \cdot P(x_0^i)}{p(y_0 | x_0^i) \cdot P(x_0^i)}    [4.3.1.9]

p(b_0 | y_0, x_0^i) = \frac{p(y_0 | b_0, x_0^i) \cdot p(b_0)}{p(y_0 | x_0^i)}    [4.3.1.10]
The measurement Y is a function of both X and b, conforming to [4.3.1.5]. When X is known, its
contribution to Y can be eliminated, i.e. subtracted from Y. Hence, expression [4.3.1.10] can be
written more simply as:

p(b_0 | y_0^i) = \frac{p(y_0^i | b_0) \cdot p(b_0)}{p(y_0^i)}    [4.3.1.11]

y_0^i = (y_0 | x_0^i) = y_0 - \mu_i    [4.3.1.12]


The terms on the right-hand side of expression [4.3.1.11] are Gaussian, thus p(b_0 | y_0^i) is also a
Gaussian, optimally calculated by the Kalman filter:

p(b_0 | y_0^i) = N(b_0^i, V_0^i)    [4.3.1.13]
I denote by \delta_0^i the complete data likelihood of the mixed state sequence and data
sequence, up to time t = 0. \delta_0^i is calculated as follows:

\delta_0^i = \max_{b_0} \, p(y_0, b_0, x_0^i)    [4.3.1.14]

\delta_0^i = \max_{b_0} \left[ p(y_0 | b_0, x_0^i) \cdot p(b_0) \cdot P(x_0^i) \right]    [4.3.1.15]

\delta_0^i = P(x_0^i) \cdot \max_{b_0} \left[ p(y_0 | b_0, x_0^i) \cdot p(b_0) \right]    [4.3.1.16]

Clearly, [4.3.1.16] is maximized by the same b_0 as obtained in [4.3.1.13], i.e. b_0^i:

\delta_0^i = p(y_0 | b_0^i, x_0^i) \cdot p(b_0^i) \cdot P(x_0^i)    [4.3.1.17]

\delta_0^i = p(y_0^i | b_0^i) \cdot p(b_0^i) \cdot P(x_0^i)    [4.3.1.18]
I denote by 1j the complete data likelihood of the mixed state sequence and data
sequence, up to time t=1. 1j is calculated as follows:
95


1j  max py0 , b0 , x0i , y1 , b1 , x1j 
b0 ,b1 ,i
[4.3.1.19]
1j  max py1 | b1 , x1j  pb1 | b0   Px1j | x0i  py0 | b0 , x0i  pb0   Px0i  [4.3.1.20]
b0 ,b1 ,i
1j  max py1j | b1  pb1 | b0   Px1j | x0i  py0i | b0  pb0   Px0i 
[4.3.1.21]
b0 ,b1 ,i
Generally, calculating \delta_t^j is not tractable, since \delta_t^j is a function of 2t+1
parameters (the t+1 baseline values and the t previous discrete states). However, an
approximation can be used that allows a recursive calculation of \delta_t^j. Thus, \delta_t^j can be
approximated as follows:

\delta_t^j \approx \max_i \left[ P(x_t^j | x_{t-1}^i) \cdot \max_{b_t} \left( p(y_t^j | b_t) \cdot p(b_t | b_{t-1}^i) \right) \cdot \delta_{t-1}^i \right]    [4.3.1.22]

As with Viterbi, the previous discrete state is stored in the \psi_t^j variable:

\psi_t^j = \arg\max_i \left[ P(x_t^j | x_{t-1}^i) \cdot \max_{b_t} \left( p(y_t^j | b_t) \cdot p(b_t | b_{t-1}^i) \right) \cdot \delta_{t-1}^i \right]    [4.3.1.23]
The maximization with respect to b_t is performed with a Kalman filter, which gives
p(b_t | y_t^j) as a Gaussian:

p(b_t | y_t^j) = N(b_t^j, V_t^j)    [4.3.1.24]


To calculate p(b_t | y_t^j), the standard Kalman forward expressions [2.3.2.3.4] to [2.3.2.3.8] are
used, with two modifications: the measurement y_t is replaced with y_t^j, and the measurement
variance r_t is replaced with \sigma_j^2. Thus, b_{t+1}^j and V_{t+1}^j are computed as follows:

y_{t+1}^j = (y_{t+1} | x_{t+1}^j) = y_{t+1} - \mu_j    [4.3.1.25]

r_{t+1}^j = (r_{t+1} | x_{t+1}^j) = \sigma_j^2    [4.3.1.26]

\tilde{b}_{t+1}^j = \hat{b}_t^i    [4.3.1.27]

\tilde{V}_{t+1}^j = \hat{V}_t^i + q_B    [4.3.1.28]

K_{t+1}^j = \tilde{V}_{t+1}^j \cdot (\tilde{V}_{t+1}^j + r_{t+1}^j)^{-1}    [4.3.1.29]

b_{t+1}^j = \tilde{b}_{t+1}^j + K_{t+1}^j \cdot (y_{t+1}^j - \tilde{b}_{t+1}^j)    [4.3.1.30]

V_{t+1}^j = \tilde{V}_{t+1}^j - K_{t+1}^j \cdot \tilde{V}_{t+1}^j    [4.3.1.31]
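One recursion of [4.3.1.25] to [4.3.1.31] reduces to a scalar Kalman update. The sketch below is a direct transcription; the function and variable names are mine, not from the thesis:

```python
def kalman_step(b_prev, V_prev, y, mu_j, var_j, q_B):
    """Scalar Kalman update conditional on discrete state j,
    transcribing [4.3.1.25]-[4.3.1.31]."""
    y_j = y - mu_j                 # [4.3.1.25] remove the state amplitude
    r_j = var_j                    # [4.3.1.26] state measurement variance
    b_pred = b_prev                # [4.3.1.27] predicted baseline mean
    V_pred = V_prev + q_B          # [4.3.1.28] predicted baseline variance
    K = V_pred / (V_pred + r_j)    # [4.3.1.29] Kalman gain
    b_new = b_pred + K * (y_j - b_pred)   # [4.3.1.30] filtered mean
    V_new = V_pred - K * V_pred           # [4.3.1.31] filtered variance
    return b_new, V_new
```

With an uninformative prior (large V_prev) the filtered baseline moves most of the way to the state-corrected measurement y − μ_j; with a tight prior it barely moves, which is the desired behavior once the baseline is well tracked.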
Naturally, in order to prevent numerical under- or overflow, the logarithm of \delta is used
instead:

\ln \delta_t^j = \max_i \left[ \ln P(x_t^j | x_{t-1}^i) + \ln p(y_t^j | b_t^j) + \ln p(b_t^j | b_{t-1}^i) + \ln \delta_{t-1}^i \right]    [4.3.1.32]

\psi_t^j = \arg\max_i \left[ \ln P(x_t^j | x_{t-1}^i) + \ln p(y_t^j | b_t^j) + \ln p(b_t^j | b_{t-1}^i) + \ln \delta_{t-1}^i \right]    [4.3.1.33]
In addition to \delta and \psi, Viterbi-Kalman stores at any time t, for each discrete state i, the
conditional baseline b_t^i. The optimal mixed state at time T (last data point) is that which
terminates the mixed state sequence with the highest complete data likelihood. From time T, the
most likely sequence of mixed states is reconstructed backwards, down to the first data point,
using the \psi variable:

x_T = \arg\max_j \, (\delta_T^j)    [4.3.1.34]

x_t = \psi_{t+1}^{x_{t+1}}    [4.3.1.35]

b_t = b_t^{x_t}    [4.3.1.36]

V_t = V_t^{x_t}    [4.3.1.37]
The Viterbi-Kalman algorithm can be summarized as follows:

initialize:
    calculate \delta_0^i
for time t = 0 to T-1 do
    for each x_{t+1} = j do
        for each x_t = i do
            calculate (b_{t+1}^j, V_{t+1}^j) with a Kalman filter
            calculate \delta_{t+1}^j
        end i
        choose i that maximizes (\delta_{t+1}^j | \delta_t^i)
        assign \delta_{t+1}^j := \delta_{t+1}^j | \delta_t^i
        assign \psi_{t+1}^j := i
        assign (b_{t+1}^j, V_{t+1}^j) := (b_{t+1}^j, V_{t+1}^j) | (b_t^i, V_t^i)
    end j
end t
j := arg max_j (\delta_T^j)
x_T := j, b_T := b_T^j, V_T := V_T^j
for time t = T-1 downto 0 do
    j := \psi_{t+1}^{x_{t+1}}
    x_t := j, b_t := b_t^j, V_t := V_t^j
end t
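The pseudocode above translates almost line by line into the following minimal NumPy sketch. All names are illustrative, and details such as the t = 0 initialization are my own assumptions; the thesis/QuB implementation may differ:

```python
import math
import numpy as np

def log_gauss(x, mean, var):
    """ln N(mean, var | x)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_kalman(y, A, pi, mu, var, q_B, b0=0.0, V0=1.0):
    """Minimal sketch of the Viterbi-Kalman recursion summarized above.
    y: measurements; A: NS x NS transition matrix; pi: initial state
    probabilities; mu, var: per-state amplitudes and measurement variances;
    q_B: baseline process variance; (b0, V0): baseline prior (assumed)."""
    T, NS = len(y), len(pi)
    logd = np.full((T, NS), -np.inf)    # ln delta
    psi = np.zeros((T, NS), dtype=int)  # backpointers
    b = np.zeros((T, NS))               # conditional baseline means
    V = np.zeros((T, NS))               # conditional baseline variances

    for j in range(NS):                 # t = 0: filter the baseline prior
        yj, Vp = y[0] - mu[j], V0 + q_B
        K = Vp / (Vp + var[j])
        b[0, j] = b0 + K * (yj - b0)
        V[0, j] = Vp - K * Vp
        logd[0, j] = (math.log(pi[j]) + log_gauss(yj, b[0, j], var[j])
                      + log_gauss(b[0, j], b0, Vp))

    for t in range(T - 1):              # forward pass, per [4.3.1.32]-[4.3.1.33]
        for j in range(NS):
            yj = y[t + 1] - mu[j]
            for i in range(NS):
                Vp = V[t, i] + q_B
                K = Vp / (Vp + var[j])
                bf = b[t, i] + K * (yj - b[t, i])
                score = (logd[t, i] + math.log(A[i, j])
                         + log_gauss(yj, bf, var[j])
                         + log_gauss(bf, b[t, i], Vp))
                if score > logd[t + 1, j]:
                    logd[t + 1, j], psi[t + 1, j] = score, i
                    b[t + 1, j], V[t + 1, j] = bf, Vp - K * Vp

    x = np.zeros(T, dtype=int)          # backtrack, per [4.3.1.34]-[4.3.1.37]
    x[T - 1] = int(np.argmax(logd[T - 1]))
    for t in range(T - 2, -1, -1):
        x[t] = psi[t + 1, x[t + 1]]
    return x, b[np.arange(T), x]
```

On a short synthetic trace with amplitudes 0 and 1 and small noise, the recovered discrete path follows the data and the conditional baseline stays near zero, as expected for artifact-free data.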
It must be noted that the most likely mixed state sequence obtained by the Viterbi-Kalman
algorithm is only approximate, in the sense that it is a filtered and not a smoothed state
estimate. Unfortunately, an exact solution is not tractable but, if one is willing to accept higher
computational costs, Viterbi-Kalman can be improved in the sense of a fixed-lag smoother.
4.3.2. Forward-Backward-Kalman algorithm
The Forward-Backward-Kalman (FBK) algorithm operates with a hidden Markov model
(the ion channel) with stochastic, continuously-valued emission parameters (the baseline offset).
The emission parameters are modeled with a linear Gaussian model. FBK uses the
Expectation-Maximization mechanism to perform Markov state inference, followed by maximum likelihood
estimation of the stochastic parameters. It is very similar, in fact, to Baum-Welch: given the
current parameter estimates (baseline offset), the state sequence is inferred, using a modified
Forward-Backward procedure. From the inferred state sequence, the parameters are re-estimated
(new baseline offset), using the Kalman filter-smoother. The iteration is repeated until some
convergence criterion is reached.
I denote by \psi the enlarged parameter set of the hidden Markov model with stochastic
emission parameters. \psi consists of the hidden Markov parameters \theta, complemented by the
baseline b, and by the baseline process variance q_B:

\psi = (\theta, b, q_B)    [4.3.2.1]
The EM Q function can be written in terms of the \psi parameters:

Q(\psi, \psi') = \sum_X \ln p(X, Y | \psi) \cdot p(X | Y, \psi')    [4.3.2.2]
Ideally, all of the parameters that characterize \psi should be estimated, by maximizing the Q
function. Unfortunately, this is not possible, for several reasons. The hidden Markov model is a
rather accurate representation of the ion channel physical process. In contrast, the simple linear
Gaussian model used is only an approximation of the baseline process. The fluctuations are not
entirely stochastic and do not have stationary characteristics throughout the length of the data, which
prevents re-estimation of the baseline process variance q_B.
The Q function is likely to have a surface with many local maxima, since it depends on
an enlarged parameter set \psi, with stochastic components (i.e. b). Thus, if the initial baseline is
not accurate enough, the algorithm is likely to get trapped in a local, erroneous maximum. This
makes the initialization of the baseline sequence a critical issue.
Preliminary tests have indicated that sequential parameter re-estimation gives better
results. Thus, for a given \theta, a maximum likelihood estimate of b is obtained. Then, for the b
estimate, a maximum likelihood estimate of \theta is obtained. The two steps are iterated until
convergence is reached. q_B is not re-estimated. Each step is an EM algorithm in its own right,
which estimates only half of the parameter set \psi, conditional on the other half. The procedure
can be described in pseudo-code as follows:
initialize:
    select b_0 and \theta_0
repeat
    b-step: estimate b_{i+1} given \theta_i
    \theta-step: estimate \theta_{i+1} given b_{i+1}
until convergence test
This sequential parameter re-estimation does not necessarily guarantee the same convergence
properties of EM. Despite this departure from the theoretical EM framework, practical application
of the algorithm indicates good behavior.
The Forward-Backward-Kalman algorithm corresponds to the b-step in the above
scheme. If the estimated baseline sequence b is subtracted from the data sequence Y, the \theta-step
can be performed with Baum-Welch, and it is not discussed here. FBK can be succinctly
described as follows:
initialize:
    select an initial baseline sequence b_0
repeat
    expectation: calculate Q(b_{i+1}, b_i) with modified Forward-Backward
    maximization: re-estimate b_{i+1} with modified Kalman filter-smoother
until convergence test
The expectation step is performed with a modified Forward-Backward procedure,
conditional on the current baseline and \theta parameters. The sufficient statistics are calculated
(the \alpha's, \beta's and \gamma's). The maximization step, performed by a modified Kalman
filter-smoother, uses the sufficient statistics to calculate a maximum likelihood parameter sequence,
i.e. a new baseline. The \theta parameters of the hidden Markov model (initial state probabilities,
transition and emission probabilities) are not changed during the iteration. Since convergence of
the algorithm is not certain, the convergence test could be, for example, the reaching of a
maximum number of iterations. In the following sections, I explain each step in detail.
4.3.2.1 Initialization
Initialization of the baseline is important for avoiding local maxima. I propose here an
initialization method based on the Expectation-Maximization framework. For each time t,
expectation and maximization steps are iterated until convergence. The expectation step infers the
current Markov state from the previous Markov state (which plays the role of initial state
probability vector), conditional on the current baseline estimate. The maximization step calculates
a maximum a posteriori (MAP) estimate of the baseline offset, conditional on the previous
baseline offset and on the current inferred Markov state. The MAP baseline estimate is obtained
with a Kalman filter. The estimate is maximum a posteriori and not maximum likelihood because
its computation involves the previous baseline offset, which is regarded as a parameter prior. This
MAP EM procedure is applied sequentially to each data point, in the forward direction.
Suppose that at time t, the Markov state probability vector is Xt, and the baseline offset is
a Gaussian probability density, with mean bt and variance Vt. At time t+1, the Markov state and
the baseline offset can be predicted independently, as follows:
predicted Markov state:

\tilde{X}_{t+1}^T = X_t^T \cdot A    [4.3.2.1.1]

predicted baseline offset:

\tilde{b}_{t+1} \sim N(b_t, V_t + q_B)    [4.3.2.1.2]
To simplify notation, I write x_t = i as x_t^i, for any t. The Forward-Backward procedure, as
applied to one data point only, calculates the values of \alpha and \gamma from the initial state
probability vector, the current baseline offset and the measurement. For any iteration, the initial
state probabilities are equal to the predicted state probabilities in equation [4.3.2.1.1]. The current
baseline offset is initialized for the first iteration as the predicted baseline in equation [4.3.2.1.2],
and for any other iteration is equal to the MAP estimate from the previous iteration. At time t+1,
these quantities are calculated as follows:

\alpha_{t+1}^i = \tilde{x}_{t+1}^i \cdot p(y_{t+1} | x_{t+1}^i, b_{t+1})    [4.3.2.1.3]

\gamma_{t+1}^i = \alpha_{t+1}^i / \sum_{i=1}^{N_S} \alpha_{t+1}^i    [4.3.2.1.4]

p(y_{t+1} | x_{t+1}^i, b_{t+1}) = N(\mu_i + b_{t+1}, \sigma_i^2 \,|\, y_{t+1})    [4.3.2.1.5]
The MAP baseline at t+1 can be obtained from the Q function and the baseline prior, as
follows:

b_{t+1}^{MAP} = \arg\max_{b_{t+1}} \left[ Q(b_{t+1}, b_{t+1}') + \ln p(b_{t+1} | b_t) \right]    [4.3.2.1.6]

Since only the measurement part of the Q function depends on b_{t+1}, the MAP baseline is:

b_{t+1}^{MAP} = \arg\max_{b_{t+1}} \left[ \sum_i \ln p(y_{t+1} | x_{t+1}^i, b_{t+1}) \cdot p(x_{t+1}^i | y_{t+1}, b_{t+1}') + \ln p(b_{t+1} | b_t) \right]    [4.3.2.1.7]
The solution to the MAP estimation becomes more obvious if the problem is reformulated.
Thus, b_{t+1} can be interpreted as the state of a linear Gaussian model, with process and
measurement equations:

b_{t+1} = b_t + \epsilon_t, \quad \epsilon_t \sim N(0, q_B)    [4.3.2.1.8]

y_{t+1} = b_{t+1} + \eta_{t+1}, \quad \eta_{t+1} \sim N\left( \sum_{i=1}^{N_S} \gamma_{t+1}^i \mu_i, \; \sum_{i=1}^{N_S} \gamma_{t+1}^i \sigma_i^2 \right)    [4.3.2.1.9]
In other words, the measurement Gaussian random variable is a weighted sum of N_S Gaussian
random variables, each with mean \mu_i, variance \sigma_i^2 and weighting factor \gamma_{t+1}^i.
Each of these random variables corresponds to one of the Markov states.
The MAP baseline estimate can be obtained with a Kalman filter, after a simple
modification by which the mean of the measurement probability density is subtracted
from the measurement y_{t+1}. Hence, the mean of the measurement density becomes again zero,
whereas the variance of the measurement density, r_{t+1}, is a weighted sum of the Markov
measurement variances at time t+1:

yt 1 :  yt 1 |  t 1   yt 1    ti   i

[4.3.2.1.10]
i




p yt 1 | bt 1   N  0,   ti1   i2 
 i

[4.3.2.1.11]
NS
rt 1   ti1   i2
[4.3.2.1.12]
i 1
For each time t+1, the baseline initialization method is summarized as:

initialize:
    assign b^0_{t+1} := b_t
repeat
    expectation: estimate \gamma^{i+1}_{t+1} with Forward-Backward given b^i_{t+1} and \gamma_t
    maximization: estimate b^{i+1}_{t+1} with Kalman filter given \gamma^{i+1}_{t+1} and b_t
until convergence test

\gamma^{i+1}_{t+1} is the \gamma quantity at time t+1, calculated in iteration i+1. For the calculation of \gamma^{i+1}_{t+1},
the initial state probability vector is equal to \pi, for t = 0, and to \gamma_t, for t > 0. b^{i+1}_{t+1} is the baseline at
time t+1, calculated in iteration i+1. \gamma_t is the \gamma quantity at time t, after convergence. b_t is the
baseline at time t, after convergence. The procedure is simple and has shown relatively fast
convergence in practical applications.
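For a single data point, the alternation of the state inference [4.3.2.1.3]-[4.3.2.1.5] with the MAP Kalman update [4.3.2.1.10]-[4.3.2.1.12] can be sketched as below. The function name and the fixed iteration count are my own; a real implementation would use a proper convergence test:

```python
import numpy as np

def init_baseline_point(y, X_pred, b_prev, V_prev, mu, var, q_B, n_iter=20):
    """Per-point MAP EM sketch for baseline initialization (section 4.3.2.1).
    X_pred: predicted state probabilities [4.3.2.1.1]; (b_prev, V_prev):
    converged baseline at time t; mu, var: state amplitudes/variances."""
    b = b_prev                            # predicted baseline [4.3.2.1.2]
    Vp = V_prev + q_B
    for _ in range(n_iter):
        # Expectation: gamma_i proportional to X_pred_i * N(mu_i + b, var_i | y)
        lik = np.exp(-0.5 * (y - mu - b) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma = X_pred * lik
        gamma /= gamma.sum()
        # Maximization: MAP Kalman update with mixed mean and variance
        y_res = y - gamma @ mu            # [4.3.2.1.10]
        r = gamma @ var                   # [4.3.2.1.12]
        K = Vp / (Vp + r)
        b = b_prev + K * (y_res - b_prev)
    return gamma, b, Vp - K * Vp
```

With two states of amplitudes 0 and 1, a measurement near 1 and a previous baseline near 0, the iteration settles on the open state with a baseline estimate near zero, as intended.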
4.3.2.2 Expectation
The expectation step of the Forward-Backward-Kalman algorithm calculates the
sufficient statistics using the standard Forward-Backward procedure, with a few preparations:
y_t := (y_t | b_t) = y_t - b_t    [4.3.2.2.1]

\sigma_i^2 := (\sigma_i^2 | V_t) = \sigma_i^2 + V_t    [4.3.2.2.2]
In effect, the baseline sequence b is subtracted from the observation sequence Y, and the variance
of the emission probabilities of each state is increased to account for the variance in the estimated
baseline. The latter is usually very small compared to the state-conditional measurement variance
and can be ignored, except maybe at both ends of the data sequence and possibly in areas with
unusual baseline changes.
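The two preparations can be written directly. The sketch below uses my own array conventions (y and V of length T, var of length N_S), which are not from the thesis:

```python
import numpy as np

def prepare_expectation(y, b, V, var):
    """Preparations for the expectation step, per [4.3.2.2.1]-[4.3.2.2.2]:
    subtract the current baseline from the data and inflate each state's
    emission variance by the baseline uncertainty at that point."""
    y_corr = y - b                        # [4.3.2.2.1]
    var_infl = var[None, :] + V[:, None]  # [4.3.2.2.2], shape (T, NS)
    return y_corr, var_infl
```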
4.3.2.3 Maximization
The maximization step estimates the most likely baseline sequence b^{ML} given the
relevant sufficient statistics from the expectation step, namely the \gamma's, using a standard Kalman
filter-smoother. As in the initialization part described above, a few preparations are necessary:

y_t := (y_t | \gamma_t) = y_t - \sum_i \gamma_t^i \mu_i    [4.3.2.3.1]

r_t := (r_t | \gamma_t) = \sum_i \gamma_t^i \sigma_i^2    [4.3.2.3.2]
In effect, the Markov signal is subtracted from the observation sequence, and the remainder is
smoothed with the Kalman filter-smoother to determine the most likely baseline offset sequence.
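Using [4.3.2.3.1]-[4.3.2.3.2] and a scalar Kalman filter followed by a Rauch-Tung-Striebel backward pass, the maximization step can be sketched as below. The prior (b0, V0) and all names are illustrative assumptions:

```python
import numpy as np

def fbk_maximization(y, gamma, mu, var, q_B, b0=0.0, V0=1.0):
    """Maximization step sketch: subtract the expected Markov signal per
    [4.3.2.3.1]-[4.3.2.3.2], then smooth the residual with a scalar Kalman
    filter-smoother (forward filter + RTS backward pass).
    gamma: T x NS state posteriors; mu, var: state amplitudes/variances."""
    T = len(y)
    y_res = y - gamma @ mu     # [4.3.2.3.1] remove the Markov signal
    r = gamma @ var            # [4.3.2.3.2] mixed measurement variance

    bp, Vp = np.zeros(T), np.zeros(T)   # predicted mean/variance
    bf, Vf = np.zeros(T), np.zeros(T)   # filtered mean/variance
    b_prev, V_prev = b0, V0
    for t in range(T):                  # forward Kalman filter
        bp[t], Vp[t] = b_prev, V_prev + q_B
        K = Vp[t] / (Vp[t] + r[t])
        bf[t] = bp[t] + K * (y_res[t] - bp[t])
        Vf[t] = Vp[t] - K * Vp[t]
        b_prev, V_prev = bf[t], Vf[t]

    b_s = bf.copy()                     # RTS backward smoothing pass
    for t in range(T - 2, -1, -1):
        C = Vf[t] / Vp[t + 1]
        b_s[t] = bf[t] + C * (b_s[t + 1] - bp[t + 1])
    return b_s
```

On data with a constant offset and a single zero-amplitude state, the smoothed baseline converges to that offset, which is the expected behavior of this step.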
4.4 Results
In this section I test both algorithms: Viterbi-Kalman and Forward-Backward-Kalman.
Both have only one adjustable parameter: the variance qB of the baseline process. I refer to qB as
SimBaseVar in the context of simulating baseline fluctuations, and as BaseVar in the context of
baseline correction.
I have simulated data with different signal-to-noise ratios, with and without baseline
artifacts, and have compared the idealization and the conductance parameters produced by the
algorithm, with the true idealization and true conductance parameters. Thus, I was able to
evaluate the performance of the idealization/baseline correction algorithm in terms of state
sequence inference and conductance parameters estimation, and to assess the effects that the
method has on the estimated kinetic parameters.
Rigorously, the signal-to-noise ratio (SNR) is defined as the ratio between the power of
the signal and the power of the noise. As the true SNR is difficult to calculate, I prefer to use a
more operational definition, approximate but intuitive. Thus, I define the signal-to-noise ratio of
the ion channel signal as:
SNR = \frac{Amp_{Open} - Amp_{Close}}{MeasStd_{Open}}    [4.4.1]
I chose the measurement variance corresponding to the open state since it is usually the
highest, due to state excess noise, and gives the SNR of the worst case scenario. Obviously,
defined as such, the SNR does not depend on the ion-channel kinetics, which is a crude
approximation. Also, this SNR is proportional to the square root of the true SNR (since the power
is proportional to the variance).
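The operational definition [4.4.1] is a one-liner; the function name is mine:

```python
def snr(amp_open, amp_close, std_open):
    """Operational SNR per [4.4.1]: channel amplitude divided by the
    open-state measurement standard deviation (the worst-case noise)."""
    return (amp_open - amp_close) / std_open
```

For example, snr(1.0, 0.0, 0.2) gives 5, the threshold that separates the Viterbi-based and Forward-Backward-based regimes identified later in this chapter.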
I varied the SNR of the simulated data between 10 and 0.5, with 10 corresponding to the
easiest and 0.5 to the most difficult case. I expected a low SNR to result in mistakes in the
idealized trace, leading to errors in the estimation of both conductance and kinetics parameters. I
did not go below SNR 0.5, for the practical reason that data with such low SNR will most likely
not be identified as containing any signal at all.
I have added to the simulated data two kinds of baseline artifacts, separately. The first
kind of artifact consists of fluctuations generated by a simple linear Gaussian model with no
measurement noise, with process equation:
b_t = b_{t-1} + \epsilon_B, \quad \epsilon_B \sim N(0, q_B)    [4.4.2]
The varied parameter is the variance of the process, qB, which I will refer to as
SimBaseVar. A zero value of SimBaseVar corresponds to no baseline fluctuations. A larger
value results in fluctuations of higher amplitude. The same seed of the random number generator
was used to generate the baseline in all experiments, so the shape but not the amplitude of the
fluctuations was identical, allowing for direct comparisons.
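The random-walk artifact of [4.4.2] is straightforward to simulate. Fixing the generator seed reproduces the same fluctuation shape at any SimBaseVar, as done in the experiments; the function name and NumPy generator choice are mine:

```python
import numpy as np

def simulate_baseline(T, sim_base_var, seed=0):
    """Random-walk baseline per [4.4.2]: b_t = b_{t-1} + eps_B,
    eps_B ~ N(0, SimBaseVar). Same seed => same shape, scaled amplitude."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(0.0, np.sqrt(sim_base_var), T)
    return np.cumsum(steps)
```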
The second kind of fluctuation is a deterministic, periodic noise, in the form of a
sinewave with given frequency and amplitude, which I will refer to as SineFreq and SineAmp,
respectively. I used three values for SineAmp: 0 (corresponding to no sinewave interference), 0.5
(within the state amplitudes) and 1. For SineFreq I chose two values: 60 Hz and 600 Hz. The 60
Hz value corresponds to interference caused by power lines. The sinewave interacts with the
sampling frequency, which was 100 kHz in all experiments. Since not all patch-clamp data is
sampled at 100 kHz, but possibly lower, I chose the 600 Hz frequency to mimic the situation of
identical in all experiments. The maximum slope of the 600 Hz sinusoid is ten times larger than
that of the 60 Hz sinusoid. Intuitively, it is easy to see how contamination with a lower frequency,
relative to the sampling frequency, will be easier to correct. It must be noted that any kind of
added baseline artifact leads to a decrease in the effective SNR.
The amplitudes of the close and the open conductance classes, which I refer to as Amp0
and Amp1, were in all experiments 0 and 1, respectively. For convenience, I used identical state
variances for the two conductance classes, which I refer to as MeasVar to denote the simulation
parameter and Std0 and Std1 to denote an estimated parameter. I varied MeasVar of the
simulated data in order to change SNR.
The seed of the random number generator used by the simulator was the same in each
experiment, so the generated state sequence was the same, irrespective of the parameters used in
simulation and independent of the baseline artifacts. Direct comparisons of the tested algorithms
in terms of state inference, conductance and kinetics parameters estimation are thus possible. The
kinetics was estimated with three maximum likelihood methods: MIL (Qin et al, 1996, 1997),
MPL (Qin et al, 2000a) and MIP (presented in Chapter 3).
Experience shows that idealizing data with Viterbi is feasible only for a relatively good
SNR. For low SNR, not only is idealization with Viterbi less reliable, but estimating kinetics
from such idealized data (with MIL or MIP) results in gross errors. Forward-Backward and MPL
are required in these situations. Therefore, my first goal was to find the working SNR range for
Viterbi and for Forward-Backward. Thus, the most efficient kinetic estimation method is applied:
at high SNR, MIL or MIP runs fast with idealized data (obtained by either idealization
algorithm); at low SNR, MPL runs more slowly using raw data (but baseline corrected).
All data was simulated with the model in Figure [4.4.1] and the parameters in Table
[4.4.1].
Figure [4.4.1]. Model used for simulation: a linear three-state chain, 1 - 2 - 3, with rate constants
100.0 and 50.0 between states 1 and 2, and 5000.0 and 2000.0 between states 2 and 3.
PtsPerSeg    SegCnt    Sampling [kHz]    Amp1    Amp2
1000000      1         100               0.0     1.0

Table [4.4.1]. Simulation parameters
PtsPerSeg and SegCnt denote, respectively, the number of points in a data set, and the number of
data sets analyzed together.
I chose the simulation model so as to generate both fast and slow kinetics in the same
data set. Thus the need for separately testing slow and fast kinetics is eliminated. The simulated
data presents clusters of activity, separated by longer closed intervals.
The models used in idealization and kinetics estimation are shown in Figure [4.4.2]. I had
to use the asymmetric model (Figure [4.4.2], B) because, with symmetry, the Baum-Welch
re-estimation could not distinguish between the two close states, getting stuck in a suboptimal
solution. Although Viterbi assigns equal initial probability to the two close states, only one is
chosen as the starting state. In contrast, Forward-Backward does not choose one single state for
the initial point, and so the re-estimated kinetics remains symmetric and suboptimal, at any
iteration. Instead of starting with a (slightly) asymmetric model, I could have used unequal
starting probabilities, with the same effect. In fact, it is generally good practice to start with
asymmetric models, therefore avoiding other unpleasant side-effects, such as multiple (repeated,
confluent) eigenvalues of the Q matrix.
The parameters for the MIP, MIL and MPL algorithms are given in Tables [4.4.2], [4.4.3] and
[4.4.4], respectively. Td is the dead-time correction in MIL. The other parameters are
self-explanatory.
Figure [4.4.2]. Models used in idealization and kinetics estimation. A: linear three-state chain
1 - 2 - 3 with all four rate constants equal to 100.0, used for idealization and re-estimation with
Segmental k-means and for kinetics estimation. B: linear three-state chain with rate constants
100.0/100.0 between states 1 and 2 and 200.0/200.0 between states 2 and 3, used for idealization
and re-estimation with Baum-Welch. C: simplified two-state model with rate constants
1000.0/1000.0, used for idealization.
MaxStep    MaxIter    LLConv    GradConv
0.2        50         1e-8      1e-8

Table [4.4.2]. MIP parameters

MaxStep    MaxIter    Td [ms]    LLConv    GradConv
0.5        100        0.015      1e-4      1e-4

Table [4.4.3]. MIL parameters

MaxStep    MaxIter    LLConv    GradConv
0.5        100        1e-4      1e-4

Table [4.4.4]. MPL parameters
4.4.1 Idealization of simulated data without baseline artifacts
In this section I evaluate the performance of signal detection and parameter re-estimation
algorithms, with regard to the signal-to-noise ratio. The empirically established operational range
of each method indicates the conditions under which to apply it further to baseline correction. The
signal detection algorithms are Viterbi, Forward-Viterbi and Forward-Backward. The re-estimation algorithms are Segmental k-means and Baum-Welch.
For testing signal detection, I use the same model as for simulation, starting with the true
kinetics and conductance parameters (Figure [4.4.2], A or B), or a simplified model (Figure
[4.4.2], C), with the true conductance parameters. The same strategy is applied to re-estimation.
The signal detection performance is quantified by the deviation of the rate constants
estimated from the resulting idealization with MIL and MIP algorithms, from the true rate
constants. Since the amount of simulated data is finite and the rate estimation methods are not the
subject of testing, the true rate constants are defined as those estimated from the simulated
idealization. These rates will be slightly different from the mean rates of the simulation model
due to the stochastic nature of the Markov signal. The re-estimation performance is assessed in
the same way, and additionally, from the deviation of the estimated conductance parameters from
the true values.
The results obtained with the different signal detection algorithms are presented in
Figures [4.4.1.2] and [4.4.1.3]. With the true model, Viterbi gives the worst results, producing a
very good idealization at SNR ≥ 5 and acceptable results down to SNR 2. Below 2, Viterbi is
unusable. Forward-Viterbi is somewhat better but still not usable at or below SNR 1. In contrast,
Forward-Backward gives acceptable idealization even at SNR 0.5. With a simplified C-O model, both
Viterbi and Forward-Backward (with a two-state model, Forward-Viterbi reduces to Viterbi) start
losing accuracy when the contribution of the kinetics to the likelihood function becomes
important, below SNR 5.
The results of the re-estimation algorithms are presented in Figures [4.4.1.4] and
[4.4.1.5]. The re-estimation of conductance parameters with Baum-Welch has an error one to
two orders of magnitude smaller than with Segmental k-means. The idealization performance is
qualitatively similar to that of the signal detection procedure alone, but worse, since both Viterbi
and Forward-Backward have to work with more or less inaccurate parameters. In both cases MPL
performs better than either MIL or MIP at lower SNR, when the complete data likelihood of one
sequence of classes is no longer equivalent to the incomplete data likelihood of all possible
sequences. MIL performs slightly better than MIP due to the missed event correction capability.
I conclude that either Viterbi alone or Segmental k-means should be limited to SNR ≥ 5,
when they are more efficient in terms of computational speed, memory requirements and
convergence rate (results not shown). For lower signal to noise ratios, or when high accuracy of
conductance parameters is required, either Forward-Backward alone or Baum-Welch should be
employed, at the cost of lower efficiency. MIL can be safely used down to SNR 1 if the
idealization is obtained with Baum-Welch. Below SNR 1 only MPL works. At SNR ≥ 5, a
simplified model is enough for a quality idealization but a more accurate model should be used
beyond that, especially when the eigenvalues of the generating model span many orders of
magnitude. Based on these conclusions, I limit application of Viterbi-Kalman to baseline
correction and idealization of data with SNR ≥ 5, and of Forward-Backward-Kalman to data with
SNR < 5.
Figure [4.4.1.1]. Simulated data without baseline artifacts, at SNR 10, 5, 2 and 1, idealized with
Segmental k-means and Baum-Welch. At SNR 1, the signal is almost buried in noise. SKM makes
considerably more errors than BW.
Figure [4.4.1.2]. Kinetics estimation from simulated data at different SNR, without baseline artifacts.
Data idealized with Viterbi, Forward-Viterbi or Forward-Backward, using the true model. The
kinetics were estimated from the idealization with MIL or MIP.
115
[Figure [4.4.1.3]: plots not reproduced; panels labeled Viterbi and Forward-Backward, and MIL and MIP; each plots the estimated rate constants K12, K21, K23 and K32 together with the true rates (log scale, 10 to 10000) against SNR (10, 5, 2, 1, 0.5).]
Figure [4.4.1.3]. Kinetics estimation from simulated data at different SNR, without baseline artifacts. Data idealized with Viterbi or Forward-Backward, with a simplified C-O model. The kinetics were estimated from the idealization with MIL or MIP.
[Figure [4.4.1.4]: plots not reproduced; panels A (Segmental k-means) and B (Baum-Welch) show the relative error (Meas-True)/True of Amp0, Std0, Amp1 and Std1 against SNR (10, 5, 2, 1, 0.5).]
Figure [4.4.1.4]. Re-estimation of conductance parameters, from simulated data at different SNR, without baseline artifacts. Data idealized with Segmental k-means or Baum-Welch, using the true model. Notice the different scale of the y axis in A and B. Conductance parameter estimation is much more accurate with BW than with SKM.
[Figure [4.4.1.5]: plots not reproduced; panels labeled MIL, MPL and MIP, and Segmental k-means and Baum-Welch; each plots the estimated rate constants K12, K21, K23 and K32 together with the true rates (log scale, 10 to 10000) against SNR (10, 5, 2, 1, 0.5).]
Figure [4.4.1.5]. Kinetics estimation from simulated data at different SNR, without baseline artifacts. Data idealized with Segmental k-means or Baum-Welch, using the true model. The kinetics were estimated from the idealized data with MIL or MIP, or from the raw data with MPL (using the re-estimated conductance parameters). Kinetics estimation from data idealized with BW is much more accurate than with SKM.
4.4.2 Idealization and baseline correction of simulated data with stochastic baseline
fluctuations
The amplitude of the baseline fluctuations is determined by the process variance qB. I
have simulated data with increasing fluctuation amplitude, with SimBaseStd 0, 0.005, 0.01,
0.02, 0.05 and 0.1 (SimBaseStd is the square root of SimBaseVar). A SimBaseStd of 0.05 or
0.1 results in extreme baseline artifacts, more severe than any single channel data is likely to
suffer. The simulated data with SimBaseStd 0.005 and 0.01 resembles real data, with
enough baseline fluctuations to render the existing analysis algorithms unusable. Obviously, these
numbers are only meaningful relative to the actual difference in conductance class
amplitudes (in this case 1 pA).
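As a concrete illustration of these quantities, data of this kind can be generated in a few lines. This is a minimal sketch, not the actual simulation code: for simplicity the class sequence is drawn independently at each sample rather than from a Markov model, and all names are illustrative. The baseline is the random walk b[t] = b[t-1] + w[t], with w ~ N(0, qB):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trace(n, amp=1.0, snr=5.0, sim_base_std=0.01):
    """Two-class current trace plus random-walk baseline (sketch).

    amp is the open-class amplitude (1 pA in the text); the state noise
    std is amp / snr; the baseline obeys b[t] = b[t-1] + w[t] with
    w ~ N(0, SimBaseVar), where sim_base_std = sqrt(SimBaseVar) = sqrt(qB).
    """
    classes = rng.integers(0, 2, size=n)   # toy class sequence, no kinetics
    baseline = np.cumsum(rng.normal(0.0, sim_base_std, size=n))
    noise = rng.normal(0.0, amp / snr, size=n)
    return classes * amp + baseline + noise

trace = simulate_trace(10000, sim_base_std=0.05)
```

After n samples the random walk has drift standard deviation sim_base_std × sqrt(n), so SimBaseStd 0.05 drifts by about 5 pA over 10000 points, which matches the "extreme" label above.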
Figures [4.4.2.1] through [4.4.2.6] show the performance of the baseline correction in the
time domain. A low and a high resolution view are presented, selected arbitrarily from the data
file (the same selection for all figures). In each figure, view A shows the data with baseline
fluctuations and view B the data without baseline fluctuations (the so-called true data). The other
views (C, D etc.) show the baseline corrected data, obtained with different methods and possibly
different values of BaseStd (the square root of BaseVar). In the high resolution view, the detected
signal is overlaid on the true and corrected data. Unless otherwise specified, the BaseStd
parameter for correction was the same as the SimBaseStd parameter for simulation.
Viterbi-Kalman is used only to correct data with SNR 5 (Figures [4.4.2.1] through
[4.4.2.4]). Forward-Backward-Kalman is used to correct data with SNR 5 and SimBaseStd 0.01
and below (Figures [4.4.2.2] through [4.4.2.6]). At SNR 5, VK makes practically no visible
mistakes up to SimBaseStd 0.05, and fails almost completely at SimBaseStd 0.1. FBK starts
making visible mistakes at SimBaseStd 0.05, by confounding the open class with the closed class
and vice versa. However, data simulated with SimBaseStd above 0.05 is unlikely to match any
real data, and if it does, hopefully it lasts only for short segments.
Below SNR 5, the correction is more difficult to inspect, as the signal gets buried in
noise. The idealized data in the high resolution view indicates that FBK does a fine job of baseline
correcting and idealizing the data. Relative to the Forward-Backward idealization of the true data, the
FBK idealization is very good up to SimBaseStd 0.05 at SNR 2 and SimBaseStd 0.02 at SNR 1.
Some false positives and false negatives are nevertheless present, with many more at higher
values of SimBaseStd.
The next set of figures, [4.4.2.7] through [4.4.2.10], shows the performance of the
baseline correction in the frequency domain. For data at SNR 5, only the spectra of Viterbi-Kalman
corrected data are shown, in Figure [4.4.2.7]. For data at SNR 2 and 1, the spectra of
Forward-Backward-Kalman corrected data are shown in Figures [4.4.2.8] and [4.4.2.9],
respectively. Not surprisingly, the spectral results match their time domain counterparts. The
spectra of baseline corrected data are almost identical to the true data spectra, except at a high
value of SimBaseStd (0.05). Figure [4.4.2.10] presents the effect of BaseStd in the frequency
domain. Simulated data at SNR 5, with SimBaseStd 0.02, is baseline corrected with Viterbi-Kalman
and Forward-Backward-Kalman, using different values for the BaseStd parameter. VK
appears to be more sensitive, whereas FBK produces almost identical spectra. It also seems that
overestimation of BaseStd is worse than underestimation. More on this issue follows.
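Spectra of this kind can be computed from any corrected trace with a standard periodogram. A minimal sketch (my own illustration, not the code used for the figures; no windowing or segment averaging, and an even number of samples is assumed):

```python
import numpy as np

def power_spectrum(x, fs):
    """One-sided periodogram, in pA^2/Hz for a trace in pA (sketch).

    No windowing or averaging; assumes len(x) is even so the last
    rfft bin is the Nyquist frequency.
    """
    n = len(x)
    spec = np.fft.rfft(x)
    psd = (np.abs(spec) ** 2) / (fs * n)
    psd[1:-1] *= 2.0                        # fold in negative frequencies
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd

# Example: a 50 Hz sine sampled at 1 kHz gives a single peak at 50 Hz.
t = np.arange(1000) / 1000.0
freqs, psd = power_spectrum(np.sin(2 * np.pi * 50 * t), fs=1000.0)
```

Plotting log10 of the power against log10 of the frequency reproduces the axes used in the spectral figures.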
The next two figures, [4.4.2.11] and [4.4.2.12], show the re-estimation of conductance
parameters from baseline corrected data. Figure [4.4.2.11] shows that the re-estimation error is
approximately equal for Viterbi-Kalman and Forward-Backward-Kalman. Considering that
Baum-Welch has much smaller errors than Segmental k-means when used on data without
baseline artifacts, it can be inferred that the error is mostly due to the baseline fluctuations. It is
immediately apparent that, within the operational range of each method, the conductance
parameters are underestimated. The elimination of baseline fluctuations seems to have an effect
similar to that of a low-pass filter: the amplitude of the open class is lowered, and the state
conditional noise is reduced. Outside the operational range (at high SimBaseStd), the amplitude
of the open class becomes unpredictable, while the state noise first drops and then rises. The
overestimation of noise is explained by classification errors, as misclassified points result in
increased variance. Figure [4.4.2.12] shows the effect of BaseStd on conductance parameters.
Data simulated at SNR 5, with baseline fluctuations of SimBaseStd 0.02, is baseline corrected
and idealized with VK and FBK, with different values for BaseStd. Both lower and higher
values result in a very slight underestimation of the amplitude. Lower values give about 5%
overestimation of the state noise, while higher values result in about 10% underestimation.
The last set of figures, [4.4.2.13] through [4.4.2.15], shows the estimation of kinetic
parameters from baseline corrected data. The four rate constants are plotted as a function of
SimBaseStd in Figures [4.4.2.13] and [4.4.2.14], and as a function of BaseStd in [4.4.2.15]. Both
the estimated and the true rate constants are shown, in a logarithmic plot. The true rates are
actually the rates estimated from data without baseline artifacts, using the corresponding
idealization (Segmental k-means or Baum-Welch) and rate estimation (MIL, MPL or MIP)
algorithm. A first conclusion is that MPL is more accurate than MIL or MIP, irrespective of
the baseline correction method, SNR or amount of baseline fluctuations. MIL is only slightly
better than MIP, and mostly when associated with Viterbi-Kalman. With FBK-corrected data,
relatively accurate rates can be estimated from data with SimBaseStd as high as 0.05 for SNR 2
and 0.02 for SNR 1. At SNR 5, both VK and FBK are good up to SimBaseStd 0.05.
[Figure [4.4.2.1]: traces not reproduced; low (50 ms) and high (2 ms) resolution views with pA scale bars.]
Figure [4.4.2.1]. Simulated data at SNR 5 with stochastic baseline artifacts at two different time resolutions. A Simulated data with SimBaseStd 0.01. B True data with true idealization. C Data in A idealized and baseline corrected with Viterbi-Kalman.
[Figure [4.4.2.2]: traces not reproduced; low (50 ms) and high (2 ms) resolution views with pA scale bars.]
Figure [4.4.2.2]. Simulated data at SNR 5 with stochastic baseline artifacts at two different time resolutions. A Simulated data with SimBaseStd 0.02. B True data with true idealization. C Data in A idealized and baseline corrected with Viterbi-Kalman. D Data in A idealized and baseline corrected with Forward-Backward-Kalman.
[Figure [4.4.2.3]: traces not reproduced; low (50 ms) and high (2 ms) resolution views with pA scale bars.]
Figure [4.4.2.3]. Simulated data at SNR 5 with stochastic baseline artifacts at two different time resolutions. A Simulated data with SimBaseStd 0.05. B True data with true idealization. C Data in A idealized and baseline corrected with Viterbi-Kalman, with BaseStd 0.05. D Data in A idealized and baseline corrected with Forward-Backward-Kalman, with BaseStd 0.05. E Data in A, with Forward-Backward-Kalman, with BaseStd 0.02. F, G Example of FBK thrown off by a spike in the data. This could be avoided by slight filtering.
[Figure [4.4.2.4]: traces not reproduced; low (50 ms) and high (2 ms) resolution views with pA scale bars.]
Figure [4.4.2.4]. Simulated data at SNR 5 with stochastic baseline artifacts at two different time resolutions. A Simulated data with SimBaseStd 0.1. B True data with true idealization. C Data in A idealized and baseline corrected with Viterbi-Kalman, with BaseStd 0.05. D Data in A idealized and baseline corrected with Forward-Backward-Kalman, with BaseStd 0.05.
[Figure [4.4.2.5]: traces not reproduced; low (20 ms) and high (2 ms) resolution views with pA scale bars.]
Figure [4.4.2.5]. Simulated data at SNR 2 with stochastic baseline artifacts at two different time resolutions. A Simulated data with SimBaseStd 0.01. B SBS 0.02. C SBS 0.05. Notice the change in amplitude scales. D True data, idealized with FB. E Data in A idealized and baseline corrected with FBK. F Data in B, with FBK. G Data in C, with FBK.
[Figure [4.4.2.6]: traces not reproduced; low (20 ms) and high (2 ms) resolution views with pA scale bars.]
Figure [4.4.2.6]. Simulated data at SNR 1 with stochastic baseline artifacts at two different time resolutions. A Simulated data with SimBaseStd 0.01. B SBS 0.02. C SBS 0.05. Notice the change in amplitude scales. D True data, idealized with FB. E Data in A idealized and baseline corrected with FBK. F Data in B, with FBK. G Data in C, with FBK.
[Figure [4.4.2.7]: spectra not reproduced; panels A-H plot Log10 power (pA^2/Hz) against log10 frequency (Hz).]
Figure [4.4.2.7]. Power spectra of simulated data at SNR 5 with stochastic baseline fluctuations, baseline corrected with Viterbi-Kalman. A Simulated data with SimBaseStd 0.05. B SimBaseStd 0.02. C SimBaseStd 0.01. D, H True data. E Data in A baseline corrected. F Data in B baseline corrected. G Data in C baseline corrected.
[Figure [4.4.2.8]: spectra not reproduced; panels A-H plot Log10 power (pA^2/Hz) against log10 frequency (Hz).]
Figure [4.4.2.8]. Power spectra of simulated data at SNR 2 with stochastic baseline fluctuations, baseline corrected with Forward-Backward-Kalman. A Simulated data with SimBaseStd 0.05. B SimBaseStd 0.02. C SimBaseStd 0.01. D, H True data. E Data in A baseline corrected. F Data in B baseline corrected. G Data in C baseline corrected.
[Figure [4.4.2.9]: spectra not reproduced; panels A-H plot Log10 power (pA^2/Hz) against log10 frequency (Hz).]
Figure [4.4.2.9]. Power spectra of simulated data at SNR 1 with stochastic baseline fluctuations, baseline corrected with Forward-Backward-Kalman. A Simulated data with SimBaseStd 0.05. B SimBaseStd 0.02. C SimBaseStd 0.01. D, H True data. E Data in A baseline corrected. F Data in B baseline corrected. G Data in C baseline corrected.
[Figure [4.4.2.10]: spectra not reproduced; panels A-H plot Log10 power (pA^2/Hz) against log10 frequency (Hz).]
Figure [4.4.2.10]. Power spectra of simulated data at SNR 5 with stochastic baseline fluctuations, using SimBaseStd 0.02, baseline corrected with Viterbi-Kalman and Forward-Backward-Kalman, with different choices of BaseStd. A Corrected with VK, BaseStd 0.05. B Corrected with VK, BaseStd 0.02. C Corrected with VK, BaseStd 0.01. D Corrected with VK, BaseStd 0.005. E Corrected with FBK, BaseStd 0.05. F Corrected with FBK, BaseStd 0.02. G Corrected with FBK, BaseStd 0.01. H Corrected with FBK, BaseStd 0.005. Notice that the method seems quite robust and insensitive to the choice of BaseStd.
[Figure [4.4.2.11]: plots not reproduced; panels A and B at SNR 5, C at SNR 2, D at SNR 1 and E at SNR 0.5 show the relative error (Meas-True)/True of Std0, Amp1 and Std1 against BaseStd.]
Figure [4.4.2.11]. Re-estimation of conductance parameters from data simulated at different SNR, with stochastic baseline artifacts. A Data idealized and baseline corrected with Viterbi-Kalman. B, C, D, E Data idealized and baseline corrected with Forward-Backward-Kalman. The baseline correction and idealization procedure replaces the standard method, in Segmental k-means and Baum-Welch, respectively. For BaseStd ≤ 0.02, conductance parameters are underestimated by less than 5%. For BaseStd > 0.02 (large baseline fluctuations), the data is not idealized correctly, thus conductance parameters may be overestimated (especially the noise).
[Figure [4.4.2.12]: plots not reproduced; panels A (Viterbi-Kalman) and B (Forward-Backward-Kalman) show the relative error (Meas-True)/True of Std0, Amp1 and Std1 against BaseStd (0.005 to 0.05).]
Figure [4.4.2.12]. Conductance parameter re-estimation from simulated data at SNR 5 with stochastic baseline fluctuations, with SimBaseStd 0.02, baseline corrected with Viterbi-Kalman and Forward-Backward-Kalman, with different BaseStd. A Data baseline corrected with Viterbi-Kalman. B Data baseline corrected with Forward-Backward-Kalman. For both algorithms, the errors in amplitude estimation are very small and insensitive to the choice of BaseStd. For low BaseStd, not all the fluctuations are tracked, thus noise is overestimated. For high BaseStd, part of the state noise is attributed to baseline fluctuations, thus noise is underestimated.
[Figure [4.4.2.13]: plots not reproduced; panels A-C (Viterbi-Kalman) and D-F (Forward-Backward-Kalman), labeled MIL, MPL and MIP; each plots the rate constants K12, K21, K23 and K32 and the true rates (log scale) against SimBaseStd (0 to 0.1).]
Figure [4.4.2.13]. Kinetic parameter estimation from data simulated at SNR 5 with stochastic baseline fluctuations. Data simulated with different SimBaseStd is baseline corrected and idealized with Viterbi-Kalman and Forward-Backward-Kalman. The kinetics parameters are estimated with MIL and MIP from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data, while thin lines are parameters estimated from true data, using the same procedures. For SimBaseStd ≤ 0.05, the baseline correction does not cause any visible alteration of kinetics. At low SimBaseStd, VK and FBK correct almost identically, while at high SimBaseStd FBK is significantly better. MIL, MPL and MIP give nearly identical estimates.
[Figure [4.4.2.14]: plots not reproduced; panels A-C at SNR 2 and D-F at SNR 1, labeled MIL, MPL and MIP; each plots the rate constants K12, K21, K23 and K32 and the true rates (log scale) against SimBaseStd (0 to 0.05).]
Figure [4.4.2.14]. Kinetic parameter estimation from data simulated at SNR 2 and 1 with stochastic baseline fluctuations. Data simulated with different SimBaseStd is baseline corrected and idealized with Forward-Backward-Kalman. The kinetics parameters are estimated with MIL and MIP from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data, while thin lines are parameters estimated from true data, using the same procedures. For SimBaseStd ≤ 0.02, the baseline correction does not cause any visible alteration of kinetics, at either SNR 2 or 1. Regardless of SimBaseStd, at SNR 1 only MPL gives reliable kinetic estimates, as expected. At SNR 2, MPL is more accurate, but MIL and MIP are close.
[Figure [4.4.2.15]: plots not reproduced; panels A-C (Viterbi-Kalman) and D-F (Forward-Backward-Kalman), labeled MIL, MPL and MIP; each plots the rate constants K12, K21, K23 and K32 and the true rates (log scale) against BaseStd (0.005 to 0.05).]
Figure [4.4.2.15]. Kinetic parameter estimation from data simulated at SNR 5, with stochastic baseline fluctuations using SimBaseStd 0.02. Data is baseline corrected and idealized with Viterbi-Kalman and Forward-Backward-Kalman, with different BaseStd. The kinetics parameters are estimated with MIL and MIP from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data, while thin lines are parameters estimated from true data, using the same procedures. When data is corrected with VK, too low or too high values of BaseStd cause some small inaccuracies in kinetic estimation, with any of MIL, MPL or MIP (with MIP slightly worse). When FBK is used, the estimated kinetics are very accurate, irrespective of the optimization method.
4.4.3 Idealization and baseline correction of simulated data contaminated with
sinewave artifacts
The two baseline correction algorithms are optimal for stochastic baseline fluctuations
generated by a linear Gaussian model. It is therefore of considerable interest to determine how well
sinewave artifacts can be removed. To do so, I have designed and performed experiments similar
to those described in the preceding chapter, with the difference that the baseline artifacts took
the form of sinewave interference.
The amplitude (0.5 or 1.0) and the frequency (60 or 600 Hz) of the sinewave, relative to
the open class amplitude and sampling frequency, respectively, should cover most practical, even
extreme situations. The BaseStd parameter of the two baseline correction algorithms no
longer has a direct meaning. Empirically, I found and used the values that worked best (results not
shown). The optimal values for BaseStd apparently depend mostly on SineAmp and SineFreq,
irrespective of SNR. They are: 0.01 for SineAmp 0.5 and SineFreq 60, 0.02 for SineAmp 1 and
SineFreq 60, and 0.05 for SineAmp 0.5 or 1 and SineFreq 600.
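This type of contamination is straightforward to reproduce. The sketch below is my own illustration (function and variable names are hypothetical); it superimposes a sinewave, scaled relative to the open-class amplitude, and records the empirical BaseStd values just given as a lookup table:

```python
import numpy as np

def add_sinewave(trace, sine_amp, sine_freq, fs, open_amp=1.0, phase=0.0):
    """Superimpose sinewave interference on a current trace (sketch).

    sine_amp is expressed relative to the open-class amplitude open_amp
    (pA); sine_freq is in Hz; fs is the sampling rate in Hz.
    """
    t = np.arange(len(trace)) / fs
    return trace + sine_amp * open_amp * np.sin(2 * np.pi * sine_freq * t + phase)

# Empirically best BaseStd values reported in the text, keyed by
# (SineAmp, SineFreq); the dictionary itself is just a convenience.
BEST_BASESTD = {(0.5, 60): 0.01, (1.0, 60): 0.02,
                (0.5, 600): 0.05, (1.0, 600): 0.05}
```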
Figures [4.4.3.1] through [4.4.3.8] show the performance of the baseline correction in the
time domain. Viterbi-Kalman is used to correct only data with SNR 5 (Figures [4.4.3.1] and
[4.4.3.2]). Forward-Backward-Kalman is used to correct data with SNR 5 and below (Figures
[4.4.3.3] through [4.4.3.8]). At SNR 5, VK makes practically no mistakes for SineFreq 60 Hz with
SineAmp 0.5 and 1.0. For SineFreq 600 Hz, some errors appear at SineAmp 0.5, and they
become quite serious at SineAmp 1.0. FBK shows identical performance at SNR 5, with serious
errors visible only for SineFreq 600 Hz and SineAmp 1.0. At lower SNR, FBK again performs
very well. The limits seem to be SNR 2 with SineFreq 600 Hz and SineAmp 1, and SNR 1 with
SineFreq 600 and SineAmp 0.5. These limits approximately match the point where the trained
electrophysiologist would not hesitate to abort the patch and move on.
The next set of figures, [4.4.3.9] through [4.4.3.13], shows the performance of the
baseline correction in the frequency domain. In all but the extreme situations, the 60 or 600 Hz
component is eliminated, without visible alterations in the overall spectrum. At SNR 5, SineFreq
600 and SineAmp 0.5 (but, surprisingly, not SineAmp 1.0), both VK and FBK leave some 600 Hz
residual. At SNR 2 and SineFreq 600 Hz, the residual left by FBK is much more pronounced,
especially at SineAmp 1.0. At SNR 1, the leftover periodic component is serious, even at
SineFreq 60 Hz with SineAmp 0.5 or 1.0.
The next two figures, [4.4.3.14] and [4.4.3.15], show the re-estimation of conductance
parameters from baseline corrected data. The re-estimation error is approximately equal for
Viterbi-Kalman and Forward-Backward-Kalman. The parameters are mostly underestimated,
except outside the operational limits, where noise overestimation occurs. Again, the elimination of
baseline fluctuations seems to have an effect similar to that of a low-pass filter.
The last set of figures, [4.4.3.16] through [4.4.3.19], shows the estimation of kinetic
parameters from baseline corrected data. The four rate constants are plotted as a function of
SineAmp. The results parallel those obtained with stochastic baseline fluctuations: wherever the
baseline correction succeeds, the estimated kinetics are quite accurate. MPL is better than MIL or
MIP. MIL is somewhat better than MIP, and mostly when associated with Viterbi-Kalman.
[Figure [4.4.3.1]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.1]. Simulated data at SNR 5 with sinewave baseline artifacts of SineFreq 60 Hz and different SineAmp, idealized and baseline corrected with Viterbi-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Viterbi. D Data in A baseline corrected and idealized, with BaseStd 0.01. E Data in B baseline corrected and idealized, with BaseStd 0.02. There are no visible mistakes.
[Figure [4.4.3.2]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.2]. Simulated data at SNR 5 with sinewave baseline artifacts of SineFreq 600 Hz and different SineAmp, idealized and baseline corrected with Viterbi-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Viterbi. D Data in A baseline corrected and idealized, with BaseStd 0.05. E Data in B baseline corrected and idealized, with BaseStd 0.05. There are some visible mistakes for SineAmp 1.0, when the sinusoid upward slope can be mistaken for a channel opening.
[Figure [4.4.3.3]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.3]. Simulated data at SNR 5 with sinewave baseline artifacts of SineFreq 60 Hz and different SineAmp, idealized and baseline corrected with Forward-Backward-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Forward-Backward. D Data in A baseline corrected and idealized, with BaseStd 0.01. E Data in B baseline corrected and idealized, with BaseStd 0.02. There are no visible mistakes.
[Figure [4.4.3.4]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.4]. Simulated data at SNR 5 with sinewave baseline artifacts of SineFreq 600 Hz and different SineAmp, idealized and baseline corrected with Forward-Backward-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Forward-Backward. D Data in A baseline corrected and idealized, with BaseStd 0.05. E Data in B baseline corrected and idealized, with BaseStd 0.05. There are some visible mistakes for SineAmp 1.0, when the sinusoid upward slope can be mistaken for a channel opening.
[Figure [4.4.3.5]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.5]. Simulated data at SNR 2 with sinewave baseline artifacts of SineFreq 60 Hz and different SineAmp, idealized and baseline corrected with Forward-Backward-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Forward-Backward. D Data in A baseline corrected and idealized, with BaseStd 0.01. E Data in B baseline corrected and idealized, with BaseStd 0.02. There are some visible mistakes, as expected for this low SNR, almost irrespective of the sinusoid.
[Figure [4.4.3.6]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.6]. Simulated data at SNR 2 with sinewave baseline artifacts of SineFreq 600 Hz and different SineAmp, idealized and baseline corrected with Forward-Backward-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Forward-Backward. D Data in A baseline corrected and idealized, with BaseStd 0.05. E Data in B baseline corrected and idealized, with BaseStd 0.05. For SineAmp 1.0, the baseline tracking totally fails, as the sinusoid is mistaken for channel openings.
[Figure [4.4.3.7]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.7]. Simulated data at SNR 1 with sinewave baseline artifacts of SineFreq 60 Hz and different SineAmp, idealized and baseline corrected with Forward-Backward-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Forward-Backward. D Data in A baseline corrected and idealized, with BaseStd 0.01. E Data in B baseline corrected and idealized, with BaseStd 0.02. The idealization mistakes are as expected for this low SNR, with some additional ones caused by the sinusoid.
[Figure [4.4.3.8]: traces not reproduced; 50 ms and 2 ms resolution views with pA scale bars.]
Figure [4.4.3.8]. Simulated data at SNR 1 with sinewave baseline artifacts of SineFreq 600 Hz and different SineAmp, idealized and baseline corrected with Forward-Backward-Kalman. A Simulated data with SineAmp 0.5. B SineAmp 1.0. C Raw data without sinewave, idealized with Forward-Backward. D Data in A baseline corrected and idealized, with BaseStd 0.05. For SineAmp 0.5, there are significant mistakes caused by the sinusoid. Data in B, with SineAmp 1.0, could not be idealized.
[Figure [4.4.3.9]: spectra not reproduced; panels A-H plot Log10 power (pA^2/Hz) against log10 frequency (Hz).]
Figure [4.4.3.9]. Power spectra of simulated data at SNR 5, with sinewave interference with SineFreq 60 Hz and different SineAmp, baseline corrected. A Simulated data with SineAmp 0.5. E SineAmp 1.0. B, F Data without sinewave. C, G Data in A and E baseline corrected with Viterbi-Kalman. D, H Data in A and E baseline corrected with Forward-Backward-Kalman. Both VK and FBK completely eliminate the 60 Hz component.
[Figure [4.4.3.10]: spectra not reproduced; panels A-H plot Log10 power (pA^2/Hz) against log10 frequency (Hz).]
Figure [4.4.3.10]. Power spectra of simulated data at SNR 5, with sinewave interference with SineFreq 600 Hz and different SineAmp, baseline corrected. A Simulated data with SineAmp 0.5. E SineAmp 1.0. B, F Data without sinewave. C, G Data in A and E baseline corrected with Viterbi-Kalman. D, H Data in A and E baseline corrected with Forward-Backward-Kalman. Both VK and FBK completely eliminate the 600 Hz component.
Log10 (pA^2/Hz)
Log10 (pA^2/Hz)
3
2
1
0
-1
2
3
log10 (Frequency [Hz])
3
2
1
0
-1
1
2
3
log10 (Frequency [Hz])
3
2
1
0
-1
0
1
2
3
log10 (Frequency [Hz])
E
3
2
1
0
-1
4
0
B
1
2
3
log10 (Frequency [Hz])
4
F
3
2
1
0
-1
4
Log10 (pA^2/Hz)
0
Log10 (pA^2/Hz)
1
Log10 (pA^2/Hz)
Log10 (pA^2/Hz)
0
A
0
1
2
3
log10 (Frequency [Hz])
3
2
C
4
G
1
0
-1
4
0
1
2
3
log10 (Frequency [Hz])
4
Figure [4.4.3.11]. Power spectra of simulated data at SNR 2, with sinewave interference with SineFreq 60 Hz and different SineAmp,
baseline corrected. A Simulated data with SineAmp 0.5. E SineAmp 1.0. B, F Data without sinewave. C, G Data in A and E baseline
corrected with Forward-Backward-Kalman. The 60 Hz component is completely eliminated.
[Figure panels: power spectra, Log10 (pA^2/Hz) versus log10 (Frequency [Hz]).]
Figure [4.4.3.12]. Power spectra of simulated data at SNR 2, with sinewave interference with SineFreq 600 Hz and different SineAmp,
baseline corrected. A Simulated data with SineAmp 0.5. E SineAmp 1.0. B, F Data without sinewave. C, G Data in A and E baseline
corrected with Forward-Backward-Kalman. The 600 Hz component is very much reduced for SineAmp 0.5, but a significant residual
remains for SineAmp 1.0.
[Figure panels: power spectra, Log10 (pA^2/Hz) versus log10 (Frequency [Hz]).]
Figure [4.4.3.13]. Power spectra of simulated data at SNR 1, with sinewave interference with SineFreq 60 Hz and different SineAmp,
baseline corrected. A Simulated data with SineAmp 0.5. D SineAmp 1.0. B, F Data without sinewave. C, G Data in A and E baseline
corrected with Forward-Backward-Kalman. At this low SNR, the 60 Hz component is reduced but not eliminated.
[Figure panels: relative errors (Meas-True)/True for Std0, Amp1 and Std1 versus sine amplitude; Viterbi-Kalman and Forward-Backward-Kalman, at 60 Hz and 600 Hz.]
Figure [4.4.3.14]. Conductance parameter re-estimation from data simulated at SNR 5, with sinewave interference, at different SineFreq
and SineAmp, baseline corrected and idealized. A and C Simulated data with SineFreq 60 Hz. B and D Simulated data at 600 Hz. A and B
Data baseline corrected and idealized with Viterbi-Kalman. C and D Data baseline corrected and idealized with Forward-Backward-Kalman. The amplitude is estimated accurately. For SineAmp 0.5, the noise is underestimated by less than 5% for SineFreq 60 Hz and by
less than 10% for SineFreq 600 Hz. VK and FBK give nearly identical results.
[Figure panels: relative errors (Meas-True)/True for Std0, Amp1 and Std1 versus sine amplitude, at SNR 2 and SNR 1, for 60 Hz and 600 Hz.]
Figure [4.4.3.15]. Conductance parameter re-estimation from data simulated at SNR 2 and 1, with sinewave interference, at different
SineFreq and SineAmp, baseline corrected and idealized with Forward-Backward-Kalman. A and C Simulated data with SineFreq 60 Hz.
B and D Simulated data at 600 Hz. A and B SNR 2. C and D SNR 1. The amplitude and the noise are estimated accurately, except for SineFreq 600 Hz, where the correction fails.
[Figure panels: rate constants K12, K21, K23 and K32 (log scale) versus sine amplitude, estimated with MIL, MPL and MIP at 60 Hz and 600 Hz; TrueK12-TrueK32 shown for reference.]
Figure [4.4.3.16]. Kinetic parameter estimation from data simulated at SNR 5 with sinewave interference. Data simulated with different
SineFreq and SineAmp is baseline corrected and idealized with Viterbi-Kalman. The kinetic parameters are estimated with MIL and MIP
from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data, while thin lines
are parameters estimated from true data, using the same procedures. For 60 Hz, the correction does not cause a significant alteration in the
estimated kinetics. For 600 Hz, the correction significantly changes the kinetics for SineAmp 1.0.
[Figure panels: rate constants K12, K21, K23 and K32 (log scale) versus sine amplitude, estimated with MIL, MPL and MIP at 60 Hz and 600 Hz; TrueK12-TrueK32 shown for reference.]
Figure [4.4.3.17]. Kinetic parameter estimation from data simulated at SNR 5 with sinewave interference. Data simulated with different
SineFreq and SineAmp is baseline corrected and idealized with Forward-Backward-Kalman. The kinetic parameters are estimated with
MIL and MIP from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data,
while thin lines are parameters estimated from true data, using the same procedures. For 60 Hz, the correction does not cause any visible
change in the estimated kinetics. For 600 Hz, the correction changes the kinetics for SineAmp 1.0.
[Figure panels: rate constants K12, K21, K23 and K32 (log scale) versus sine amplitude, estimated with MIL, MPL and MIP at 60 Hz and 600 Hz; TrueK12-TrueK32 shown for reference.]
Figure [4.4.3.18]. Kinetic parameter estimation from data simulated at SNR 2 with sinewave interference. Data simulated with different
SineFreq and SineAmp is baseline corrected and idealized with Forward-Backward-Kalman. The kinetic parameters are estimated with
MIL and MIP from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data,
while thin lines are parameters estimated from true data, using the same procedures. For 60 Hz, the correction does not cause a significant
change in the estimated kinetics. For 600 Hz, the correction fails for SineAmp > 0.5.
[Figure panels: rate constants K12, K21, K23 and K32 (log scale) versus sine amplitude, estimated with MIL, MPL and MIP at 60 Hz and 600 Hz; TrueK12-TrueK32 shown for reference.]
Figure [4.4.3.19]. Kinetic parameter estimation from data simulated at SNR 1 with sinewave interference. Data simulated with different
SineFreq and SineAmp is baseline corrected and idealized with Forward-Backward-Kalman. The kinetic parameters are estimated with
MIL and MIP from idealized data and with MPL from raw data. The thick lines are the parameters estimated from baseline corrected data,
while thin lines are parameters estimated from true data, using the same procedures. For 60 Hz, the correction causes some change in the
estimated kinetics for SineAmp 1.0. For 600 Hz, the correction significantly changes the kinetics for SineAmp 0.5 and fails completely for
SineAmp 1.0.
4.4.4 Idealization and baseline correction of experimental data
After extensive testing with simulated data, the two baseline correction algorithms are
applied to real data. Real data, with real baseline fluctuations, deviates from the ideal hidden
Markov and linear Gaussian models in several ways. Non-stationarity is the first problem. Both
the conductance and kinetics parameters may change in time for various reasons. Changing
kinetics has a small effect on idealization but changing conductance levels has drastic
consequences. The effects of non-stationarity can be alleviated by segmenting the data and
running the algorithms on homogeneous individual segments. The second problem arises from
non-ideal models. The most critical deviation appears to be the presence of correlation in the state
noise. The correlation has two potential sources: over-filtering (a filter cutoff well below the Nyquist frequency), and intrinsic coloration arising from the physics of the channel. Additionally, the noise is not precisely Gaussian, exhibiting occasional spikes, heavy tails, etc. The
linear Gaussian model description of baseline fluctuations is only phenomenological, while the
real fluctuations may vary from patch to patch and not remain stationary inside a record. All of
this results in non-optimal functioning of both the hidden Markov and linear Gaussian algorithms.
On the other hand, the exquisite performance of the algorithm on sinewave interference suggests
that the method is not sensitive to details of the noise model.
I tested the algorithms on many different experimental data files. Here, I show results
with one representative data set, provided by Dr. Asbed Keleshian, of the University at Buffalo.
The data in Figure [4.4.4.1] is a single channel recording of Asbed??? channels. Visual inspection
of the data provides clear evidence of conductance and kinetic non-stationarity. Thus, I have
segmented the data into six traces, and each segment was idealized separately. The kinetics were
estimated from all the segments together. The data requires at least two closed states, but probably only one open state. I therefore chose the C-O-C model for idealization and kinetics estimation. The SNR is ≤ 5.
Figure [4.4.4.2] shows idealization obtained with Viterbi-Kalman and Figure [4.4.4.3]
with Forward-Backward-Kalman. As expected at this SNR, both algorithms perform similarly,
and make no obvious mistakes. Notice, however, the misclassification at the beginning of some
traces, committed by both VK and FBK. The idealization was totally automatic, and the initial
state probabilities were not manually initialized for each trace. The uncertainty at the beginning
of a trace may result in misclassification, but the algorithms quickly recover once evident
transitions are reached. This recovery is a very important property, guaranteeing that local
mistakes do not propagate. This is satisfactory for equilibrium data, but suggests that manual
intervention on starting values would be helpful for non-equilibrium data, i.e. data whose
evolution depends upon the proper starting values.
The non-stationarity in amplitude, kinetics, and baseline is clear in Figures [4.4.4.2] and [4.4.4.3], especially in trace B of the baseline corrected data. The amplitude of the open
conductance class decreases by about 30%, from ~3 pA to ~2.5 pA. I expected a BaseStd value
of 0.01-0.02 to be optimal. Since BaseStd is relative to the difference in conductance amplitudes, I scaled it up to a value of 0.04. Indeed, this value gave optimal results. Below this value,
the correction was not complete, and above it over-correction occurred (results not shown). Rate
constants were estimated with MIL, and are consistent whether the data was idealized with
Viterbi-Kalman or with Forward-Backward-Kalman.
[Figure panels: A-C current traces (scale bars: 5 pA, 200 ms; 5 pA, 20 ms; 2 pA, 10 ms); D three-state model 1-2-3 with rates 100.0, 100.0 (1-2) and 1000.0, 1000.0 (2-3).]
Figure [4.4.4.1]. File 02n28042, raw data from ?? channels. Sampling 20 kHz, bandwidth ?? Hz. A Medium resolution view. B Low
resolution view. C High resolution view. D Model used for idealization and baseline correction.
[Figure panels: A-C corrected traces (scale bars as in Figure [4.4.4.1]); D model 1-2-3 with MIL-optimized rates 350.25, 315.12 (1-2) and 761.83, 6304.62 (2-3); E marks a misplaced baseline.]
Figure [4.4.4.2]. Raw data of Figure [4.4.4.1], idealized and baseline corrected using Viterbi-Kalman, with BaseStd 0.04. Each trace was
idealized separately to minimize errors due to non-stationarity. A Medium resolution view. B Low resolution view. C High resolution view.
D Model optimized by MIL, dead time 0.08 ms. E Misplaced baseline at the beginning of trace.
[Figure panels: A-C corrected traces (scale bars as in Figure [4.4.4.1]); D model 1-2-3 with MIL-optimized rates 317.31, 280.98 (1-2) and 621.48, 5025.27 (2-3); E, F mark misplaced baselines.]
Figure [4.4.4.3]. Raw data of Figure [4.4.4.1], idealized and baseline corrected using Forward-Backward-Kalman, with BaseStd 0.04. Each
trace was idealized separately. A Medium resolution view. B Low resolution view. C High resolution view. D Model optimized by MIL,
dead time 0.08 ms. E, F Misplaced baseline at the beginning of trace.
4.5 Discussion
Two new baseline correction algorithms, Viterbi-Kalman and Forward-Backward-Kalman, are tested with simulated and real single channel data. While optimal for stochastic
baseline fluctuations generated by a simple linear Gaussian model, they also correct very well for
sinewave interference and baseline artifacts found in real data.
One might be tempted to re-estimate the BaseStd parameter from data, using the Kalman
filter-smoother re-estimation formulae. That would eliminate the arbitrariness of choosing
BaseStd. I have tested the re-estimation of BaseStd, with negative results (not shown). For
artifacts simulated with the linear Gaussian model, BaseStd is in general re-estimated correctly,
although in some cases it fails, returning a zero that prevents correction. Any deterministic
artifact, such as drift or sinewave interference, causes the re-estimation to fail completely: the total variance (signal plus state noise plus baseline) is transferred into the baseline process variance, and the Markov signal is treated as noise or baseline fluctuations. Segmenting the data
into small chunks will reduce the effect of the deterministic components.
Currently, BaseStd must be an empirical choice. A BaseStd of 0.01-0.02, relative to a
difference in conductance class amplitudes of 1, seems to be a good choice for most situations.
BaseStd must be linearly scaled for other values of the conductance difference. The methods can
also correct data containing substates (results not shown). Small values of BaseStd may not
correct all baseline fluctuations, but they do not affect re-estimation of the conductance
parameters. Large values of BaseStd will result in underestimated state noise, and possibly in overcorrection, when Markov state transitions are mistaken for abrupt changes in the baseline.
Importantly, both methods are robust and recover quickly from local misclassifications, unlike
some heuristic tracking algorithms. The larger the difference in noise between the closed and the open states, the lower the probability of misclassification. Also, longer data sets are less prone to errors than short ones, since the Markov estimates are done globally.
Further work is necessary to eliminate the misclassification errors that sometimes occur
at the beginning of data records. If manual assignment of initial state probabilities is possible,
such errors will not occur. One possible solution could be an initialization procedure: a short segment at the beginning of the record is processed forward. The state probabilities at its last point are used as initial state probabilities for a reversed (right-to-left) forward pass over the same segment. The state probabilities at the final point of this reversed pass can then be used as the starting probabilities of the regular algorithm.
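A minimal sketch of this initialization procedure, in Python with hypothetical helper names (not the thesis code; the transposed transition matrix is used here as a crude reverse-time kernel, which is an assumption of this sketch):

```python
def forward_step(p, A, b):
    """One forward-filter update: propagate p through A, weight by the
    per-state observation likelihoods b, and renormalize."""
    n = len(p)
    q = [sum(p[i] * A[i][j] for i in range(n)) * b[j] for j in range(n)]
    s = sum(q)
    return [x / s for x in q]

def forward(p, A, B_obs):
    """Run the forward procedure over a sequence of likelihood vectors."""
    for b in B_obs:
        p = forward_step(p, A, b)
    return p

def initial_probs(A, B_obs, n_init=50):
    """Hypothetical initialization sketch: a forward pass over a short leading
    segment, then a reversed (right-to-left) forward pass over the same
    segment; the result serves as the regular algorithm's starting vector."""
    n = len(A)
    p = [1.0 / n] * n                           # uninformative start
    seg = B_obs[:n_init]
    p = forward(p, A, seg)                      # left-to-right pass
    A_rev = [list(col) for col in zip(*A)]      # A transposed (assumption)
    return forward(p, A_rev, list(reversed(seg)))
```

The returned vector would replace the uniform initial state probabilities of the regular forward pass.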
Forward-Backward-Kalman clearly has a much wider operational range than Viterbi-Kalman. The wider range comes at the price of computational complexity and speed. VK has the
same complexity as standard Viterbi, with the addition of a simple Kalman filter step for each
pair of Markov states, for each data point. Roughly, VK takes twice as long as Viterbi. FBK, on
the other hand, takes much longer than Forward-Backward. In the initialization step, state
inference at each data point is a calculation repeated until convergence. If convergence is reached, on average, in ten iterations, the initialization step is approximately equivalent to running the Forward procedure and the Kalman filter ten times each. Afterwards, a Forward-Backward procedure and a Kalman filter-smoother are alternated until convergence. If convergence is reached, for example, in ten iterations, the overall cost of the Forward-Backward-Kalman algorithm is roughly 30 times that of a standard Forward-Backward and a Kalman filter-smoother taken together. It is necessary to choose the convergence criteria carefully to reach a
useful compromise between accuracy and speed.
Future work should investigate whether it is possible to incorporate correction for correlated noise into the Forward-Backward-Kalman algorithm. Instead of using meta-states and thus exponentially increasing the Markov model size, the noise correlation and filtering can be treated by extending the state of the linear Gaussian model, at possibly a much smaller computational
cost. The same expanded LGM state can be used for removal of sinewave interference with an
extended Kalman filter.
5. Infinite state-space hidden Markov models. Application to single
molecule fluorescence data from kinesin
This chapter is devoted to the introduction of hidden Markov models to the study of
staircase data from single molecules. The focus is on fluorescence measurements of kinesin
molecular steps, prompted by collaborations with Dr. Paul Selvin of the University of Illinois.
The molecular motor mechanism of kinesin is reviewed by Vale and Milligan (2000). Briefly,
conventional kinesin is a highly processive molecular motor that is involved in membrane
transport, mitosis and meiosis, messenger RNA and protein transport, ciliary and flagellar
genesis, signal transduction, and microtubule polymer dynamics (Goldstein and Philp, 1999). A
single molecule of kinesin can take several hundred steps along a microtubule without detaching
(Howard et al, 1989; Block et al, 1990), stepping from one tubulin subunit to the next (a distance
of 8 nm) (Svoboda et al, 1993; Visscher et al, 1999). The unidirectional movement is caused by a
large conformational change in the “neck linker” region of kinesin (Rice et al, 1999). The neck
linker is mobile when kinesin is bound to microtubules that are nucleotide-free or in ADP-bound
states, but becomes docked to the catalytic core after binding an ATP. The energy released in
ATP binding drives a forward motion of the neck linker (and of any object attached to its COOH-terminus). Thus, when ATP binding causes the neck linker of the forward head to dock, the
trailing head detaches from its binding site and is thrust forward to the next tubulin binding site.
The force produced is enough to pull the attached cargo forward by 8 nm. Figure [5.1], from Vale
and Milligan (2000), presents the succession of molecular steps.
Recent techniques have made possible visualization of individual molecules of kinesin
moving along the microtubule, with a resolution of a few nanometers (Selvin, 2002). Typically, a
fluorophore is attached to kinesin (onto the head or body), and a CCD camera on a high
magnification optical microscope records the position of the fluorescence signal. The
microtubules are fixed with respect to the microscope optical field. In off-line analysis, individual
molecules are localized in the field, and their trajectory is traced from frame to frame. The
distance between the present and the starting position of a molecule, projected along the tubulin
axis, is plotted as a function of time. Since kinesin moves in discrete forward steps, the resulting
data has a staircase appearance. Backward steps may occur occasionally as predicted for all
reversible reactions. The duration of each step has a stochastic nature, with an exponential
distribution. The magnitude of individual steps is generally assumed constant. Due to finite
sampling, the molecule may jump two or more steps between data points.
The presence of discrete levels in the data and their lifetime exponential distribution
(Okada and Hirokawa, 1999) motivated me to attempt to analyze the kinesin data with hidden
Markov models. However, direct application of standard Markov model algorithms to the kinesin
staircase data is not quite straightforward. The problem resides in the definition of states. The
kinesin molecule exists in a number of different conformations. Instantaneous transitions (at the
time resolution of the experiment) are possible between some of these kinetic states.
Occasionally, the molecule jumps to the next (or previous) position along the microtubule. The
jump originates in one kinetic state and terminates in the same or a different kinetic state. At each
position, the same set of states is available. Clearly, transitions between states without a position
jump cannot be substantiated directly, although they may be inferred from the statistics of the
state lifetimes. The kinetic states are aggregated into a single class. This is a major difference
from ion channels, which have (by definition) at least two distinct conductance classes.
Fortunately, the jump transitions can be observed. A description of the stepping process must
include a set of kinetic states and a number of transitions, with and without position change. The
occurrence of a step has to be somehow registered in the model. It is possible to fit small data sets
with our standard single channel programs (Jeney et al, 2000). Each position is simply called a
state. However, for larger data sets that require hundreds of states, that approach becomes too
computationally intensive.
Given a kinetic scheme for kinesin reactions, several quantities need to be estimated. The
amplitude of a single jump can be described, for example, by a Gaussian probability distribution
with the standard deviation resulting primarily from instrumental errors in measuring the position.
There may also be a contribution from the flexibility of kinesin itself. To complicate matters,
however, the step distribution may have several peaks (a convolution of a multinomial with a
Gaussian), since the substrate consists of repeated units, and the motor might reach over a number
of units in one step. The step parameters are analogous to the conductance parameters of ion
channels. As for ion channels, the rate constants governing state transitions (including those
resulting in position change) must be estimated.
An algorithm performing state inference and parameter re-estimation is necessary for
analysis of data sets consisting of many jumps. I assign each data point probabilistically to a
discrete level, and that level is located a known number of jumps from the starting position. With
level assignment, the average amplitude of each level is calculated. The averages are further
processed to determine the single jump distribution (assuming a unimodal distribution). These
levels, corresponding to different positions along the microtubule, must be treated as separate
discrete Markov states, or groups of aggregated states. From the inferred state sequence, Markov
model-based algorithms can, in principle, obtain maximum likelihood estimates of the transition
rates. Difficulties occur for a number of reasons. Since the labeling fluorophore has a limited
lifetime, the duration of a fluorescence record is finite. Nonetheless, finite does not equate to
small. The number of jumps occurring during the record is not known a priori.
In the present work, I model kinesin with an aggregated hidden Markov model with an
infinite number of states. The infinite-state model is truncated to a finite size having core states
with certain properties described in the following section. I use the truncated model to idealize the
data with a modified Forward-Backward procedure, similar to the Forward-Backward-Kalman
algorithm presented in Chapter 4. The idealization is used to determine the parameters of a single
jump. The kinetics parameters are also estimated from the idealization, using a modified version
of the MIP algorithm presented in Chapter 3.
Figure [5.1]. Motility cycles of conventional kinesin. A Each catalytic core (blue) is bound to a tubulin
heterodimer, along a microtubule protofilament. ATP binding to the leading head initiates neck linker
(red) docking. B Neck linker docking is completed by the leading head (yellow), which throws the trailing
head forward by 16 nm (arrow) toward the next tubulin binding site. C After a random diffusional search,
the new leading head docks tightly onto the binding site, and releases ADP. The trailing head hydrolyzes
ATP to ADP-Pi. D After ADP dissociates, an ATP binds to the leading head and the neck linker begins to
zipper onto the core (orange). The trailing head has released its Pi and detached its neck linker (red) from
the core and is in the process of being thrown forward. Adapted from Vale and Milligan, 2000.
5.1 Infinite state-space hidden Markov model
The kinesin molecule is assumed to have NK kinetic states, the set of which represents an
aggregation class. After each step (including the steps missed due to finite sampling), the
molecule enters a new aggregation class. A data set with NJ jumps requires NC = NJ + 1 classes,
and a total of NC*NK states, which differ either by the kinetic index k, or by the class (position)
index c. The resulting Markov model has a Q matrix of dimension (NC*NK) x (NC*NK). The Q
matrix has some interesting properties. The kinetic states in class c are directly connected only to
kinetic states in class c+1 (forward movement), and possibly to kinetic states in class c-1 (backward movement, less likely). An equivalent kinetic scheme is that of a chain with identical
units. A unit corresponds to an aggregation class, and is connected only to its right, and possibly,
left neighbor. The reaction flow along the unit chain can proceed left-to-right, or it can be
bidirectional. The chain can be, in principle, considered infinite in length.
If the complete state (kinetic state and class) at time t is known, the position at t+1 is
expected to be in the neighborhood. Within a sampling interval Δt, the molecule can take any number of steps, back and forth. The probability of finding the kinesin at positions further and further away decreases (multi-)exponentially. If the sampling interval is short relative to the kinetics, the probability becomes practically zero after a small number of steps. In other
words, at time t the probability mass function (pmf) is concentrated in the kinetic states of class c.
After one sampling interval, the pmf would have spread into the neighborhood of c, i.e. c+1, c+2
etc. Any position would have a finite occupancy probability, but it would be practically zero for
positions sufficiently far from c.
One cannot help making the analogy with the passing of an impulse signal through a filter. After passing through the filter, the signal is spread out. Although a theoretically infinite number of digital FIR coefficients would be necessary to describe the response, a finite number
gives adequate precision. The role of the filter coefficients is played here by the transition
probability matrix A. The impulse signal represents the state probabilities vector at time t and the
response is the state probabilities vector at time t+1 (the T symbol denotes transposition):
Pt 1  Pt  A
T
T
[5.1.1]
A truncated, finite size Atr matrix can replace the infinite A matrix. The truncated Atr is,
obviously, obtained from a truncated Qtr:
Atr = exp(Qtr · Δt)                  [5.1.2]
The truncated Qtr is built from a small, finite number of units, and it has a banded block structure. The banding results from the connectivity: only adjacent units are directly connected (uni- or bidirectionally). The transitions between kinetic states inside a unit c, as well as jump transitions from unit c to c+1, are the same for any c. Therefore, the blocks of the Qtr matrix, from the top-left to the bottom-right corner, are identical:
no-jump:       q_{i_c, j_c} = q_{i_{c+k}, j_{c+k}}         ∀ k                 [5.1.3]
forward jump:  q_{i_c, j_{c+1}} = q_{i_{c+k}, j_{c+k+1}}   ∀ k                 [5.1.4]
               q_{i_c, j_{c+k}} = 0                        ∀ i, j, c, k ≥ 2    [5.1.5]
The effects of truncation on the Atr matrix should be minimal in the center block, if the
process is bidirectional, or in the top-left corner, if the process is unidirectional. Therefore, if r is
the truncation order, 2r+1 or r units are necessary, depending on reversibility.
Figures [5.1.1] and [5.1.2] show an example of a unidirectional chain, and the calculated
Qtr and Atr matrices, of different truncation orders. Only four units are represented, but the chain
is potentially infinite, to the left and to the right. There is only one kinetic state per unit. The
forward rate is 0.25 s-1. In Figure [5.1.1], the Qtr is formed by a virtual “copy” operation, from the
Q matrix of a larger (infinite) model. By definition (see Chapter 2), the sum of each row of a Q
matrix must be zero. When formed by “copying” from a larger model, the resulting Qtr matrix no
longer has this property. Consequently, the truncated Atr matrix is not row stochastic, but has the
property:
a^tr_ij = a^tr_kl   if   (j − i) = (l − k).    [5.1.3]
In Figure [5.1.2], the Qtr is formed as the Q matrix of a finite model. For example, the
Qtr of truncation order 4 is the actual Q matrix of the model in Figure [5.1.1], A. The sum of each
row remains zero (notice the last row having all elements zero). Consequently, the truncated Atr
matrix is row stochastic. Although it no longer has the property [5.1.3], it has a good
approximation of it. More details about infinite-state Markov models are available, for example,
in (Kemeny et al, 1966).
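As a numerical check of equation [5.1.2], the truncated Atr of the unidirectional chain in Figure [5.1.1] can be reproduced with a short pure-Python sketch (illustrative variable names; a Taylor series suffices here because the norm of Qtr·Δt is small):

```python
def mat_mult(X, Y):
    """Product of two square matrices stored as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(Q, dt, terms=30):
    """Taylor-series matrix exponential exp(Q*dt); adequate for the small
    norms used here (production code would use scaling-and-squaring)."""
    n = len(Q)
    M = [[Q[i][j] * dt for j in range(n)] for i in range(n)]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    A = [row[:] for row in term]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mult(term, M)]  # M^k / k!
        A = [[A[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return A

# Unidirectional chain of Figure [5.1.1]: forward rate 0.25 s^-1, sampling 0.5 s,
# truncation order 4; Qtr "copied" from the infinite model (rows need not sum to zero).
rate, dt, r = 0.25, 0.5, 4
Qtr = [[-rate if j == i else rate if j == i + 1 else 0.0 for j in range(r)]
       for i in range(r)]
Atr = expm(Qtr, dt)  # first row approx. [.882497, .110312, .006895, .000287]
```

The resulting Atr is upper triangular and Toeplitz (property [5.1.3]), and its rows sum to slightly less than 1, reflecting the probability leaked past the truncation boundary.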
A

  1 --0.25--> 2 --0.25--> 3 --0.25--> 4        (backward rates 0.0)

B

  Q := [ -.25   .25    0.     0.  ]
       [  0.   -.25   .25     0.  ]
       [  0.     0.  -.25    .25  ]
       [  0.     0.    0.   -.25  ]

  A := [ .882497  .110312  .006895  .000287 ]
       [  0.      .882497  .110312  .006895 ]
       [  0.       0.      .882497  .110312 ]
       [  0.       0.       0.      .882497 ]

C

  Q := [ -.25   .25    0.     0.     0.     0.  ]
       [  0.   -.25   .25     0.     0.     0.  ]
       [  0.     0.  -.25    .25     0.     0.  ]
       [  0.     0.    0.   -.25    .25     0.  ]
       [  0.     0.    0.     0.   -.25    .25  ]
       [  0.     0.    0.     0.     0.   -.25  ]

  A := [ .882497  .110312  .006895  .000287  .8977e-5  .2244e-6 ]
       [  0.      .882497  .110312  .006895  .000287   .8977e-5 ]
       [  0.       0.      .882497  .110312  .006895   .000287  ]
       [  0.       0.       0.      .882497  .110312   .006895  ]
       [  0.       0.       0.       0.      .882497   .110312  ]
       [  0.       0.       0.       0.       0.       .882497  ]
Figure [5.1.1]. A unidirectional chain, with one kinetic state. A Model with four units. B Qtr and Atr of
truncation order 4. C Qtr and Atr of truncation order 6. Sampling time 0.5 s. The Qtr matrices are
“copied” from an infinite model. The Atr matrices are calculated with Maple, and are not row-stochastic
(the sum of each row is not 1), but have property [5.1.3].
A

  1 --0.25--> 2 --0.25--> 3 --0.25--> 4        (backward rates 0.0)

B

  Q := [ -.25   .25    0.     0.  ]
       [  0.   -.25   .25     0.  ]
       [  0.     0.  -.25    .25  ]
       [  0.     0.    0.     0.  ]

  A := [ .882497  .110312  .006895  .000296 ]
       [  0.      .882497  .110312  .007191 ]
       [  0.       0.      .882497  .117503 ]
       [  0.       0.       0.       1.     ]

C

  Q := [ -.25   .25    0.     0.     0.     0.  ]
       [  0.   -.25   .25     0.     0.     0.  ]
       [  0.     0.  -.25    .25     0.     0.  ]
       [  0.     0.    0.   -.25    .25     0.  ]
       [  0.     0.    0.     0.   -.25    .25  ]
       [  0.     0.    0.     0.     0.     0.  ]

  A := [ .882497  .110312  .006895  .000287  .8977e-5  .2292e-6 ]
       [  0.      .882497  .110312  .006895  .000287   .9206e-5 ]
       [  0.       0.      .882497  .110312  .006895   .000296  ]
       [  0.       0.       0.      .882497  .110312   .007191  ]
       [  0.       0.       0.       0.      .882497   .117503  ]
       [  0.       0.       0.       0.       0.        1.      ]
Figure [5.1.2]. A uni-directional chain, with one kinetic state . A Model with four units. B Qtr and Atr of
truncation order 4. C Qtr and Atr of truncation order 6. Sampling time 0.5 s. The Qtr matrices are of a
finite model. Notice the last row is all zero. The Atr matrices are calculated with Maple. As opposed to
figure [5.1.1], the Atr matrices are row-stochastic (the sum of each row is 1), and do not have property
[5.1.3].
5.2 Kinesin stepping model
A fluorophore can be attached to different parts of the kinesin molecule, resulting in
different characteristics of the observed signal (Selvin, 2002). A fluorophore positioned on the
body results in identical jumps and in apparent kinetics appropriate to organelle transport rates. A
fluorophore positioned on the foot results in jumps that are twice as large and twice as slow. A
fluorophore attached to the neck results in jumps of different sizes, with the sum of two consecutive
steps equal to twice the foot-attached amplitude, and kinetics matching organelle transport. The
present work focuses on data produced with a fluorophore attached to the body, but the methods
can be extended to other experiments.
I model the measured step size as a stochastic quantity, with a Gaussian probability
distribution. I refer to the mean and standard deviation of the distribution as JumpAmp and
JumpStd, respectively. This distribution incorporates intrinsic variability in the physical position
of the probe as well as measurement errors associated with finding the centroid of the observed
intensity distribution. Counting fluorescence photons is a Poisson process, but if the number of
photons contributing to one measurement is large enough, the Poisson will be approximated by a
Gaussian. I will refer to the standard deviation of the measurement as MeasStd. I consider
MeasStd to be constant in time and position.
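The observation model just described can be sketched in a few lines. This is an illustrative Python fragment, not code from QuB; the function name and its arguments are my own labels for the quantities JumpAmp, JumpStd and MeasStd defined above.

```python
import random

def observe_position(level, base_amp, jump_amp, jump_std, meas_std, rng=random):
    """One measured fluorophore position: each of the `level` jumps taken so
    far has Gaussian size N(jump_amp, jump_std^2), and centroid estimation
    adds Gaussian measurement noise N(0, meas_std^2)."""
    position = base_amp
    for _ in range(level):                      # accumulate the jumps taken so far
        position += rng.gauss(jump_amp, jump_std)
    return position + rng.gauss(0.0, meas_std)  # add measurement noise
```

With both standard deviations set to zero, the measurement reduces to the noise-free staircase BaseAmp + level * JumpAmp.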
5.3 Algorithms for staircase data
I divide the staircase data analysis into two parts. The first part idealizes the data and re-estimates jump distance parameters. The second part estimates the kinetics from the idealized
data. The next sections describe the two algorithms, one for idealization and the other for kinetics.
5.3.1 Position estimation – data idealization
The idealization is a general procedure by which the useful signal is extracted from noise.
The Viterbi algorithm (Viterbi, 1967; Forney, 1973) and the Forward-Backward (Baum et al,
1970) procedure idealize data produced by a hidden Markov model, when the model parameters are
known. For unknown parameters, the usual strategy is to run an Expectation-Maximization
algorithm (Dempster et al, 1977), such as Baum-Welch (Baum et al, 1970) or Segmental k-means
(Rabiner et al, 1986). An expectation step (Forward-Backward or Viterbi) performs state
inference and computes sufficient statistics, given current parameters. A maximization step
follows, re-estimating parameters from sufficient statistics. The two steps are iterated until a
convergence criterion is reached. An Expectation-Maximization algorithm is guaranteed to
converge to a local maximum of the likelihood function (Baum et al, 1970; Dempster et al,
1977).
For ion channels, the goal of idealization is to classify each data point to a conductance
class, with maximum probability. For kinesin staircase data, each point is similarly assigned to a
step position, or level, where each level corresponds to a defined position along the microtubule.
Thus, idealization eliminates the position uncertainty, but for each position the kinetic state of
kinesin is still undetermined, since aggregated states cannot be distinguished experimentally.
Consequently, for the purpose of kinetics, the likelihood function has to be marginalized over all
possible state sequences.
Several reasons prevent a direct application of a standard Markov state inference
algorithm to staircase data. The number of jumps in a record and, consequently, the number of
Markov states/classes, are not known a priori. The magnitude of a single jump is not known, and
it is not necessarily constant but may probabilistically differ from step to step. The variability of
the jump is either intrinsic to the physical process or it is an experimental artifact. Data may be
corrupted by steady, slow drift, caused by movement of the microscope stage relative to the CCD
camera. For this reason, even for exactly known jump amplitudes, the number of levels present in
data still cannot be calculated from the excursion of the signal between a minimum and a
maximum value. Due to the probabilistic nature of the jump and to the possible drift, the same
number of levels can result in different signal excursions.
A possible solution is to consider several values for the level count, idealize data for
each, and choose the number giving the most likely idealization, according to some criterion.
However, this method requires a possibly very large A matrix, and must be repeated several
times. A better solution is to devise a tracking procedure with a dynamically adjustable
reference, for detection of jumps, or, equivalently, levels. This is the procedure I propose next.
In principle, a jump detector must use a reference level. Inaccuracy in the current
JumpAmp estimate, mainly, and baseline drift render a detection procedure with static reference
ineffective. As tracking progresses, errors accumulate and false negative or false positive
mistakes in level detection are made. If, on the other hand, the reference level is adjusted
dynamically during tracking, the cumulative error is minimized. Furthermore, the reference level
can be used as the center unit of a truncated model, as described in the previous section. The
small, truncated model is translated along the data, and moved to the right (or to the left)
wherever a jump is detected. In effect, the algorithm cuts a subset of likely paths from the much
larger (infinite) set of possible paths. An auxiliary quantity can simply register, for each data
point, the offset of this truncated model.
I implemented this approach in a procedure similar to the Forward-Backward-Kalman
algorithm from Chapter 4, with the differences of using a truncated model and an offset variable.
I also tried a Viterbi-Kalman-like variant, which turned out to be unsatisfactory for the signal-to-noise ratio of the available experimental data. The procedure can be added to a Baum-Welch-type
algorithm, with appropriate modification of the re-estimation formulae. The next two sections
describe the new state inference procedure and the re-estimation.
5.3.1.1 Modified Forward-Backward-Kalman procedure for staircase data
Without loss of generality, I describe the procedure as it applies to a model with one
kinetic state per unit. Computation of sufficient statistics in the infinite state space is neither
possible nor necessary. Due to the properties discussed before, calculations can be done with a
truncated model. To accomplish this, a mapping between the states of the truncated model and the
states of the infinite model is needed. The index of the infinite model states is an integer number,
taking values between minus and plus infinity. The index of the truncated model states takes
values between 1 and 2r+1, where r is the truncation order (for more than one state per unit, the
state vector has 2rn+1 entries, where n is the number of states per unit). By definition, the infinite
model state with index zero corresponds to the starting level. I call this the reference state
(RefState). Initially, the middle state in the truncated model, with index (r+1), is mapped into
RefState. As the state inference procedure progresses forward, the mapping changes as new
levels are detected.
The mapping is realized with an additional quantity, ω (offset), which is a vector of the same length as the data. For each data point, ω_t indicates the state in the infinite model onto which the middle state of the truncated model is mapped. At the same time, ω_t is chosen so as to make the middle truncated state the most likely filtered estimate, i.e., the one with the maximum entry in the truncated α vector.

The combination of the state index i in the truncated model and the ω value maps into a state index l in the infinite model:

    l = i + ω_t                                                    [5.3.1.1.1]
The ω of the first point, ω_0, is initialized to -(r+1), and thus the middle truncated state, of index equal to (r+1), points to RefState:

    ω_0 = -(r+1)                                                   [5.3.1.1.2]

    (r+1) + ω_0 = 0                                                [5.3.1.1.3]

The state vectors (α, β and γ) are truncated to size 2r+1. They can be interpreted as potentially infinite in size but having only 2r+1 non-zero elements, for any t. The entries of the initial probabilities vector π (of the same 2r+1 size) are all zero, except for the middle state (or states, if more than one per unit), which is one:

    π_i = 1,  i = r+1
    π_j = 0,  j ≠ r+1                                              [5.3.1.1.4]
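The index bookkeeping above can be sketched as follows; this is an illustrative fragment with my own function names, using the 1-based state indices of the text.

```python
def initial_offset(r):
    """omega_0 = -(r+1): the middle truncated state (1-based index r+1)
    initially maps onto the infinite-model reference state, index 0."""
    return -(r + 1)

def infinite_index(i, omega):
    """Equation [5.3.1.1.1]: truncated index i plus offset omega gives the
    infinite-model state index l."""
    return i + omega
```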
The amplitude of the RefState level is equal to BaseAmp. The other levels' amplitudes are calculated relative to RefState and BaseAmp:

    Amp_l = BaseAmp + l · JumpAmp                                  [5.3.1.1.5]

BaseAmp is a parameter of the idealization and can be initialized, for example, by manual selection. MeasStd can be initialized in the same way. The emission probabilities for any level are calculated as:

    b_l(y_t) = N(Amp_l, MeasStd) |_{y_t}                           [5.3.1.1.6]

The α's are initialized:

    α_0^i = π_i · b_{i+ω_0}(y_0)                                   [5.3.1.1.7]
and they are all zero, except for i = r+1. For any 0 < t ≤ T, the recurrence of α is:

    α_{t+1}^j = [ Σ_{i=1}^{2r+1} α_t^i · a_ij ] · b_{j+ω_{t+1}}(y_{t+1})    [5.3.1.1.8]
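One step of this truncated forward recursion can be written as a short sketch (illustrative names, plain lists in place of the model's matrices, emission values b precomputed for the current offset):

```python
def forward_step(alpha, a, b):
    """One step of the truncated forward recursion: the new entry j is
    (sum_i alpha_i * a[i][j]) * b[j], with i, j over the 2r+1 truncated
    states (0-based here)."""
    n = len(alpha)
    return [sum(alpha[i] * a[i][j] for i in range(n)) * b[j] for j in range(n)]
```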
The effects of truncation on the A matrix can be minimized by choosing the appropriate entries (the top-left corner or the middle, for left-right or reversible models, respectively):

    left-right:    a_ij = a^tr_{1, 1+(j-i)}                        [5.3.1.1.9]

    reversible:    a_ij = a^tr_{r+1, r+1+(j-i)}                    [5.3.1.1.10]
The truncated model must be centered on the current level. Thus the index of the maximum element of α is searched for. If the index of the maximum is not that of the middle state (r+1), then ω_t is adjusted to make it so. α is then re-calculated, and this is repeated until the condition is fulfilled.
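The recentering step can be sketched as below; an illustrative fragment (names are mine), using 0-based indices so the middle position is index r, with entries shifted in from outside the window set to zero.

```python
def recenter(alpha, omega, r):
    """Shift the truncated alpha window so its maximum entry sits at the
    middle position (0-based index r), updating the offset omega so that
    each entry keeps pointing at the same infinite-model state."""
    n = 2 * r + 1
    while True:
        k = max(range(n), key=lambda i: alpha[i])   # index of the filtered maximum
        shift = k - r                               # how far off-center it is
        if shift == 0:
            return alpha, omega
        # slide the window by `shift` levels and move the mapping with it
        alpha = [alpha[i + shift] if 0 <= i + shift < n else 0.0
                 for i in range(n)]
        omega += shift
```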
A backward step follows, which calculates β and γ. Certainly, there is no guarantee that the same state that maximized α will maximize γ too. However, it is very likely that the maximum will at least be within the truncated state space. The state which maximizes γ is used for level assignment, in order to obtain the idealization:

    Level_t = arg max_i (γ_t^i) + ω_t                              [5.3.1.1.11]
Unfortunately, the matter is complicated by imprecise knowledge of BaseAmp and, more
importantly, JumpAmp. Consequently, even in the absence of actual baseline drift, each new
level detected in the data determines an increase in the prediction error of that level’s amplitude.
After a number of steps, the state inference and level assignment fail. A simple yet effective
solution to this problem is to assimilate the uncertainty in BaseAmp and JumpAmp to a virtual
baseline drift. This drift is modeled with a linear Gaussian process, as in equation [4.2.6], used to
describe baseline artifacts in single channel data. The baseline offset at each point is estimated
with a Kalman filter. The data value at each point is corrected by subtracting the estimated
baseline from it:
    y_t := y_t − b_t                                               [5.3.1.1.12]
The net result of this procedure is that the level prediction error is kept under control.
Since the so-estimated baseline does not correspond to a real drift but it is rather a computational
trick, it should be ignored for the purpose of re-estimation. Presence of true baseline drift is a
different matter, which I do not address here.
The resulting algorithm is equivalent to the initialization step of Forward-Backward-Kalman, presented in Chapter 4. Obviously, the expectation-maximization steps may be run as
well, but preliminary results (not shown) did not indicate a significant improvement. Further
work and more experimental data are necessary in order to clarify this issue.
Another modification to the Forward-Backward-Kalman algorithm resulted in dramatic
improvement. Its purpose was to accelerate adaptation to a new level, but to do so only in the
vicinity of detected jumps. The modification is simple. The amplitude of a jump is modeled as a
random Gaussian variable, with JumpAmp mean and JumpVar variance. Thus, if a jump is
detected between the previous and the current points, the variance of a jump, JumpVar must be
added to the baseline variance. If not one, but k levels are jumped between two consecutive
points, then k times JumpVar is added. The corrected BaseVar is easily calculated with help
from the α variables:

    BaseVar := BaseVar + JumpVar · | (arg max_j α_{t+1}^j + ω_{t+1}) − (arg max_i α_t^i + ω_t) | · k_v    [5.3.1.1.13]
Clearly, when the state maximizing (with high score) α_{t+1} is equal to the state maximizing (with high score) α_t, no variance is added, as no jump is indicated. The absolute index does not matter. If the difference between the two maximizing states is ±1, then one jump variance is added, and so on. k_v is an empirical factor, which preliminary tests indicated to be necessary. The best results (not shown) were obtained with k_v = 0.1.
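The variance inflation rule reduces to a one-liner; this is an illustrative sketch (names are mine) of [5.3.1.1.13] once the level difference k between consecutive points has been extracted from the α maxima:

```python
def inflated_base_var(base_var, jump_var, k, k_v=0.1):
    """When |k| levels are jumped between two consecutive points, add
    |k| * JumpVar, scaled by the empirical factor k_v, to the baseline
    variance; k = 0 leaves it unchanged."""
    return base_var + jump_var * abs(k) * k_v
```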
5.3.1.2 Re-estimation
The modified Forward-Backward-Kalman described in the previous section computes the
most likely sequence of states, given a guessed set of parameters. The parameters must now be re-estimated from the inferred state sequence. The re-estimation must update BaseAmp, JumpAmp,
JumpStd, MeasStd, and, optionally, the Atr matrix.
The first step of re-estimation is to find the number of levels present in the data. Two
auxiliary quantities are calculated, MinLevel and MaxLevel. They represent the lowest and the
highest levels in the data. MinLevel could actually be higher or lower than zero. For each of
these levels, the mean amplitude, Ampl, is calculated:
    Amp_l = [ Σ_{t=0}^{T} y_t · γ_t^{l−ω_t} ] / [ Σ_{t=0}^{T} γ_t^{l−ω_t} ]    [5.3.1.2.1]
It is clear from the expression above that the baseline does not intervene in re-estimation. Of
course, if a data point is not in the truncated space centered on level l, it is excluded from
calculation of the corresponding amplitude. In other words, not being in the truncated space
implies a  equal to zero.
BaseAmp is re-estimated as the amplitude of the zero level:
BaseAmp  Amp0
[5.3.1.2.2]
Once the level amplitudes are known, the standard deviation of the measurement noise, MeasStd
is updated:
    MeasStd = [ ( Σ_{t=0}^{T} Σ_{l=MinLevel}^{MaxLevel} γ_t^{l−ω_t} · (y_t − Amp_l)² ) / ( Σ_{t=0}^{T} Σ_{l=MinLevel}^{MaxLevel} γ_t^{l−ω_t} ) ]^{1/2}    [5.3.1.2.3]
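The amplitude and noise re-estimation can be sketched as below. This is an illustrative fragment (function names are mine); for compactness the occupancies are stored per point as a mapping from level to γ weight, so a level outside the truncated window at point t simply has weight zero.

```python
def reestimate_amplitudes(y, gamma):
    """Level amplitudes as gamma-weighted averages of the data, [5.3.1.2.1].
    gamma[t] maps level -> occupancy probability of point t."""
    levels = sorted({l for g in gamma for l in g})
    amp = {}
    for l in levels:
        w = sum(g.get(l, 0.0) for g in gamma)
        amp[l] = sum(y[t] * gamma[t].get(l, 0.0) for t in range(len(y))) / w
    return amp

def reestimate_meas_std(y, gamma, amp):
    """MeasStd, [5.3.1.2.3]: root of the gamma-weighted mean squared
    deviation of each point from its level amplitude."""
    num = sum(g.get(l, 0.0) * (y[t] - amp[l]) ** 2
              for t, g in enumerate(gamma) for l in amp)
    den = sum(g.get(l, 0.0) for g in gamma for l in amp)
    return (num / den) ** 0.5
```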
The re-estimation of JumpAmp is possible in several ways. For example, JumpAmp
can be calculated as the average difference between any two consecutive levels. Alternatively,
JumpAmp can be calculated from the difference in amplitude of any two levels. These quantities
are evidently divided by the difference in level. I tested both methods but, as it gives much better
results, I write expressions for the second method only:
    JumpAmp = Average{ (Amp_l − Amp_k) / (l − k) },  {k, l | k < l}    [5.3.1.2.4]
The averaging in [5.3.1.2.4] must be somehow weighted, so that less represented levels
(or possibly not represented at all) get correspondingly less weight and vice-versa: better
represented levels get more. The weighting factor of a level is inversely proportional to its
variance, which in turn is inversely proportional to the number of points assigned to that level.
The point assignment is given by . The weighting factor of a difference of levels is inversely
proportional to the sum of the two levels’ variances. Thus, using the weighting factors wkl,
JumpAmp is calculated as:
    JumpAmp = [ Σ_{k=MinLevel}^{MaxLevel} Σ_{l=k+1}^{MaxLevel} w_kl · (Amp_l − Amp_k)/(l − k) ] / [ Σ_{k=MinLevel}^{MaxLevel} Σ_{l=k+1}^{MaxLevel} w_kl ]    [5.3.1.2.5]

    w_kl = ( Σ_t γ_t^{k−ω_t} · Σ_t γ_t^{l−ω_t} ) / ( Σ_t γ_t^{k−ω_t} + Σ_t γ_t^{l−ω_t} )    [5.3.1.2.6]
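The weighting logic, with each level variance taken proportional to the inverse of its soft point count, can be sketched as follows (an illustrative fragment, names are mine):

```python
def reestimate_jump_amp(amp, n):
    """Weighted JumpAmp re-estimation: each pair (k, l) of levels contributes
    its amplitude difference per level, weighted by w_kl = n_k*n_l/(n_k+n_l),
    the inverse of the summed level variances when each variance scales as
    1/n.  amp[l] is the mean amplitude of level l, n[l] its point count."""
    levels = sorted(amp)
    num = den = 0.0
    for a, k in enumerate(levels):
        for l in levels[a + 1:]:
            w = n[k] * n[l] / (n[k] + n[l])
            num += w * (amp[l] - amp[k]) / (l - k)
            den += w
    return num / den
```

With equal counts the weights cancel and the estimate reduces to the plain average of the pairwise slopes.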
The standard deviation of the jump, JumpStd, must, however, be calculated from the
differences between consecutive levels' amplitudes:

    JumpStd = [ ( Σ_{l=MinLevel}^{MaxLevel−1} w_{l,l+1} · (Amp_{l+1} − Amp_l − JumpAmp)² ) / ( Σ_{l=MinLevel}^{MaxLevel−1} w_{l,l+1} ) ]^{1/2}    [5.3.1.2.7]
The re-estimation of the truncated Atr matrix is slightly different from the standard
procedure. tij is the conditional probability that the state at time t is i and the state at time t+1 is j,
given the whole observation sequence (see expression [2.3.1.1.12]). I calculate  in a way that
satisfies equation [5.1.3]:
 tkl 
2 r 1

i
t
 aij  b j  yt 1    t j 1
i 1 , j 1
1i t  2 r 1
1 j  t 1  2 r 1
 j i  t 1 t l k 
[5.3.1.2.8]
k , l | 1  k  2r  1,1  l  2r  1
t | 0  t  T 
The Atr is re-estimated in the usual way and, due to normalization, the property [5.1.3] may be
lost.
5.3.2 Kinetic parameters estimation
I introduce here a maximum likelihood method which estimates kinetic parameters from
staircase data, using truncations of infinite-state hidden Markov models. It is, in principle,
identical to the MIP algorithm, with certain modifications in the likelihood function. The method
is applied to kinesin single molecule fluorescence data. The algorithm uses the idealization
obtained with the modified Forward-Backward-Kalman procedure. It estimates the rate constants
of conformational transitions and the rate constants of the jumps between positions. The
algorithm provides standard deviations of the kinetic estimates and is capable of fitting multiple
data sets across different experimental conditions. As with MIP, it permits the user to impose
linear constraints on the estimated rate constants. I subsequently refer to this algorithm as MIPS
(Maximum Idealized Point likelihood of Staircase data).
At each position, the kinesin molecule may exist in several kinetic states, connected by
transition rates. The jump from one position to the next (or previous) may originate, and may end,
in one or several kinetic states. The jump rates corresponding to different combinations of initial
and final states may be identical or different. Since the kinetic scheme is identical at each
position, a description of the kinetic model requires the conformational kinetic scheme, at one
position, and the jumping scheme. A minimal representation is possible with only two units.
Figure [5.3.2.1] gives an example of a conformational kinetic scheme with two states. The jump
between two consecutive positions can only take place from state B to state A. The kinetic
parameters are k1, k2, j1 and j2.
    unit at x_t:       A --k1--> B   (B --k2--> A)
    unit at x_{t+1}:   A --k1--> B   (B --k2--> A)
    jumps:             B(x_t) --j1--> A(x_{t+1}),   A(x_{t+1}) --j2--> B(x_t)

Figure [5.3.2.1]. Kinesin kinetic model, with two states per position. Two units are shown, as part of a larger (infinite) chain. The squares A and B represent the kinetic states within the same unit, connected by transition rates k1 and k2, identical for every unit. State B in one unit is connected to state A in the next unit by jump rates j1 and j2. The model has four parameters: k1, k2, j1 and j2.
5.3.2.1 Parameterization and constraints
The parameterization and the constraints take the same form as presented in Chapter 3,
and are not repeated here.
5.3.2.2 Likelihood function
First, the idealized data is differentiated. The resulting data points are 0, if there is no
jump, or k, if there is a k-order jump between two consecutive observations (or –k if the jump is
downwards, for reversible models). In the following, for easier understanding and without loss of
generality, I assume that the maximum jump detected in the data is of order ±1 (no multiple-position jumps). I also suppose the model is reversible, and that the Qtr and Atr matrices have a
truncation order r = 1. Therefore, the truncated model has only three units (left, middle and right).
All the matrices and vectors used below have three blocks, corresponding to the three units.
The likelihood of a sequence of idealized staircase data points, in matrix form, is:
    L = P_0^T · B_0 · S · (A · B_1 · S) · (A · B_2 · S) · (A · B_3 · S) ⋯ (A · B_T · S) · 1    [5.3.2.1.1]
The likelihood expression is quasi-identical to that in [3.1.2.1], used by MIP. I discussed there the interpretation of the function and the significance of each term, so here I point out only the differences. P_0^T is the (transposed) initial state probabilities vector. All its entries are zero except those corresponding to the middle unit, which are non-zero and sum to one.
Bt is a diagonal matrix, whose diagonal entries, block-wise, are either 0 or 1. For points
corresponding to a jump up (differentiated idealized data is 1), the left block entries are 1 and all
the others are zero. For points corresponding to a jump down (differentiated idealized data is -1),
the right block entries are 1 and all the others are zero. For points corresponding to no jump, the
middle entries are 1 and the others 0. At each point t, the appropriate Bt matrix is plugged into the
likelihood expression. The B and S post-multiplying P0 can actually be removed from the
expression, as the first data point is always 0 by definition.
S is a matrix that sums, block-wise, the state probabilities, moves the sum into the middle
block and clears the left and right blocks. Its purpose is to reset the state probabilities vector. A
brief explanation is in order. The likelihood calculation starts with the initial state probabilities
P0, which is the a priori knowledge of the initial state occupancies. Then, for each time t, Pt+1 is
first predicted by post-multiplying Pt with the truncated Atr matrix. The prediction is then
updated using the evidence (the data), by post-multiplication with the diagonal matrix Bt+1. The
updated Pt+1 is the a posteriori estimate. The entries in the state vector corresponding to the states
for which there is no evidence in the data are zeroed. The updated state probabilities vector must
now be reset, so that only the middle block states are occupied and the others are zero. All that
matters is the spread of probabilities between two data points, with respect to their relative levels.
The absolute level of the current or of the next point is irrelevant. For example, if from time t to
time t+1 there is a jump from level 13 to level 14, only a relative jump of +1 is registered. This is
achieved by the S matrix. Of course, computationally the same result can be achieved without the
S matrix, but this trick permits a consistent matrix form of the likelihood function and of its
derivatives, as indicated below. Finally, the state probability vector at the last point is postmultiplied with a column vector of ones, which sums the probabilities of each state. This sum is
equal to the likelihood. Obviously, the more the prediction matches the evidence, the higher the
value of the likelihood.
The structure of the resetting matrix S for a reversible model of truncation order r=1, with
two kinetic states per position is shown in Figure [5.3.2.1.1]. The function of the S matrix is
immediately obvious from the figure. The truncation order of the model should be at least equal
to the order of the maximum jump detected in the actual data. For example, if the maximum jump
is 3, then the model has 2*3+1 = 7 units, and all matrices are sized accordingly. If the model is
left-to-right, then the left units can be dropped, since there is never any evidence in the data for
backwards steps. Thus, 3+1=4 units are enough. Other problems posed by irreversible schemes
are discussed in later sections.
One must notice that in the computation of the likelihood function, the state probabilities
decrease rapidly, leading to numerical underflow. A way to prevent this is by re-normalizing the
state vector at each point and using the normalized vector for computation of the next point. The
likelihood is calculated as the product of the norms for each point. Naturally, since this quantity
will also be extremely small, the logarithm of the likelihood is calculated instead, as the sum of
the logarithms of the norms.
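The scaling scheme can be sketched as a short loop. This is an illustrative fragment, not the QuB implementation; matrices are plain lists of lists, b_diags[t] holds the diagonal of B_t, and the example values in the test are hypothetical.

```python
import math

def staircase_log_likelihood(p0, a, s, b_diags):
    """Scaled evaluation of the likelihood [5.3.2.1.1]: the state vector is
    renormalized after every point, and log L is accumulated as the sum of
    the logs of the norms, avoiding numerical underflow on long records."""
    def matvec(v, m):  # row vector times matrix
        return [sum(v[i] * m[i][j] for i in range(len(v)))
                for j in range(len(m[0]))]

    # first point: P0 * B0 * S, then normalize
    p = matvec([p0[i] * b_diags[0][i] for i in range(len(p0))], s)
    norm = sum(p)
    ll = math.log(norm)
    p = [x / norm for x in p]
    # remaining points: predict (A), update (B_t), reset (S), normalize
    for bd in b_diags[1:]:
        p = matvec(p, a)
        p = [p[i] * bd[i] for i in range(len(p))]
        p = matvec(p, s)
        norm = sum(p)
        ll += math.log(norm)
        p = [x / norm for x in p]
    return ll
```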
A
    V = [ .01  .03  .2  .11  .3  .25 ]

B
        [ 0.  0.  1.  0.  0.  0. ]
        [ 0.  0.  0.  1.  0.  0. ]
        [ 0.  0.  1.  0.  0.  0. ]
    S = [ 0.  0.  0.  1.  0.  0. ]
        [ 0.  0.  1.  0.  0.  0. ]
        [ 0.  0.  0.  1.  0.  0. ]

C
    V·S = [ 0.  0.  .51  .39  0.  0. ]

Figure [5.3.2.1.1]. The function of the resetting matrix S, for a reversible model of truncation order r=1 with two kinetic states per position. A State vector before multiplication by S. B S matrix. C State vector after post-multiplication with S. Block-wise, the state probabilities are summed, moved to the center block, and the left and right blocks are cleared.
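The construction of S can be reproduced directly from the figure; this is an illustrative fragment (function names are mine), which rebuilds the example above:

```python
def resetting_matrix(r, n_states):
    """Resetting matrix S for truncation order r and n_states kinetic states
    per unit: post-multiplying a state vector by S sums the probabilities
    block-wise into the middle block and clears the other blocks."""
    n_units = 2 * r + 1
    dim = n_units * n_states
    mid = r * n_states                      # first column of the middle block
    s = [[0.0] * dim for _ in range(dim)]
    for row in range(dim):
        s[row][mid + row % n_states] = 1.0  # each state feeds its middle-block twin
    return s

def postmultiply(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]
```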
5.3.2.3 Computation and derivatives of the likelihood function
The computation and the derivatives of the likelihood function are similar to those
described in Chapter 3, with the difference of introducing the S matrix. I omit the time
dependency of the A matrix, and present only the final results.
The recursive forward and backward variables α and β are:

    α_t^T = P_0^T · B_0 · S · (A · B_1 · S) · (A · B_2 · S) ⋯ (A · B_t · S)    [5.3.2.3.1]

    α_{t+1}^T = α_t^T · A · B_{t+1} · S                            [5.3.2.3.2]

    α_0^T = P_0^T · B_0 · S                                        [5.3.2.3.3]

    β_t = (A · B_{t+1} · S) ⋯ (A · B_T · S) · 1                    [5.3.2.3.4]

    β_t = A · B_{t+1} · S · β_{t+1}                                [5.3.2.3.5]

    β_T = 1                                                        [5.3.2.3.6]

The log-likelihood LL is calculated from the normalization factors:

    LL = Σ_{t=0}^{T} log s_t                                       [5.3.2.3.7]
The derivatives of the log-likelihood are:
    ∂log L/∂x = (∂P_0^T/∂x · B_0 · S) · β_0 + Σ_{t=0}^{T−1} α_t^T · (∂A^tr/∂x · B_{t+1} · S) · β_{t+1}    [5.3.2.3.8]
The details of the calculation are the same as in Chapter 3 and are not presented here.
5.3.2.4 Speeding up the computation
I use the same technique to speed-up the computation as described in Chapter 3, by the
use of pre-multiplications.
5.4 Results
I tested the algorithm with simulated and experimental data. The experimental recordings
are not actually produced by kinesin experiments, but utilize the same optical equipment and
generate the steps by stage translation controlled by a computer (Selvin, 2002). In order to
simulate staircase data, I modified the single channel simulator within QuB (www.QuB.buffalo.edu). The staircase simulator is available as a new option in QuB. The user
needs to specify two (for irreversible) or three units (for reversible models). Each time a transition
occurs between states in different units, the current level is incremented by +/-JumpAmp,
depending on whether the jump is upward or downward. Then the state vector is reset back into
the middle unit. Of course, several jumps may, probabilistically, occur between two simulated
points. I have used both irreversible (left-right only) and reversible models. The irreversible
models have the right-left jump rates equal to zero.
The experimental data files I had available for analysis (Selvin, 2002) are sampled at
500ms, and are between 46 and 176 (46, 47, 51.5, 55.5, 96.5, 100.5, 109.5, 176) seconds long.
Therefore, I have simulated data with the same sampling frequency, and duration of 90 seconds,
trying to be roughly in the same signal-to-noise range. I start the results section by testing the
kinetics estimation with simulated data. With the kinetics performance so established, I test the
idealization procedure. I idealize both experimental and simulated data. The idealization of the
simulated data is compared against the true idealization. The kinetics estimated from idealization
is compared against the kinetics of the true idealization. The idealization of experimental data is
visually inspected. Both the kinetics estimation and the data idealization algorithms are
implemented as features of the QuB software (www.QuB.buffalo.edu).
5.4.1 Kinetics estimation of simulated data
The kinetics estimation was tested mostly with simulated data produced by a simple
irreversible model, with one kinetic state per position. I checked the following characteristics of
the algorithm: bias of the estimate, standard error of the estimate, effect of Atr matrix truncation
order, effect of jump truncation and effect of return rate. If the algorithm is not biased, then the
estimated rates should equal the true values when the amount of data is infinite. The algorithm
provides a standard deviation of the estimate, which should be interpreted as the degree to which
the estimate can be trusted, rather than as how close the estimate is to the true values. It is of
course desirable that the standard error of the estimate decreases rapidly with the amount of data.
The model of the process is, in principle, infinite. It must, however, be truncated to a
finite size. The truncated Atr matrix introduces some errors, which can be lessened with a larger
truncation order. Reducing the truncation error in this way creates computational problems, however, as larger matrices are
more difficult to handle. The truncation order must be chosen to be at least equal to the magnitude
of the highest occurring jump, higher if possible. If, for some reason, the order must be smaller
than the highest jump, then differentiated idealized data must be truncated to that order. The jump
truncation is therefore another source of errors, which can be avoided by faster sampling at higher
bandwidth (however, higher bandwidth results in higher noise, which may be even less desirable).
I refer to the model truncation order as TruncOrder, and to the jump truncation order as
MaxJump. Both are parameters of the optimization algorithm.
The effect of the return rate must be evaluated because of an unfortunate limitation of the
matrix exponentiation procedure. The spectral expansion method (Colquhoun and Sigworth,
1995a) requires that the eigenvalues of the Q matrix be distinct, which is almost always the case
for ion channels. The irreversible models used for kinesin data have confluent (multiple)
eigenvalues, though. It would be too complicated to devise a different exponentiation technique,
so I approximated the return rates with a small finite value, rendering the model very slowly
reversible. The return rates are chosen such as to give distinct eigenvalues, yet to cause minimal
errors in the estimated rates. Obviously, if the model is known to be irreversible, these rates are
constrained to be fixed and equal to a small constant. For example, the Qtr matrix in Figure
[5.1.2], B, gives three confluent eigenvalues of -0.25, and one zero. By making the backward rate
0.001s-1, the three non-zero eigenvalues are ~(-0.2734, -0.2510, -0.2286). With a backward rate of
0.0001 s-1, the three non-zero eigenvalues are ~(-0.2572, -0.2501, -0.2430), close enough to the
confluent eigenvalues yet distinct.
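The quoted eigenvalues can be checked against the closed-form spectrum of a constant-rate birth-death generator (a textbook result, not a formula from the thesis, which obtains the eigenvalues numerically):

```python
import math

def perturbed_chain_eigenvalues(f, b, n):
    """Nonzero eigenvalues of the n-state chain generator once a small
    backward rate b is added to forward rate f:
    lambda_k = -(f + b) + 2*sqrt(f*b)*cos(k*pi/n), k = 1..n-1."""
    return sorted(-(f + b) + 2.0 * math.sqrt(f * b) * math.cos(k * math.pi / n)
                  for k in range(1, n))
```

For f = 0.25 and b = 0.0001 with n = 4 this reproduces the values ~(-0.2572, -0.2501, -0.2430) quoted above.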
For an irreversible model, the only parameters to be optimized are the forward rates.
They are subject to linear constraints: the correspondent transition and jump rates in and between
different units are identical. For example, for the model in Figure [5.3.2.1], truncated to four
units, there are only four free parameters: k1, k2, j1 and j2, if reversible, and only k1, k2 and j1, if
irreversible.
The model used for simulation is shown in Figure [5.4.1.1].
    1 --0.25--> 2        (backward rate 0.0)

Figure [5.4.1.1]. Irreversible model used for simulation, with one kinetic state per unit. Only two units are shown.
5.4.1.1 Bias of the estimate
The parameters used for simulation are in Table [5.4.1.1.1]. Ten data sets were generated
and the estimated rate for each is compared with the rate calculated from the true number of
jumps, returned by the simulator. The true rate is equal to the number of jumps divided by the
duration of the data.
    PtsPerSeg   SegCnt   Sampling [msec]   ForRate [sec⁻¹]   BackRate [sec⁻¹]
    2400        1000     62.5              0.25              0

Table [5.4.1.1.1]. Simulation parameters. PtsPerSeg is the number of points per segment. SegCnt is the number of segments. ForRate and BackRate are the forward and the backward rates, respectively.
The parameters of the optimization (kinetics estimation) are in Table [5.4.1.1.2]. A model
truncated to 5 units was used (Figure [5.4.1.1.1]) and the jumps in the data were truncated to 3
(although it so happened that no jump higher than 3 was simulated).
BackRate  TruncOrder  MaxJump
0.0001    5           3

Table [5.4.1.1.2]. Optimization parameters. BackRate is the backward rate, constrained to be
constant. Any jump of order higher than MaxJump is truncated to MaxJump.
[Figure: five-unit staircase model, states 1 → 2 → 3 → 4 → 5, forward rate 0.25, backward rate displayed as 0.0]

Figure [5.4.1.1.1]. Model used for optimization, truncated to five units. The backward rate is actually
0.0001 s-1 (but displayed as 0.0).
Figure [5.4.1.1.2] presents the results of kinetics estimation. Clearly, there is no obvious
bias in the rate estimate.
[Figure: estimated forward rate (K12, red) and true rate (TrueK12, blue) vs. data set 0-10, y range ~0.2475-0.2495]

Figure [5.4.1.1.2]. Bias of the estimate. The forward rate estimated with the algorithm is plotted in
red, while the true rate, calculated by dividing the known number of jumps by the data duration, is
plotted in blue. Most of the data overlap, so the two colors cannot be distinguished.
5.4.1.2 Precision of the estimate
I used the same procedure as in the previous section. Data sets were generated with 1, 10,
100 and 1000 segments. The standard error was calculated for each data set. Figure [5.4.1.2.1]
shows the logarithm of the standard error of the estimate to be a linear function of the logarithm
of the number of data points.
[Figure: StdErr(forward rate) vs. log10(segment count), log-log plot, linear trend]

Figure [5.4.1.2.1]. Precision of the estimate. The standard error of the forward rate estimate is
plotted as a function of the amount of data, in a log-log graph. The precision is that expected for
Gaussian errors.
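The straight line in the log-log plot reflects the expected 1/sqrt(N) scaling of the standard error. A hedged sketch: the jump count over a record of duration T is roughly Poisson with mean k*T, so the estimate n/T has standard error ~sqrt(k/T); the segment duration below assumes 2400 points at 62.5 ms, as in Table [5.4.1.1.1].

```python
import numpy as np

rng = np.random.default_rng(0)
k_true = 0.25                  # forward rate, s^-1
seg_dur = 2400 * 0.0625        # one segment: 2400 points at 62.5 ms = 150 s

for seg_cnt in (1, 10, 100, 1000):
    T = seg_cnt * seg_dur
    # rate estimate ~ (number of jumps)/(total duration); jumps ~ Poisson(k*T)
    est = rng.poisson(k_true * T, size=500) / T    # 500 replicate data sets
    print(seg_cnt, round(est.std(), 5))            # drops ~sqrt(10) per decade
```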
5.4.1.3 Effect of jump truncation
Statistically, a fraction of dwells will be shorter than the sampling time, resulting in
apparent higher-order jumps. All jumps of order higher than MaxJump are converted (made
equal) to MaxJump. The consequence is that the apparent number of transitions in the data is
lowered, and thus the estimated kinetics appears slower. In general, jump truncation can be
avoided by increasing the model dimension, but undesirable numerical issues then occur in the
computation of the matrix exponential and its derivatives.
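The truncation step can be sketched as a clip on the level differences of the idealized sequence (an illustrative interpretation; `truncate_jumps` is a hypothetical helper, not the thesis's implementation):

```python
import numpy as np

def truncate_jumps(levels, max_jump):
    """Clip the order of every jump in an idealized level sequence to max_jump."""
    levels = np.asarray(levels)
    jumps = np.clip(np.diff(levels), -max_jump, max_jump)
    return np.concatenate(([levels[0]], levels[0] + np.cumsum(jumps)))

print(truncate_jumps([0, 1, 1, 5, 6], 3))   # the 4th-order jump is clipped to 3rd
```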
To test the effect of MaxJump, I simulated data with slower sampling, to increase
the frequency of multiple jumps (Table [5.4.1.3.1]). I estimated kinetics with three values of the
MaxJump parameter: 1, 2 and 3 (Table [5.4.1.3.2]). The estimated forward rate is plotted as a
function of the number of jumps in the data that had to be truncated. Truncating jumps slightly
decreases the rate constant, as expected for missed events.
PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]
300        1000    500              0.25             0

Table [5.4.1.3.1]. Simulation parameters.
BackRate  TruncOrder  MaxJump
0.0001    5           variable

Table [5.4.1.3.2]. Optimization parameters.
[Figure: estimated forward rate (K12) and true rate (TrueK12) vs. % truncated jumps (8.00E-04, 0.025, 0.6), y range 0-0.25]

Figure [5.4.1.3.1]. Effect of jump truncation. The estimated forward rate is plotted against the
fraction of jumps that were truncated. The estimated rate is not very sensitive to jump
truncation.
5.4.1.4 Effect of A matrix truncation order
To determine the effect of Atr matrix truncation on the estimated kinetics, I used three
truncation orders: 5, 4 and 3 (the three models in Figure [5.4.1.4.1]). To make the comparison
meaningful, I limited the maximum jump to 3 and chose simulation conditions so as to lower
the number of higher-order jumps (Table [5.4.1.4.1]). Table [5.4.1.4.2] has the optimization
conditions.
The estimated forward rate constant is identical for all truncation orders (to 6 digits),
so I do not show the results. The Atr matrix truncation order has practically no effect, at least for
this simple model, and should be chosen as the minimum size that can accommodate the highest
jump.
PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]
2400       1000    62.5             0.25             0

Table [5.4.1.4.1]. Simulation parameters.

BackRate  TruncOrder  MaxJump
0.0001    variable    3

Table [5.4.1.4.2]. Optimization parameters.
[Figure: three staircase models. A: states 1-5; B: states 1-4; C: states 1-3; forward rate 0.25, backward rate displayed as 0.0]

Figure [5.4.1.4.1]. Optimization models with different truncation orders. A Truncation order 5. B
Truncation order 4. C Truncation order 3.
5.4.1.5 Effect of backward rate
Ideally, the return rate for the irreversible, left-to-right model should be zero. To avoid
confluent eigenvalues in the Qtr matrix, the return rate was assigned a small value and
constrained to be fixed with respect to optimization (it is not a free parameter). Table [5.4.1.5.1]
has the simulation parameters and Table [5.4.1.5.2] the optimization parameters. The model in Figure
[5.4.1.4.1], A, is used for optimization. Ten data sets were simulated. The forward rate was estimated
for each, for four values of the backward rate (0.1, 0.01, 0.001 and 0.0001 s-1). For each of these
values, the average forward rate was calculated and then plotted as a function of the backward rate.
Obviously, the larger the value of the backward rate, the larger the value of the estimated
forward rate. However, the error is very small, even for relatively large values of the backward
rate.
PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]
2400       1000    62.5             0.25             0

Table [5.4.1.5.1]. Simulation parameters.

BackRate  TruncOrder  MaxJump
variable  5           3

Table [5.4.1.5.2]. Optimization parameters.
[Figure: estimated forward rate (K12, red) and true rate (TrueK12, blue) vs. backward rate (1E-04 to 0.1), y range ~0.246-0.249]

Figure [5.4.1.5.1]. Effect of backward rate approximation. The estimated forward rate is plotted in
red, versus the backward rate. The true forward rate, calculated by dividing the number of jumps by
the data duration, is plotted in blue. The estimated forward rate is relatively insensitive to the value
of the backward rate.
5.4.1.6 Log-likelihood surface
I test here the shape of the log-likelihood surface as a function of the forward rate
constant. Table [5.4.1.6.1] has the simulation parameters and Table [5.4.1.6.2] the optimization
parameters. The model in Figure [5.4.1.4.1], A, is used for optimization. Five data sets were simulated.
The forward rate was estimated for each, and the average was assigned to k12*. The log-likelihood
corresponding to k12* is denoted LL*, and the log-likelihood corresponding to other values of
k12 is denoted LL. The difference LL-LL* is plotted in Figure [5.4.1.6.1] against the relative
difference (k12-k12*)/k12*. The likelihood surface is smooth and has only one maximum over a
large range of values (Figure [5.4.1.6.2]).
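The single-maximum shape can be rationalized with a back-of-the-envelope model: for an irreversible staircase, the total jump count n over duration T is approximately Poisson with mean k*T, so LL(k) = n*ln(k) - k*T (up to a constant), maximized at k* = n/T. A sketch with assumed numbers (n and T are illustrative, chosen to give k* = 0.25):

```python
import numpy as np

n, T = 9000, 36000.0                # assumed: ~9000 jumps over 36000 s
k_star = n / T                      # maximum-likelihood rate, 0.25 s^-1
for rel in (-1e-1, -1e-2, -1e-3, 0.0, 1e-3, 1e-2, 1e-1):
    k = k_star * (1.0 + rel)
    dLL = n * np.log(k / k_star) - (k - k_star) * T   # LL(k) - LL(k*)
    print(f"{rel:+8.0e}  {dLL:10.3f}")
# dLL is 0 at rel = 0 and strictly negative elsewhere: one smooth maximum.
```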
PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]
2400       1000    62.5             0.25             0

Table [5.4.1.6.1]. Simulation parameters.

ForRate   BackRate  TruncOrder  MaxJump
variable  0.0001    5           3

Table [5.4.1.6.2]. Optimization parameters.
[Figure: LL-LL* (0 to 250) vs. (k12-k12*)/k12* (-1e-1 to 1e-1, log-spaced)]

Figure [5.4.1.6.1]. Log-likelihood surface. On the x axis is the relative difference between the
forward rate k12 and the maximum likelihood forward rate k12*. On the y axis is plotted the
difference between the LL for k12 and the LL* for k12*.
5.4.1.7 Reversible model
I test here the behavior of the algorithm with a reversible model having one state per unit. Since
the kinesin stepping process is known to be biased in one direction, for simulation I chose the
backward rate to be ten times smaller than the forward rate. The two rates are now both free
parameters, to be optimized.
Figure [5.4.1.7.1] shows the reversible model used for simulation. In Figure [5.4.1.7.2] is
the simulation model as actually used in QuB. Table [5.4.1.7.1] has the simulation and Table
[5.4.1.7.2] has the optimization parameters. Figure [5.4.1.7.3] shows the reversible model used
for optimization, of truncation order 5.
10 data sets were simulated. The estimated forward and backward rates for each data set
are plotted in Figure [5.4.1.7.4]. Both rate estimates are precise and unbiased.
[Figure: two-state reversible model, 1 ⇄ 2, forward rate 0.25, backward rate displayed as 0.03]

Figure [5.4.1.7.1]. Reversible model used for simulation. The backward rate is actually 0.025 (but
displayed as 0.03).
[Figure: three-state model as entered in QuB, with forward rate 0.25 and backward rate displayed as 0.03]

Figure [5.4.1.7.2]. Model used for simulation in QuB. The backward rate is actually 0.025 (but
displayed as 0.03).
PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]
300        1000    500              0.25             0.025

Table [5.4.1.7.1]. Simulation parameters.
TruncOrder  MaxJump
9           3

Table [5.4.1.7.2]. Optimization parameters.
[Figure: nine-unit reversible staircase model, states 1 ⇄ 2 ⇄ … ⇄ 9, forward rate 0.25, backward rate displayed as 0.03]

Figure [5.4.1.7.3]. Model used for optimization. The backward rate is actually 0.025 (but
displayed as 0.03).
[Figure: estimated and true forward and backward rates (K12, TrueK12, K21, TrueK21) vs. data set 0-10, y range 0.00-0.25]

Figure [5.4.1.7.4]. Rate estimation for a reversible model. The estimated forward and backward
rates are plotted with red and magenta symbols, respectively. The true forward and backward rates,
calculated by dividing the known number of jumps (up and down, respectively) by the duration of
the data, are plotted in blue and green, respectively. Most of the data overlap, so the two colors
cannot be distinguished, for either the forward or the backward rate.
5.4.2 Data idealization
The quality of idealization affects both the jump parameters and the kinetic parameters.
For staircase data, correct re-estimation of jump parameters is critical. If JumpAmp is
underestimated, more apparent jumps are detected in the data and the estimated kinetics is faster.
The opposite happens when JumpAmp is overestimated. Whether under- or overestimation
occurs, it is at least desirable that level tracking remains reasonably good, and only occasional
false positive or false negative jumps are detected.
The parameters that affect idealization are the signal-to-noise ratio and the rate of the kinetics
relative to the sampling frequency (assumed to be at the Nyquist limit). The signal-to-noise
ratio is rigorously defined as the ratio between the power of the signal and the power of the noise.
There are three sources of noise or uncertainty: the variance of the jump (JumpStd), the variance
of the measurement (MeasStd) and possible baseline drift. As in Chapter 4, I use here a more
intuitive definition of SNR, namely the ratio between the average jump (JumpAmp) and the
measurement standard deviation (MeasStd):
SNR = JumpAmp / MeasStd    [5.4.2.1]
The definition does not consider the intrinsic jump variance, and thus a non-zero JumpStd
lowers the apparent SNR. So does, of course, any drift or fluctuation in the baseline.
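A hedged sketch of how jump variance lowers the apparent SNR, folding JumpStd into the noise term in quadrature (this combined form is my illustration, not a formula from the text):

```python
import numpy as np

def apparent_snr(jump_amp, meas_std, jump_std=0.0):
    """SNR = JumpAmp/MeasStd by definition [5.4.2.1]; a non-zero JumpStd adds
    (in quadrature) to the effective noise seen at each jump."""
    return jump_amp / np.hypot(meas_std, jump_std)

print(apparent_snr(1.0, 0.1))        # 10.0  -- nominal SNR
print(apparent_snr(1.0, 0.1, 0.2))   # ~4.47 -- jump variance degrades it
```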
Experimental data were generated in Dr. P. Selvin’s laboratory with one-state-per-unit,
irreversible models, at different signal-to-noise ratios. Although no baseline drift was deliberately
added, some drift is visible. Similarly, the jumps were intended to be of constant amplitude;
however, visual inspection clearly shows variability in the jump amplitude. This is probably due to
the resolution of the digital-to-analog converter that sends the control pulse to the piezoelectric
actuator moving the microscope stage, or to various frictional errors in the stage. Testing the
idealization algorithms demands simulated data with the same characteristics as the available
experimental records. In the next two sections I present results for idealization of simulated data
and of experimental data.
5.4.2.1 Idealization of simulated data
As the experimental data were generated with a one-state, irreversible model, the
simulation used the same model (Figure [5.4.2.1.1]). One data set was generated for each of the
simulation conditions shown in Table [5.4.2.1.1]. Figure [5.4.2.1.2] gives visual clues as to the
appearance of these simulated data sets. The seed of the random number generator used by the
simulator was the same in each experiment, so the generated state (level) sequences were
identical, irrespective of the parameters used in simulation. The forward rate constant was
estimated from the true idealization (an output of the simulator). The parameters of optimization
and the results are in Table [5.4.2.1.2]. The 5-unit model in Figure [5.4.1.4.1], A, was used for
optimization. The idealization and optimization parameters common to all experiments are given
in Table [5.4.2.1.3].
[Figure: two-unit staircase model, states 1 → 2, forward rate 0.25, backward rate 0.0]

Figure [5.4.2.1.1]. Irreversible model used for simulation, with one kinetic state per unit. Only two
units are shown.
SNR   PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]  JumpAmp  JumpStd  MeasStd
10    300        99      500              0.25             0                 1        0.0      0.1
10*   300        99      500              0.25             0                 1        0.1      0.1
10**  300        99      500              0.25             0                 1        0.2      0.1
5     300        99      500              0.25             0                 1        0.0      0.2
5*    300        99      500              0.25             0                 1        0.1      0.2
5**   300        99      500              0.25             0                 1        0.2      0.2
2     300        99      500              0.25             0                 1        0.0      0.5
2*    300        99      500              0.25             0                 1        0.1      0.5
2**   300        99      500              0.25             0                 1        0.2      0.5
1     300        99      500              0.25             0                 1        0.0      1.0

Table [5.4.2.1.1]. Simulation parameters.
TruncOrder  MaxJump  BackRate  ForRate   StdDev      StdErr  TruncCount  LL
5           3        0.0001    0.243106  0.00405294          0           -11331.82266

Table [5.4.2.1.2]. Optimization parameters and results.
Idealization:
  Forward rate [s-1]   Backward rate [s-1]   MaxJump
  0.25*                0.01*                 12

Kinetics estimation:
  TruncOrder   MaxJump   Backward rate [s-1]
  5            3         0.0001

Table [5.4.2.1.3]. Idealization and optimization parameters common to all experiments. For
experiments where all initial parameters for idealization were arbitrary, the forward rate was 1 s-1
and the backward rate was 0.1 s-1.
One of the most important properties of the idealization, or of any algorithm, is
convergence to a stable solution, i.e., self-consistency. With experimental data, one does not
know the true values of the optimized parameters. The algorithm must then be started from initial,
guessed parameters. The initialization of parameters is arbitrary, and so it is important that, in
general, different initial values lead to the same solution. In other words, it is desirable that the
algorithm have a wide basin for the optimal solution attractor. This problem is connected with the
smoothness of the likelihood surface. A rough surface translates into many local maxima at which
the algorithm can get stuck, although introducing randomness into the optimizer, as in simulated
annealing, can increase the chance of finding a global optimum.
In the first experiment, all parameters were assigned initial (arbitrary) values within
approximately 20% of the true values. I recorded the convergence of the idealization algorithm,
focusing on JumpAmp, JumpStd, MeasStd, BaseAmp and LL (log-likelihood). Figures
[5.4.2.1.3] through [5.4.2.1.10] show the results. For SNR 2 with JumpStd 0.2, and for SNR 1, the
algorithm did not converge (results not shown). For the other noise conditions (SNR 2 to 10,
JumpStd 0 to 0.2), the convergence is robust.
A second set of experiments determined the importance of each parameter. All but one
parameter were assigned their respective true values. The remaining parameter was assigned different
initial values and the convergence observed. BaseAmp appears to have no effect at all,
irrespective of SNR (results not shown). In fact, in a preliminary step, BaseAmp is compared
against the first data point. If the absolute difference between the two is larger than two
measurement noise standard deviations (MeasStd), then BaseAmp is re-assigned the value of the
first data point. Thus, a value of BaseAmp lying within two standard deviations of the first data
point always results in convergence to the same solution.
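The preliminary check described above can be sketched directly (the function name is illustrative):

```python
def init_base_amp(base_amp_guess, first_point, meas_std):
    """If the guessed baseline is more than two measurement standard deviations
    away from the first data point, re-assign it to the first data point."""
    if abs(base_amp_guess - first_point) > 2.0 * meas_std:
        return first_point
    return base_amp_guess

print(init_base_amp(0.5, 0.02, 0.1))    # 0.02 -- guess rejected, re-assigned
print(init_base_amp(0.15, 0.02, 0.1))   # 0.15 -- guess within 2*MeasStd, kept
```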
MeasStd has a very wide range over which the algorithm converges to the same optimal
solution (results not shown). For example, for data with SNR 10 and any JumpStd (0.0, 0.1 or
0.2), any value of MeasStd between 1E-10 and 2 gave convergence (the true value is 0.1). For SNR 2
and JumpStd 0.0 or 0.1, the convergence interval was (0.01, 2.0), for a true value of 0.5. Clearly,
any initial guess of MeasStd one is likely to make will fall within this wide interval.
JumpAmp is the most critical of all parameters. Its attraction basin is wide for high SNR
but gets narrower as the SNR degrades, and especially as the jump variance rises. Figures [5.4.2.1.11]
through [5.4.2.1.18] show the convergence for different starting values of JumpAmp. At lower
SNR, multiple attractors become common. Submultiples of the true value (1/2, 1/3, 1/4) have
their own attractors. These situations can be easily recognized and corrected: the indicator is the
absence of first-order jumps in the idealization, which is statistically highly improbable.
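The corrective check can be automated: at a submultiple attractor every true jump idealizes as order 2 or higher, so the absence of first-order jumps flags the problem (a minimal sketch; the function is illustrative):

```python
import numpy as np

def looks_like_submultiple(idealized_levels):
    """Flag an idealization whose non-zero jumps are all of order >= 2 --
    the statistically improbable signature of a submultiple JumpAmp attractor."""
    orders = np.abs(np.diff(np.asarray(idealized_levels)))
    orders = orders[orders > 0]
    return bool(orders.size) and not np.any(orders == 1)

print(looks_like_submultiple([0, 2, 2, 4, 6]))   # True: only 2nd-order jumps
print(looks_like_submultiple([0, 1, 1, 2, 4]))   # False: 1st-order jumps present
```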
Although an exhaustive search of the parameter space is impossible, it appears that the
attraction basin is wide enough to make the algorithm relatively robust with respect to the
choice of initial values. For all parameters, it is clear that as the SNR degrades, more iterations
are necessary to reach the convergence point.
It is common sense that idealization will be more accurate for longer dwell durations
(the average time kinesin spends in one position). Longer levels can be tracked better and the
statistics become more robust. Other than by modifying the intrinsic rate of the process (the kinetics),
longer dwells can be achieved only by a faster sampling rate, at higher bandwidth.
Alas, faster sampling implies that fewer photons contribute to each sample, and fewer photons
mean higher measurement noise. The variance of the measurement noise is inversely proportional
to the number of photons collected per sample. Thus, one may have longer dwells only at the
price of higher noise. Conversely, one may get lower measurement noise by slowing the
sampling rate, at the price of increasing the fraction of higher-order jumps. Higher-order jumps
are especially adverse when the jump variance is severe. For example, the variance of a fourth-order
jump is four times larger than the variance of a single jump. With such large variance, the
prediction becomes less accurate and misdetections are likely.
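The tradeoff can be made concrete with the noise scaling used in Table [5.4.2.1.4]: halving the sampling time halves the photons per point, doubling the noise variance, so MeasStd scales as the square root of the sampling rate (a sketch; the reference values are the SNR 5 row of the table):

```python
import numpy as np

ref_std, ref_dt = 0.2, 500.0        # reference: MeasStd 0.2 at 500 ms sampling
for dt in (2000.0, 1000.0, 500.0, 250.0, 125.0):
    meas_std = ref_std * np.sqrt(ref_dt / dt)   # variance ~ 1/(photons per point)
    print(f"{dt:6.0f} ms  MeasStd {meas_std:.6f}")
# reproduces the MeasStd column: 0.1, 0.141421, 0.2, 0.282843, 0.4
```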
The question therefore arises as to the optimal compromise between measurement
noise and the fraction of higher-order jumps. I did experiments to answer this question. Since the
lifetime of the fluorophore remains constant, irrespective of sampling, a doubling of the sampling
frequency implied a doubling of the data length, a halving of the frequency implied a halving of the
length, and so on. In effect, each data set had the same average number of jumps. The simulated
data with SNR 5 (JumpStd 0.0, 0.1 and 0.2) were the reference, with a data length of 300 points
and a sampling time of 500 ms. Table [5.4.2.1.4] shows the simulation parameters for each generated
data set.
Figure [5.4.2.1.19] shows the convergence estimates of JumpAmp, MeasStd and the
forward rate. Although it appears that the measurement noise is estimated better when the dwells
are longer, the jump amplitude and the forward rate constant are estimated much better when the
dwells are shorter but the noise is lower. It should be noted that the data set with the lowest noise
(SNR 10) corresponds to an average of only 2 points per dwell. Pushing towards even lower
noise is likely to have a negative effect.
SNR     PtsPerSeg  SegCnt  Sampling [msec]  ForRate [sec-1]  BackRate [sec-1]  JumpAmp  JumpStd  MeasStd
10      75         100     2000             0.25             0                 1        0.0      0.1
10*     75         100     2000             0.25             0                 1        0.1      0.1
10**    75         100     2000             0.25             0                 1        0.2      0.1
7.07    150        100     1000             0.25             0                 1        0.0      0.141421
7.07*   150        100     1000             0.25             0                 1        0.1      0.141421
7.07**  150        100     1000             0.25             0                 1        0.2      0.141421
5       300        100     500              0.25             0                 1        0.0      0.2
5*      300        100     500              0.25             0                 1        0.1      0.2
5**     300        100     500              0.25             0                 1        0.2      0.2
3.5     600        100     250              0.25             0                 1        0.0      0.282843
3.5*    600        100     250              0.25             0                 1        0.1      0.282843
3.5**   600        100     250              0.25             0                 1        0.2      0.282843
2.5     1200       100     125              0.25             0                 1        0.0      0.4
2.5*    1200       100     125              0.25             0                 1        0.1      0.4
2.5**   1200       100     125              0.25             0                 1        0.2      0.4

Table [5.4.2.1.4]. Simulation parameters.
[Figure: simulated traces at SNR 10, 10*, 10**, 5, 5*, 5**, 2, 2*, 2** and 1; scale bars 1 pA, 2 s]

Figure [5.4.2.1.2]. Simulated data with different SNR. The true idealization (the red line) is
overlapped on top of each trace to make visible the deviation due to jump variance.
JumpAmp is 1 in each trace. MeasStd is 0.1, 0.2, 0.5 and 1.0, for SNR 10, 5, 2 and 1,
respectively. One star corresponds to JumpStd 0.1, two stars to JumpStd 0.2 and no star to
JumpStd 0.0.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.3]. Effect of starting values on idealization convergence. Simulated data with
SNR 10 and JumpStd 0.0. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
In all cases, the value of the estimated parameters at convergence was the true value, or
very close to it.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.4]. Effect of starting values on idealization convergence. Simulated data with
SNR 10 and JumpStd 0.1. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
In all cases, the value of the estimated parameters at convergence was the true value, or
very close to it.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.5]. Effect of starting values on idealization convergence. Simulated data with
SNR 10 and JumpStd 0.2. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
In all cases, the value of the estimated parameters at convergence was the true value, or
very close to it.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.6]. Effect of starting values on idealization convergence. Simulated data with
SNR 5 and JumpStd 0.0. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
In all cases, the value of the estimated parameters at convergence was the true value, or
very close to it.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.7]. Effect of starting values on idealization convergence. Simulated data with
SNR 5 and JumpStd 0.1. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
In all cases, the value of the estimated parameters at convergence was the true value, or
very close to it.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.8]. Effect of starting values on idealization convergence. Simulated data with
SNR 5 and JumpStd 0.2. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
In all cases, the value of the estimated parameters at convergence was the true value, or
very close to it.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration, for several starting values]

Figure [5.4.2.1.9]. Effect of starting values on idealization convergence. Simulated data with
SNR 2 and JumpStd 0.0. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
This appears to be the lowest SNR worth analyzing.
[Figure: convergence of JumpAmp (A), JumpStd (B), MeasStd (C), BaseAmp (D) and LL (E) vs. iteration; panel title: SNR 2* (JumpStd 0.1, kv 0.05)]

Figure [5.4.2.1.10]. Effect of starting values on idealization convergence. Simulated data with
SNR 2 and JumpStd 0.1. A JumpAmp. B JumpStd. C MeasStd. D BaseAmp. E Log-likelihood.
The kv factor was 0.05, instead of 0.1 for the other experiments (equation [5.3.1.1.9]).
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B (true / estimated): JumpAmp 1.0 / 1.00011; JumpStd 0.1 / 0.0990693; MeasStd 0.1 / 0.0935188; BaseAmp 0.0 / 0.00143537; IDL_LL 16873.5888; JumpRate 0.25 / 0.243038 ± 0.016674 (StdErr); KIN_LL -11329.02195.

Figure [5.4.2.1.12]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 10 and JumpStd 0.1. A Convergence of estimated JumpAmp. B Estimated parameters
corresponding to the convergence point. C Idealization corresponding to the convergence
point.
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B (true / estimated): JumpAmp 1.0 / 1.0045; JumpStd 0.2 / 0.182314; MeasStd 0.1 / 0.0975591; BaseAmp 0.0 / 0.00143537; IDL_LL 15851.34326; JumpRate 0.25 / 0.242633 ± 0.016688 (StdErr); KIN_LL -11338.0933.

Figure [5.4.2.1.13]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 10 and JumpStd 0.2. A Convergence of estimated JumpAmp. B Estimated parameters
corresponding to the convergence point. C Idealization corresponding to the convergence
point.
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B (true / estimated): JumpAmp 1.0 / 0.999997; JumpStd 0.0 / 0.0889965; MeasStd 0.2 / 0.186533; BaseAmp 0.0 / 0.0046777; IDL_LL 3495.257889; JumpRate 0.25 / 0.243038 ± 0.016674 (StdErr); KIN_LL -11332.4876.

Figure [5.4.2.1.14]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 5 and JumpStd 0.0. A Convergence of estimated JumpAmp. B Estimated parameters
corresponding to the convergence point. C Idealization corresponding to the convergence
point.
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B (true / estimated): JumpAmp 1.0 / 0.999395; JumpStd 0.1 / 0.125505; MeasStd 0.2 / 0.188138; BaseAmp 0.0 / 0.00473288; IDL_LL -3604.803269; JumpRate 0.25 / 0.243984 ± 0.016641 (StdErr); KIN_LL -11358.96023.

Figure [5.4.2.1.15]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 5 and JumpStd 0.1. A Convergence of estimated JumpAmp. B Estimated parameters
corresponding to the convergence point. C Idealization corresponding to the convergence
point.
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B, best convergence point (true / estimated): JumpAmp 1.0 / 0.999827; JumpStd 0.2 / 0.195756; MeasStd 0.2 / 0.194197; BaseAmp 0.0 / 0.0122417; IDL_LL -4222.133304; JumpRate 0.25 / 0.244457 ± 0.016626 (StdErr); KIN_LL -11406.03537.

Figure [5.4.2.1.16]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 5 and JumpStd 0.2. Two close convergence points. A Convergence of estimated
JumpAmp. B Estimated parameters corresponding to the best convergence point. C
Idealization corresponding to the best convergence point.
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1 (0.01)). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B (true / estimated): JumpAmp 1.0 / 0.999261; JumpStd 0.0 / 0.2326; MeasStd 0.5 / 0.479004; BaseAmp 0.0 / 0.0176407; IDL_LL -28024.97234; JumpRate 0.25 / 0.241957 ± 0.016711 (StdErr); KIN_LL -11236.97433.

Figure [5.4.2.1.17]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 2 and JumpStd 0.0. A Convergence of estimated JumpAmp. B Estimated parameters
corresponding to the convergence point. C Idealization corresponding to the convergence
point.
[Figure: A Convergence of JumpAmp vs. iteration for several starting values (BaseStd 0.001, kv 0.1). C Idealized trace; scale bars 1 pA, 5 s.]

Panel B, best convergence point (true / estimated): JumpAmp 1.0 / 0.993487; JumpStd 0.1 / 0.249441; MeasStd 0.5 / 0.480998; BaseAmp 0.0 / 0.0185922; IDL_LL -28094.42772; JumpRate 0.25 / 0.244254 ± 0.016632 (StdErr); KIN_LL -11327.69503.

Figure [5.4.2.1.18]. Effect of initial JumpAmp values on idealization. Simulated data with
SNR 2 and JumpStd 0.1. Two close convergence points. A Convergence of estimated
JumpAmp. B Estimated parameters corresponding to the best convergence point. C
Idealization corresponding to the best convergence point.
[Figure: convergence values of JumpAmp (A), forward rate (B) and MeasStd (C) vs. SNR (10, 7.07, 5, 3.5, 2.5), for JumpStd 0.0, 0.1 and 0.2]

Figure [5.4.2.1.19]. Compromise between measurement noise and jump order. Simulated data with different SNR, JumpStd 0.0, 0.1 and
0.2. A doubling in SNR corresponds to an average dwell duration (in data points) four times smaller. Likewise, a halving in SNR
corresponds to an average dwell duration four times larger. A Convergence value of JumpAmp at different SNR. B Convergence value
of the forward rate at different SNR. C Convergence value of MeasStd at different SNR. The thin black line is the true value. The thick red,
blue and green lines are the estimates from data with JumpStd 0.0, 0.1 and 0.2, respectively. If the simulated JumpStd is 0.0 or 0.1, the
estimates of JumpAmp and the forward rate are very accurate, regardless of SNR and average dwell duration. For simulated JumpStd 0.2,
the accuracy of the two estimates drops, especially at low SNR. As expected, the accuracy of the MeasStd estimate is better for longer
dwells, even though the SNR is low.
5.4.2.2 Idealization of experimental data
The experimental data were all generated by Dr. Selvin’s laboratory with an irreversible,
one state per position model. The files were recorded with a sampling time of 500 ms. The jump
amplitude varied, taking values of ~7.5 nm, ~13 nm and ~30 nm. Some files were generated with
equal steps, of ~ 8 points. Both the stochastic and the deterministic steps files correspond to a
forward rate of 0.25s-1. The experiment mimicked the characteristics of the signal obtained when
the fluorophore is positioned on the body of the kinesin molecule. The amplitude of each step was
meant to be constant. Nevertheless, for reasons explained previously, the jumps appear to have a
non-zero variance. There is no significant baseline drift visible.
Figures [5.4.2.2.1] through [5.4.2.2.18] show idealization and kinetics estimation results.
As with simulated data, JumpAmp has a wide attraction basin in the low noise records, which
narrows down and presents multiple maxima in high noise records. The kv parameter determines
the number of these maxima. A value of 0.1 induces multiple maxima, while a value of 0.5
often eliminates all but one. However, the remaining maximum is not necessarily the one that
the human eye would judge best. Overall, the idealization works even for the highest-noise data.
What seems to have the most negative effect is the jump variance. Visual inspection indicates a
non-Gaussian distribution of jumps, almost bimodal. Hopefully, data from true kinesin
experiments will be better!
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 30nm_1_1_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters at the convergence point (JumpAmp 2.847, JumpStd 0.166, MeasStd 0.300, BaseAmp -0.0069, IDL_LL -134.56; JumpRate true 0.25, estimated 0.2557 ± 0.1890, KIN_LL -89.06); C, idealized trace (scale: 2 pA, 10 s).]
Figure [5.4.2.2.1]. Idealization and kinetics estimation of experimental data. JumpAmp ~30
nm. A Convergence of estimated JumpAmp. B Estimated parameters corresponding to the
convergence point. C Idealization corresponding to the convergence point. Here and in the rest
of figures, the idealization(s) in C refer to the convergence trajectories represented by the blue
lines in A.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 13nm_1_1_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the two convergence points (JumpAmp 10.48 and 11.88); C, idealized traces (scale: 10 pA, 2 s).]
Figure [5.4.2.2.2]. Idealization and kinetics estimation of experimental data. JumpAmp ~13
nm. Two convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the two convergence points. C Idealizations corresponding to the
two convergence points. In the top trace, one jump is detected as second-order, and another
one is one point long, with a very small amplitude. The second convergence point should thus
be preferred.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 13nm_1_1_NoOffset.qdf (BaseStd 0.001, Kv 1); B, true vs. estimated parameters for the two convergence points (JumpAmp 10.54 and 11.88); C, idealized traces (scale: 10 pA, 2 s).]
Figure [5.4.2.2.3]. Idealization and kinetics estimation of experimental data. JumpAmp ~13
nm. Two convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the two convergence points. C Idealizations corresponding to the
two convergence points.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 13nm_R_1_2_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the two convergence points (JumpAmp 11.80 and 12.47); C, idealized traces (scale: 10 pA, 5 s).]
Figure [5.4.2.2.4]. Idealization and kinetics estimation of experimental data. JumpAmp ~13
nm, regular steps. Two convergence points. A Convergence of estimated JumpAmp. B
Estimated parameters corresponding to the two convergence points. C Idealizations
corresponding to the two convergence points. The first trace has a one-point long level, falsely
detected (encircled area, three levels before the end of trace). The second convergence point
should be preferred.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 13nm_R_1_2_NoOffset.qdf (BaseStd 0.001, Kv 0.5); B, true vs. estimated parameters at the single convergence point (JumpAmp 12.50); C, idealized trace (scale: 10 pA, 5 s).]
Figure [5.4.2.2.5]. Idealization and kinetics estimation of experimental data. JumpAmp ~13
nm, regular steps. Only one convergence point remains. A Convergence of estimated
JumpAmp. B Estimated parameters corresponding to the convergence point. C Idealizations
corresponding to the convergence point.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 13nm_R_1_3_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the two convergence points (JumpAmp 12.20 and 12.68); C, idealized traces (scale: 10 pA, 5 s).]
Figure [5.4.2.2.6]. Idealization and kinetics estimation of experimental data. JumpAmp ~13
nm, regular steps. Two convergence points. A Convergence of estimated JumpAmp. B
Estimated parameters corresponding to the two convergence points. C Idealizations
corresponding to the two convergence points. Both traces present idealization mistakes (each
dwell should be 8 points long). The bottom trace skips a level.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 13nm_R_1_3_NoOffset.qdf (BaseStd 0.001, Kv 0.5); B, true vs. estimated parameters for the three convergence points (JumpAmp 10.50, 11.06 and 12.19); C, idealized traces (scale: 5 pA, 5 s).]
Figure [5.4.2.2.7]. Idealization and kinetics estimation of experimental data. JumpAmp ~13
nm, regular steps. Three convergence points. A Convergence of estimated JumpAmp. B
Estimated parameters corresponding to the three convergence points. C Idealizations
corresponding to the three convergence points. All traces show idealization mistakes (each
dwell should be 8 points long).
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_1_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the three convergence points (JumpAmp 5.79, 7.84 and 8.26); C, idealized traces (scale: 5 pA, 5 s).]
Figure [5.4.2.2.8]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. Multiple convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the convergence points. C Idealizations corresponding to the
convergence points. All traces show idealization mistakes (there should be no downward
jumps).
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_1_NoOffset.qdf (BaseStd 0.001, Kv 0.5); B, true vs. estimated parameters for the three convergence points (JumpAmp 7.83, 8.26 and 8.42); C, idealized traces (scale: 5 pA, 5 s).]
Figure [5.4.2.2.9]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. Multiple convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the convergence points. C Idealizations corresponding to the
convergence points. All traces show idealization mistakes (there should be no downward
jumps).
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_2_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters at the single convergence point (JumpAmp 8.56); C, idealized trace (scale: 10 pA, 10 s).]
Figure [5.4.2.2.10]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. A Convergence of estimated JumpAmp. B Estimated parameters corresponding to the
convergence point. C Idealizations corresponding to the convergence point. The idealization
shows mistakes (there should be no downward jumps).
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_2_NoOffset.qdf (BaseStd 0.001, Kv 0.05); B, true vs. estimated parameters for the three convergence points (JumpAmp 6.65, 8.21 and 8.55); C, idealized traces (scale: 5 pA, 10 s).]
Figure [5.4.2.2.11]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. Multiple convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the convergence points. C Idealizations corresponding to the
convergence points. All traces show idealization mistakes (there should be no downward
jumps).
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_3_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the two convergence points (JumpAmp 7.65 and 8.39); C, idealized traces (scale: 5 pA, 10 s).]
Figure [5.4.2.2.12]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. Two convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the convergence points. C Idealizations corresponding to the
convergence points. Both traces show idealization mistakes (there should be no downward
jumps).
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_4_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the two convergence points (JumpAmp 7.68 and 8.08); C, idealized traces (scale: 5 pA, 2 s).]
Figure [5.4.2.2.13]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. Two convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the convergence points. C Idealizations corresponding to the
convergence points. Both traces show idealization mistakes.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_1_4_NoOffset.qdf (BaseStd 0.001, Kv 0.5); B, true vs. estimated parameters for the two convergence points (JumpAmp 7.68 and 8.08); C, idealized traces (scale: 5 pA, 2 s).]
Figure [5.4.2.2.14]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm. Multiple convergence points. A Convergence of estimated JumpAmp. B Estimated
parameters corresponding to the convergence points. C Idealizations corresponding to the
convergence points. Both traces show idealization mistakes.
[Figure panel: A, convergence of estimated JumpAmp vs. iteration for 8nm_R_1_1_NoOffset.qdf (BaseStd 0.001, Kv 0.1), showing multiple convergence points.]
Figure [5.4.2.2.15]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm, regular steps. Multiple convergence points. A Convergence of estimated JumpAmp.
Data, idealization and estimated parameters are not shown, since no choice can be made. The
value of kv (0.1) is too small in this case (see next figure) and the idealization is very sensitive
to initial conditions.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_R_1_1_NoOffset.qdf (BaseStd 0.001, Kv 0.5); B, true vs. estimated parameters for the three convergence points (JumpAmp 5.93, 6.13 and 6.44); C, idealized traces (scale: 2 pA, 10 s).]
Figure [5.4.2.2.16]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm, regular steps. Multiple convergence points. A Convergence of estimated JumpAmp. B
Estimated parameters corresponding to the convergence points. C Idealizations corresponding
to the convergence points. Both traces show idealization mistakes. Compared to the previous
figure, kv was increased to 0.5, and the idealization is less sensitive to initial conditions.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_R_1_2_NoOffset.qdf (BaseStd 0.001, Kv 0.1); B, true vs. estimated parameters for the two convergence points (JumpAmp 6.41 and 6.90); C, idealized traces (scale: 5 pA, 5 s).]
Figure [5.4.2.2.17]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm, regular steps. Multiple convergence points. A Convergence of estimated JumpAmp. B
Estimated parameters corresponding to the convergence points. C Idealizations corresponding
to the convergence points. Both traces show idealization mistakes. Bottom trace skips a level.
[Figure panels: A, convergence of estimated JumpAmp vs. iteration for 8nm_R_1_2_NoOffset.qdf (BaseStd 0.001, Kv 0.5); B, true vs. estimated parameters at the single convergence point (JumpAmp 6.92); C, idealized trace (scale: 5 pA, 5 s).]
Figure [5.4.2.2.18]. Idealization and kinetics estimation of experimental data. JumpAmp ~8
nm, regular steps. A single convergence point. A Convergence of estimated JumpAmp. B
Estimated parameters corresponding to the convergence point. C Idealizations corresponding
to the convergence point. The trace skips a level. Compared to the previous figure, a higher
value of kv (0.5), gives a single convergence point, irrespective of initial conditions.
5.5 Discussion
It is interesting to note how the idealization algorithm has stable maxima not only for the
true value of the jump amplitude but also for its submultiples of order 2, 3, 4, etc. For example, a
JumpAmp equal to 1/2 of the true value gives apparent kinetics twice as fast, 1/3 gives kinetics
three times as fast, and so on. Confusion can be avoided, however, by inspecting the re-estimated Atr matrix. Figure [5.5.1]
shows the structure of the matrices corresponding to the estimated JumpAmp equal to 1/1, 1/2
and 1/3 of the true jump amplitude. For 1/1, the transition probabilities show an exponential
decay. For 1/2, they are interspersed with zeroes, caused by lack of evidence in the data for
transitions between consecutive states, and so on. In fact, a non-smooth shape of the Atr matrix
might indicate convergence to a non-optimal solution.
[Plot: a[i,i+k] vs. jump order k (0 to 5), for JumpAmp equal to 1/1, 1/2 and 1/3 of the true value.]
Figure [5.5.1]. Structure of the Atr matrix. JumpAmp is equal to 1/1, 1/2, and 1/3 of the true
value. a[i,i+k] is the entry in the Atr matrix equal to the transition probability between state i
and state i+k, for any i, k>=0. Irreversible model with one kinetic state per unit.
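The smooth decay of a[i,i+k] for the correct JumpAmp can be reproduced directly: for an irreversible staircase model, the rows of Atr = exp(Qtr·dt) follow a Poisson law in the jump order k. A minimal sketch (the model size is arbitrary; the forward rate 0.25 s⁻¹ and sampling time 0.5 s are taken from the experimental files described earlier):

```python
import numpy as np
from scipy.linalg import expm

# Irreversible staircase model: forward rate 0.25/s, sampling time 0.5 s.
n, kf, dt = 12, 0.25, 0.5
Q = np.zeros((n, n))
for i in range(n - 1):
    Q[i, i + 1] = kf      # forward transition i -> i+1
    Q[i, i] = -kf         # diagonal: each row of Q sums to zero
A = expm(Q * dt)

a_k = A[0, :6]            # a[i, i+k] for k = 0..5 (same for every interior i)
# For the true JumpAmp this profile decays smoothly (Poisson with mean kf*dt);
# zeros interspersed in the profile would indicate a submultiple solution.
```

Here a_k[0] equals exp(-kf·dt), the probability of no jump within one sampling interval, and each higher jump order is progressively less likely.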
Only for reversible models can the Atr matrix be calculated from Qtr by spectral
decomposition. For irreversible models, Qtr has confluent eigenvalues. One must then use small,
non-zero backward rates, to force reversibility. Using a reversible Atr in the idealization
algorithm has the potential for false detection of downward jumps, when, in fact, such jumps are
known not to happen. The Atr matrix must therefore be prepared by zeroing the entries corresponding to
backward transitions and renormalizing each row. It is a property of the Forward-Backward
procedure that zero entries in the A matrix stay zero throughout the re-estimation. I have tested
both reversible and irreversible Atr matrices, with mixed results (not shown). An irreversible Atr
avoids, indeed, false detection of downward jumps. On the other hand, with high noise data, it
may occasionally result in less accurate parameters. If it is known a priori that only upward
jumps can occur, either an irreversible Atr must be used, or the false downward jumps detected
must be somehow eliminated.
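The preparation step described above might be sketched as follows; this is an illustrative implementation, not the thesis code, and the function name and the eps value are assumptions:

```python
import numpy as np
from scipy.linalg import expm

def prepare_irreversible_Atr(Qtr, dt, eps=1e-6):
    """Force reversibility with small, non-zero backward rates (so spectral methods
    apply), compute Atr = exp(Q*dt), then zero the backward entries and
    renormalize each row."""
    Q = Qtr.astype(float).copy()
    n = Q.shape[0]
    for i in range(1, n):
        if Q[i, i - 1] == 0.0:
            Q[i, i - 1] = eps            # small backward rate forcing reversibility
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # rows of Q must sum to zero
    A = expm(Q * dt)
    A = np.triu(A)                       # zero entries for backward transitions
    return A / A.sum(axis=1, keepdims=True)   # row-wise renormalization
```

Because zero entries in A stay zero throughout Forward-Backward re-estimation, the zeroed backward entries remain zero during idealization.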
When estimating kinetics for irreversible models, or when the analyzed data lacks jumps
in one direction, the model can be simplified. Thus, either the left or the right block states can be
eliminated, since those states will never be occupied in the likelihood chain [5.3.2.1.1]. The Atr
truncation error will likely be negligible.
Although it was not my objective here, further work is required to determine
parameter identifiability. In other words, how many free parameters can be estimated when the
model has all states aggregated in one class? Clearly, at least two parameters can be estimated
from reversible data, because there is actually evidence for each: the upward and the downward
jumps.
6. Ensemble hidden Markov models. Applications to kinetic analysis of
whole-cell recordings
Ion channel kinetics is traditionally estimated from macroscopic currents by the method
of fitting exponentials. The macroscopic current relaxes towards equilibrium as a weighted sum
of NS-1 exponentials, where NS is the number of states in the model (Colquhoun and Hawkes,
1977). The time constants of the exponentials are the inverses of the Q matrix eigenvalues, and
the weighting factors are functions of the Q matrix eigenvectors. From the optimized time
constants and weighting factors, it is not easy to obtain the corresponding Q matrix (and thus the
rate constants), and, therefore, it is tempting to use single channel algorithms.
An ensemble of independent, identical Markov models is yet another Markov model
(Horn and Lange, 1983), hence single channel algorithms can be easily extended to multiple
channels (see, for example, Qin et al, 1996), but are practically useless when the channel count is
greater than e.g. 10. NC individual channels, each a Markov model, give a superposition model
(meta-model) with N S
NC
states (meta-states). For example, an ensemble with 10 two-state
channels has 1024 meta-states, but an ensemble of 100 channels has ~1E30 meta-states! The
average whole-cell patch may have a few orders of magnitude more channels. Clearly, these
numbers are not amenable to treatment by single channel algorithms.
I propose here an approximate maximum likelihood method that estimates directly the
rate constants, using hidden Markov models. Optionally, the method can estimate simultaneously
the rate constants and the number of channels in the patch. This is, to my knowledge, a novel
procedure. I refer to this new algorithm as Mac (Macroscopic currents optimizer). The principle
of the method is to model the observed current at any time t, It, as a stochastic quantity,
approximated by a Gaussian. The mean and variance of this Gaussian are functions of
conductance parameters, starting probabilities, rate constants and channel count. The critical
assumption is that there is no correlation between any two data points. In other words, the state
probabilities at any time t are a function of the initial probabilities only.
The user must provide a kinetic model, conductance parameters (i.e. single channel
current amplitude and noise for each conductance class), and initial estimates for rate constants
and channel count. The initial state probabilities can be user-input, or can be calculated by the
program when conditioning-test data are analyzed. Hence, the ion channel population is assumed
to be driven to equilibrium under the conditioning experimental conditions, and to relax towards a
new equilibrium under the test conditions. Thus, the state probabilities at the beginning of test can
be calculated as the equilibrium occupancies of conditioning. In principle, any kind of
experimental protocol can be analyzed, i.e. a sequence of two or more rectangular pulses, with
defined duration and value (ligand concentration, voltage etc). As with the MIP algorithm
described in Chapter 3, linear constraints on rate constants can be imposed, thus reducing the
number of free parameters. The algorithm provides standard errors of the estimates.
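The equilibrium occupancies used as initial state probabilities can be computed by solving Peq·Q = 0 together with the normalization constraint that the elements of Peq sum to 1, e.g. as an augmented least-squares system. A minimal sketch with a hypothetical three-state Q:

```python
import numpy as np

def equilibrium(Q):
    """Equilibrium state probabilities Peq, satisfying Peq @ Q = 0 and sum(Peq) = 1,
    solved as an augmented least-squares system."""
    ns = Q.shape[0]
    A = np.vstack([Q.T, np.ones(ns)])   # Q^T Peq = 0, plus normalization row
    b = np.zeros(ns + 1)
    b[-1] = 1.0
    Peq, *_ = np.linalg.lstsq(A, b, rcond=None)
    return Peq

# Hypothetical generator matrix (rates in 1/s); rows sum to zero.
Q = np.array([[-500.0, 500.0, 0.0],
              [200.0, -300.0, 100.0],
              [0.0, 50.0, -50.0]])
Peq = equilibrium(Q)
```

For this reversible chain, detailed balance gives Peq proportional to (1, 2.5, 5), i.e. Peq[0] = 2/17.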
6.1 Ensemble hidden Markov model
I consider here an ensemble of NC identical, independent (non-interacting) ion channels.
The single channel model has NS states. At any time t, each of the NC channels is in one of the NS
states. Let Nti be the number of channels in state i, at time t. I define the state partition vector, Xt,
as follows:
xti = Nti / NC   [6.1.1]

Σi xti = (1/NC) · Σi Nti = 1   [6.1.2]
The element i of the state partition vector, xti, is equal to the fraction of channels in state i at time
t. The elements of Xt sum to 1. Xt takes discrete values, but it is continuous in the limit case of an
infinite number of channels, when it is equivalent to the state probability vector of a single
channel, Pt:
X  Pt
[6.1.3]
t
NC 
xti  Pstatet  i 
[6.1.4]
The equations describing the time dependence of an ensemble of molecules have
the same form as the equations for a single molecule, as mentioned in Chapter 2.1.1. It follows
that the state partition vector Xt can be calculated from the initial state probability vector, as in
equation [2.1.1.32]:
Xt  X0  expQ  t 
T
T
[6.1.5]
6.1.1 Conductance properties of an ensemble of ion channels
The signal and noise components in a single channel patch clamp record were presented
in Figure [4.3.1], and discussed in Chapter 4.3. In the following, I assume the signal and noise to
have characteristics constant in time. Furthermore, all Gaussian noise is assumed to be non-correlated (white).
The current given by a single channel when in state i can be modeled as a random
Gaussian variable with mean μi and variance Vyi. μi and Vyi are, respectively, the mean of the
excess current and the variance of the excess noise of state i. By definition, all states in the closed
conductance class have zero excess current and zero excess noise. Let μ and Vy be the vectors of
dimension NS, with elements μi and Vyi as defined before. The baseline current is another random
Gaussian variable with mean μ0 and variance Vy0. Since a sum of random Gaussian variables is
another random Gaussian variable, the current given at time t by an ensemble of ion channels has
a Gaussian distribution, as follows:

It ~ N(μt, Vty)   [6.1.1.1]

μt = μ0 + Σi=1..NS Nti · μi   [6.1.1.2]

Vty = Vy0 + Σi=1..NS Nti · Vyi   [6.1.1.3]

μt = μ0 + NC · Σi=1..NS xti · μi   [6.1.1.4]

Vty = Vy0 + NC · Σi=1..NS xti · Vyi   [6.1.1.5]
The mean t and the variance and Vty are functions of the state partition vector (Xt), channel
count (NC), baseline current (0, Vy0) and single channel current mean and variance vectors ( and
Vy). Equations [6.1.1.4] and [6.1.1.5] can be written in a more compact, vectorial form, as
follows:
 t   0  N C  Xt  μ
[6.1.1.6]
Vt y  V y 0  N C  Xt  V
[6.1.1.7]
T
T
250
Equations [6.1.1.6] and [6.1.1.7] give the mean and the variance of the current recorded at time t,
from an ensemble of ion channels, given a state partition vector Xt.
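Equations [6.1.1.6] and [6.1.1.7] might be evaluated as follows; the conductance values mirror the simulations of section 6.3 (open-state excess current 1 pA, excess noise SD ~0.2236, baseline SD 0.2 pA, NC = 1000), and current_moments is a hypothetical helper name:

```python
import numpy as np

# Hypothetical 3-state model with one open state (state 3).
mu = np.array([0.0, 0.0, 1.0])          # excess current vector (pA)
Vy = np.array([0.0, 0.0, 0.2236 ** 2])  # excess noise variance vector
mu0, Vy0 = 0.0, 0.2 ** 2                # baseline mean and variance
NC = 1000

def current_moments(Xt):
    """Mean and variance of the ensemble current, equations [6.1.1.6]-[6.1.1.7]."""
    return mu0 + NC * Xt @ mu, Vy0 + NC * Xt @ Vy

m, v = current_moments(np.array([0.2, 0.3, 0.5]))  # half the channels open
```

Note that [6.1.1.7] contains only the baseline and excess noise; the state-fluctuation term is added in section 6.1.2.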
6.1.2 State fluctuations of an ensemble of ion channels
As mentioned previously, the state partition vector Xt can be calculated (predicted) from
a known initial state probability vector X0, with equation [6.1.5]. However, the predicted Xt is the
mean of a multinomial probability distribution, as discussed in Chapter 2.1.1 (see equation
[2.1.1.42] and the following), and the actual state vector Xt is expected to have fluctuations
around the mean. The covariance matrix of these multinomial fluctuations, VtMN, is a function of
the number of channels and the expected (average) state partition vector:
VtMN = NC · Vtx   [6.1.2.1]

Vtx is a matrix, the elements of which have the following properties:

Vtx[i,i] = xti · (1 - xti)
Vtx[i,j] = -xti · xtj, for i ≠ j   [6.1.2.2]
It follows that the current recorded at time t has two sources of variance: baseline noise
and state excess noise (equation [6.1.1.7]), and state fluctuations around the mean (equation
[6.1.2.1]). If the multinomial distribution is approximated with a Gaussian (the approximation is
good if NC·xti >> 1), all variance has Gaussian sources. Therefore, the current It can be
approximately described with a Gaussian distribution, conditional on a given multinomial
distribution of the expected state vector Xt, as follows:
μt = μ0 + NC · XtT · μ   [6.1.2.3]

Vt = Vy0 + NC · (μT · Vtx · μ + XtT · Vy)   [6.1.2.4]

NC · μT · Vtx · μ is the linear transformation of the VtMN covariance matrix from the NS x NS state
space into the unidimensional measurement space.
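Equations [6.1.2.3] and [6.1.2.4] can be sketched in the same way; note that when all channels are certainly in one state, Vtx vanishes and only the measurement noise remains (total_moments is a hypothetical helper name):

```python
import numpy as np

def total_moments(Xt, mu, Vy, mu0, Vy0, NC):
    """Mean and total variance of the current It, equations [6.1.2.3]-[6.1.2.4]:
    baseline + excess noise, plus multinomial state fluctuations mapped from the
    state space into the measurement space."""
    Vx = np.diag(Xt) - np.outer(Xt, Xt)   # elements as in equation [6.1.2.2]
    mean = mu0 + NC * Xt @ mu
    var = Vy0 + NC * (mu @ Vx @ mu + Xt @ Vy)
    return mean, var

# With all channels surely in the (noiseless) closed state, Vx = 0 and Xt @ Vy = 0,
# so the total variance reduces to the baseline variance alone.
m, v = total_moments(np.array([1.0, 0.0, 0.0]),
                     np.array([0.0, 0.0, 1.0]),
                     np.array([0.0, 0.0, 0.05]),
                     mu0=0.0, Vy0=0.04, NC=1000)
```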
6.2 Algorithm for ensemble hidden Markov model
As previously mentioned, the algorithm relies on the assumption that the state vector has
no temporal correlation. In other words, the state at any time t is assumed to be a function of the
initial state only. For a given record, this is certainly not true, as the state at time t is always a
function of the previous state at time t-1 (it is a Markov process), regardless of the number of
channels. Determined by the properties of Markov processes, a fit line that minimizes the sum of
square errors may underestimate the variance around the mean. Since a lower variance can be
explained, for example, by a lower NC (see equation [6.1.2.1]), the kinetics and the channel count
estimates will be biased.
The temporal correlation can be reduced if several data sets are analyzed together. At the
limit, the correlation disappears for an infinite number of data sets. In each data set, the state at
time t is still a function of the previous state, but, over all data sets, it is (approximately) a
function of the initial state only. The requirement that several data sets be analyzed together is
easily accommodated by suitable experimental design. In fact, it is always desirable to estimate
parameters from multiple data sets.
6.2.1 Likelihood function
The likelihood function to be optimized is the probability of observing the data sequence
Y, given the model parameters :
L  pY | θ
[6.2.1.1]
The data sequence Y is the recorded current. Since, as discussed, there is no temporal correlation
between data points, the probability of observing the whole data sequence Y is equal to the
product of the individual observation probabilities:
L = Π(t=0..T) p(yt | θ)   [6.2.1.2]
The objective is to find the parameter set θML which maximizes the likelihood:

θML = argmaxθ { Π(t=0..T) p(yt | θ) }   [6.2.1.3]
The likelihood may be an extremely small or an extremely large number. It is therefore preferable to maximize the
logarithm of the likelihood function (the logarithm of a function has its maximum at the same
point as the function itself):
LL = Σ(t=0..T) ln p(yt | θ)   [6.2.1.4]
θML = argmaxθ { Σ(t=0..T) ln p(yt | θ) }   [6.2.1.5]
Since the observation probability p(yt|θ) is a function of time, I will write it as pt(yt|θ).
The observation probability pt(yt|θ) is a Gaussian:

pt(yt | θ) = N(μt, Vt)   [6.2.1.6]

The mean μt and variance Vt are as discussed (equations [6.1.2.3] and [6.1.2.4]).
6.2.2 Parameterization and constraints
I use the same parameterization and constraints as in Chapter 3.1.2. If the number of
channels NC is optimized, it must be constrained to remain positive. I define the
variable u as:

u = ln NC   [6.2.2.1]
Obviously, a lower limit of the channel count can be obtained directly from the data,
by dividing the peak of the observed signal by the single channel amplitude. In this case, the
parameterization should be modified as follows:
u  ln N C
[6.2.2.2]
N C  N Cmin  N C
[6.2.2.3]
254
The estimated channel count, NC, is the sum of a minimum count, NCmin, which is constant,
and an excess count, NC+, which is variable and optimized. In the present work, I did not use a
lower limit, yet the algorithm performed satisfactorily. However, this issue needs to be further
explored.
6.2.3 Computation and derivatives of the likelihood function
All the necessary quantities for the likelihood function are now available. The
observation probability in expression [6.2.1.6], its natural logarithm, and the derivative of the
logarithm with respect to a variable z are the following:

pt(yt | θ) = (1 / sqrt(2π · Vt)) · exp(-(yt - μt)² / (2 · Vt))   [6.2.3.1]

ln pt(yt | θ) = -(1/2) · ln(2π) - (1/2) · ln Vt - (yt - μt)² / (2 · Vt)   [6.2.3.2]

∂ln pt(yt | θ)/∂z = -(1/(2·Vt)) · ∂Vt/∂z + ((yt - μt)/Vt) · ∂μt/∂z + ((yt - μt)²/(2·Vt²)) · ∂Vt/∂z   [6.2.3.3]
The log-likelihood function in expression [6.2.1.4] and its derivative with respect to a
variable z are:
T 1
1 T
1 T y  t 
ln2    lnVt    t
2
2 t 0
2 t 0
Vt
2
LL  
255
[6.2.3.4]
∂LL/∂z = -(1/2) · Σt (1/Vt) · ∂Vt/∂z + Σt ((yt - μt)/Vt) · ∂μt/∂z + (1/2) · Σt ((yt - μt)²/Vt²) · ∂Vt/∂z   [6.2.3.5]
The mean and variance of the observation probability, μt and Vt, are calculated with
expressions [6.1.2.3] and [6.1.2.4]. In the following, I show how to calculate their derivatives,
together with other auxiliary quantities and their derivatives with respect to a variable z:
t N C
X
T

 Xt  μ  N C  t  μ
z
z
z
T


T


Vt N C
V x
Xt
T

 μT  Vtx  μ  Xt  V  N C   μT  t  μ 
 V 
z
z
z
z


[6.2.3.6]
[6.2.3.7]
Xt  X0  At
[6.2.3.8]
At  expQ  t 
[6.2.3.9]
Xt
X0
T A

 At  X0  t
z
z
z
[6.2.3.10]
T
T
T
T
At is calculated from Q using the spectral decomposition technique, indicated in Chapter
3. The derivatives of At and Xt with respect to a variable z (a macroscopic rate qij) are also given
in Chapter 3. The derivative of the Vtx matrix is another matrix with elements:

∂Vtx[i,i]/∂z = (1 - 2·xti) · ∂xti/∂z   [6.2.3.11]

∂Vtx[i,j]/∂z = -((∂xti/∂z) · xtj + xti · (∂xtj/∂z))   [6.2.3.12]
One may notice that the log-likelihood function in expression [6.2.3.4] reduces to a sum
of square errors if the constant terms and the variance-containing terms are eliminated:

SS = (1/2) · Σ(t=0..T) (yt - μt)²   [6.2.3.13]
6.3 Results
I tested the algorithm with simulated data, generated by the three-state model in Figure
[6.3.1]. The model has up to five free parameters to be optimized: four rate constants k12, k21, k23
and k32 and the channel count NC. The standard deviation of the baseline noise is 0.2 pA. The
baseline is 0 pA and it is constant. The excess noise of the open conductance class has a standard
deviation of ~0.2236. The excess amplitude of the open class is 1 pA.
[Model diagram: three states in a chain, 1 ⇄ 2 ⇄ 3, with rates 500.0 and 200.0 s⁻¹ between states 1 and 2, and 100.0 and 50.0 s⁻¹ between states 2 and 3.]
Figure [6.3.1]. Model used for simulation.
6.3.1 Bias of kinetics estimation
The most important characteristic of an optimization algorithm is the bias of the estimate.
I implemented the algorithm with the option of using as cost function either the likelihood in
expression [6.2.3.4], or the sum of square errors in [6.2.3.13]. Due to the various approximations
made, it seems possible that the likelihood version may present a certain bias. The sum of squares
version, however, is expected to have no bias at all. Obviously, without the variance-containing
terms, it is not possible to estimate channel count.
To determine if kinetics estimation is biased, a theoretically infinite amount of data
should be analyzed. Instead, I emulate the infinite data prerequisite with a deterministic
simulation, which has neither measurement noise, nor channel fluctuations. The parameters of the
simulation are given in Table [6.3.1]. The kinetics was estimated with the same model as in
Figure [6.3.1], with the true conductance parameters and channel count, and the rate constants
initialized at their true values. For comparison, I also generated stochastically simulated data:
1000 data sets were simulated and averaged, and the rate constants were estimated from the average.
PtsPerSeg | SegCnt | Sampling [kHz] | Channel Count | Starting State
4000      | 1      | 50             | 1000          | 1
Table [6.3.1]. Simulation parameters. PtsPerSeg is the number of points in a data segment, and
SegCnt is the number of segments.
The results in Figure [6.3.2] show that the two cost functions give practically the same
accurate estimates, hence the algorithm appears to be unbiased when the kinetics alone is
estimated.
Figure [6.3.3] gives a visual measure of the variance around the fitted line. Panel A
shows a single data set, and panel B shows 10 overlapped data sets, fitted together. The fit line is
obtained with the log-likelihood cost function. The variance of data around the fitted curve is
visibly smaller in panel A than in B. Since the estimation of channel count is based on measuring
the variance around the conditional mean, an underestimated variance may lead to an
underestimated channel count.
Figures [6.3.4] and [6.3.5] show the distribution of the estimate for each rate constant,
when finite amounts of data are analyzed. 100 stochastically simulated data sets are generated
with the parameters in Table [6.3.1], and optimized individually for rate constants (channel count
is fixed at the true value). For either the log-likelihood or the sum of squares, the average rates are
very close to the true values, and have a Gaussian distribution.
6.3.2 Simultaneous estimation of kinetics and channel count
With the properties of the kinetics estimation established, I tested the simultaneous
estimation of kinetics and channel count. The log-likelihood function is used in all runs where the
channel count is a free parameter. 100 stochastically simulated data sets are generated with the
parameters in Table [6.3.1], and optimized individually for rate constants and channel count.
Figure [6.3.6] shows the results. As expected, the channel count cannot be reliably obtained from
a single data set and is clearly underestimated. All optimized parameters are extremely scattered,
especially k21. Whereas channel count is underestimated, k12 is overestimated, and the two are
strongly correlated.
When several data sets are used together, the estimation improves considerably. Figure
[6.3.7] shows results for 1000 data sets simulated as before and optimized in groups of 10. The
scatter, as well as the bias, of the estimates is much smaller. There is strong negative correlation
between channel count and k12, and positive correlation between channel count and k21 and k23.
Obviously, the peak current can be explained either by faster rates leading to the open states or by
a higher number of channels, but not both. Optimizing even more data sets together gives
even better results (not shown), in terms of scatter and bias.
The next experiment determines the range of channel count within which the algorithm
can give reliable estimates. 1000 data sets were simulated as before and optimized in groups of
10. The channel count was 10, 100, 1000 and 10000, respectively. Figure [6.3.8] shows the
optimization results. The algorithm can optimize the channel count over a wide range, but gives
better estimates for larger values. 10 channels is certainly an extreme case that is better
approached with single channel methods. As expected, the scatter of the estimates is large for 10
channels, but becomes quite small for 1000 or 10000 channels.
6.3.3 Effect of conductance parameters
The tests were so far conducted using the true values of the conductance parameters. In
practice, these are known only approximately, and thus it is important to determine how sensitive
the algorithm is with respect to them. Figure [6.3.9] shows the effects the standard deviation of
the baseline noise and of the open class excess noise have on the kinetics and channel count
estimates. It appears that the baseline standard deviation has practically no effect (panel A).
Evidently, the noise of a record with 1000 channels has only a one-in-a-thousand contribution
from the baseline. The open class noise (defined as baseline noise plus open class excess noise) has a more
serious effect. The graph in Figure [6.3.9], panel B, indicates that underestimation of open class
noise is slightly better than overestimation.
The individual channel amplitude (excess conductance of open class) is tied with the
channel count parameter (results not shown). The same current can be the result of a larger
channel count and a smaller single channel amplitude or, alternatively, of a smaller channel count
and larger amplitude. The two situations can be distinguished, to some extent, from the variance.
That is, of course, if the other conductance parameters are known. Further work is necessary to
determine whether the optimizer can simultaneously estimate kinetics, channel count and (some
of) the conductance parameters.
6.3.4 Effect of channel count
Figure [6.3.10] shows the effect of the channel count parameter (not optimized) upon
kinetics estimation, when the single channel amplitude is known. Whatever the optimization
method, different values for channel count give different sets of rate constants. 10 data sets are
optimized together. Panel A shows the rate estimates using the log-likelihood cost function. For
approximately +/-30% variation in channel count, some rates span almost an order of magnitude.
In practice, the error in guessing the number of channels in the patch could be even larger.
Panel B shows the curvature of the log-likelihood function with respect to the channel count
value. The function is smooth and does have a single maximum, although the peak area is quite
shallow.
If the sum of squares cost function is used instead of the log-likelihood, exactly the same
fit (sum of square errors) can be obtained with different values of channel count and rate
constants. This implies that identical trajectories can be generated by different sets of rate
constants, given different values of the channel count. As mentioned previously, the relaxation of
the macroscopic current is a sum of weighted exponentials. Clearly, identical trajectories – up to a
scaling factor – can be generated by identical sets of eigenvalues, and different sets of
eigenvectors, that can be given by suitably different Q matrices. The scaling factor can be
accounted for by a larger or a smaller channel count.
Figure [6.3.11] is an example of two Q matrices that have identical eigenvalues but
different eigenvectors, and that, when associated with a channel count of 10000 (true value) and
8000, respectively, give identical fit lines (and thus identical sum of squares).
6.3.5 Conditioning-test experiments and estimation of starting probabilities
Rate constants can be estimated from macroscopic non-equilibrium data only, since
infinitely many different combinations of rate constants give the same equilibrium current. The
ion channels must therefore start with a non-equilibrium state occupancy and relax towards
equilibrium. Hence, the starting state probabilities are a critical parameter of the algorithm. In the
experiments described so far, the simulation started with all channels in the C1 state. The initial
state probabilities were P(C1) = 1, P(O) = 0, P(C2) = 0, while the equilibrium probabilities are 0.118, 0.294
and 0.588. Experimentally, the channels can be forced into a non-equilibrium distribution only if
one or several rates are dependent on external, controllable factors, such as ligand concentration,
voltage etc. This can be accomplished with a conditioning-test experiment, as presented in Figure
[6.3.12]. Let us assume that the model used so far has the k12 rate ligand dependent. The
macroscopic rate k12 is a function of a pre-exponential, constant factor k120 multiplied with the
ligand concentration:
k12 = k120 · [Ligand]
[6.3.1]
The pre-exponential factor k120 is optimized, together with the other rate constants and
channel count. A conditioning pulse applies the ligand at a given concentration, for long enough
to allow the channels to reach equilibrium. In Figure [6.3.12], the conditioning concentration is
0.1 (concentration units are ignored, as the rate constants and the concentrations scale
together), thus making the macroscopic rate k12 equal to 50. The conditioning equilibrium state
probabilities are 0.571, 0.143 and 0.286. After the conditioning pulse, a test pulse of a different
ligand concentration is applied. In Figure [6.3.12] the test ligand concentration is 1. The
macroscopic rate k12 becomes 500, and the equilibrium probabilities are 0.118, 0.294 and 0.588.
The parameters are estimated from the test part of the recorded data.
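The equilibrium occupancies quoted here can be reproduced from detailed balance, which holds for a linear three-state chain. A small sketch (hypothetical helper function; rates taken from Figure [6.3.12]):

```python
def equilibrium(k12, k21, k23, k32):
    """Equilibrium occupancies of a linear chain 1 <-> 2 <-> 3, from
    detailed balance: p1*k12 = p2*k21 and p2*k23 = p3*k32."""
    w1 = 1.0
    w2 = w1 * k12 / k21
    w3 = w2 * k23 / k32
    s = w1 + w2 + w3
    return (w1 / s, w2 / s, w3 / s)

k120 = 500.0
# Conditioning pulse: [Ligand] = 0.1, so k12 = 50.
print([round(p, 3) for p in equilibrium(k120 * 0.1, 200.0, 100.0, 50.0)])
# -> [0.571, 0.143, 0.286]
# Test pulse: [Ligand] = 1.0, so k12 = 500.
print([round(p, 3) for p in equilibrium(k120 * 1.0, 200.0, 100.0, 50.0)])
# -> [0.118, 0.294, 0.588]
```

For models with loops, detailed balance does not apply in general and the equilibrium vector must instead be obtained as the null vector of Qᵀ, as described in Chapter 3.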
Obviously, since the kinetics is not known, the initial state probabilities are not known
either. However, they can be calculated as the equilibrium probabilities of the conditioning
experimental conditions, using the formalism described in Chapter 3. Figure [6.3.13] shows that
the rates and channel count estimated from the conditioning-test experiment actually show less
scatter than those obtained with the starting probabilities fixed (and known). The results in the
figure demonstrate that the algorithm is capable of optimizing data from conditioning-test
experiments.
6.3.6 Estimating channel count
The last experiment shows how Mac can be used to estimate the channel count from
equilibrium data, regardless of kinetics (Figure [6.3.14]). The length of the simulated data is
10000 points (instead of 4000 so far), thus giving more data at equilibrium. The rate constants of
a simplified C-O model are estimated simultaneously with the channel count. The contribution to
the log-likelihood function of the first data points (black section in the figure) is ignored, and only
the equilibrium data (the red section) is used. The effect is twofold. First, the starting probabilities
can be chosen arbitrarily, as the fitted portion of the data would have reached equilibrium
anyway. Second, a simple, two-state model can be used, since the relaxation towards equilibrium
is contained in the non-equilibrium section of the data, which is ignored.
From 10 data sets optimized together, the algorithm finds any pair of rates and a channel
count that best explain the average and the variance of the current at equilibrium. As mentioned
before, the kinetics is undetermined with equilibrium data. The only limitation imposed on the
estimated rates is that they allow reaching equilibrium during the skipped part. Panels A and B in
Figure [6.3.14] show that the channel count estimate is accurate, with moderate scatter. Using
more than 10 data sets together is likely to reduce the scatter and increase the accuracy (although
the simplified model is not necessarily an unbiased estimator). In order to eliminate the
shortcoming of rate estimates varying wildly and causing numerical problems, I tested the method
with one rate fixed and the other one a free parameter. The only restriction is, again, that the fixed
rate in combination with a variable rate allows the channel to reach equilibrium in time. With the
forward rate thus fixed, the backward rate has to be estimated such as to explain the average
current, and it cannot vary too much. Panels C and D in Figure [6.3.14] show the results to be
quasi-identical, but with slightly less scatter.
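The estimator behind this experiment can be illustrated directly. At equilibrium, mean = NC·i·p and variance = NC·i²·p(1−p), so NC = mean²/(i·mean − var). The following is a sketch on synthetic Gaussian data (an approximation to the binomial channel fluctuations, with measurement noise ignored; in practice the baseline and excess noise variances would first be subtracted), not the Mac implementation:

```python
import random
import statistics

def estimate_channel_count(samples, i):
    """Noise-analysis estimate of channel count NC from equilibrium data:
    mean = NC*i*p and var = NC*i**2*p*(1-p) imply NC = mean**2/(i*mean - var)."""
    m = statistics.fmean(samples)
    v = statistics.variance(samples)
    return m * m / (i * m - v)

# Synthetic equilibrium record: 1000 channels, 1 pA unitary current,
# open probability 0.294; binomial fluctuations approximated as Gaussian.
random.seed(1)
nc, i, p = 1000, 1.0, 0.294
mean = nc * i * p
var = nc * i * i * p * (1 - p)
samples = [random.gauss(mean, var ** 0.5) for _ in range(20000)]
print(round(estimate_channel_count(samples, i)))   # close to 1000
```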
[Fitted three-state models with estimated rate constants (k12, k21, k23, k32):
A (log-likelihood, deterministic): 499.45, 199.41, 100.1, 49.97.
B (sum of squares, deterministic): 499.67, 200.85, 99.75, 49.8.
C (log-likelihood, averaged stochastic): 500.01, 200.0, 100.0, 50.0.
D (sum of squares, averaged stochastic): 500.24, 201.49, 99.64, 49.83.]
Figure [6.3.2]. Bias of kinetics estimation. Simulated data sampled at 50kHz, channel count 1000,
4000 points per trace. A, B Rate constants optimized from deterministic simulation. C, D Rate
constants optimized from average of 1000 data sets, generated as stochastic simulations. A, C The
likelihood function used as cost function. B, D The sum of squares used as cost function.
[A: one simulated data set with true and fitted lines. B: 10 overlapped data sets with true and
fitted lines. Scale bars: 50 pA, 5 ms.]
Figure [6.3.3]. Simultaneous rate constants and channel count estimation. Sampling 50kHz, channel count 1000, 4000 points per trace. A
One data set optimized alone. B 10 data sets optimized together. The 10 data sets are drawn on top of each other. Also overlapped are the
deterministic simulation with the true parameters and the fitted curves for each case.
[A: logarithmic scatter plot of K12, K21, K23, K32 estimates across 100 data sets. Histograms:
B k12, Avg 512.2, Std 34.71. C k21, Avg 211.7, Std 59.45. D k23, Avg 98.64, Std 13.29.
E k32, Avg 49.27, Std 6.306.]
Figure [6.3.4]. Rate constants estimation using the sum of squares. Stochastic simulation. Data sets
optimized separately. Sampling 50kHz, channel count 1000, 4000 points per trace. A Logarithmic
scatter plot of estimates. B k12. C k21. D k23. E k32.
[A: logarithmic scatter plot of K12, K21, K23, K32 estimates across 100 data sets. Histograms:
B k12, Avg 515.1, Std 34.34. C k21, Avg 215.2, Std 59.17. D k23, Avg 98.51, Std 13.53.
E k32, Avg 49.22, Std 6.562.]
Figure [6.3.5]. Rate constants estimation using the log-likelihood function. Stochastic simulation.
Data sets optimized separately. Sampling 50kHz, channel count 1000, 4000 points per trace. A
Logarithmic scatter plot of estimates. B k12. C k21. D k23. E k32.
[A: logarithmic scatter plot of channel count and rate estimates across 100 data sets. Histograms:
B channel count, Avg 720.9, Std 42.89. C k12, Avg 757.4, Std 89.63. D k21, Avg 13.46, Std 127.1.
E k23, Avg 69.45, Std 12.44. F k32, Avg 49.13, Std 6.324.]
Figure [6.3.6]. Simultaneous rate constants and channel count estimation. Each data set optimized
separately. Sampling 50kHz, channel count 1000, 4000 points per trace. A Logarithmic scatter plot of
estimates. B Channel count. C k12. D k21. E k23. F k32.
[A: logarithmic scatter plot of channel count and rate estimates across 100 groups. Histograms:
B channel count, Avg 926.2, Std 113.2. C k12, Avg 558.5, Std 74.36. D k21, Avg 178.4, Std 73.23.
E k23, Avg 91.5, Std 11.38. F k32, Avg 49.74, Std 1.894.]
Figure [6.3.7]. Simultaneous rate constants and channel count estimation. 10 data sets optimized
together. Sampling 50kHz, channel count 1000, 4000 points per trace. A Logarithmic scatter plot of
estimates. B Channel count. C k12. D k21. E k23. F k32.
[A-D: logarithmic scatter plots of channel count and rate estimates, for true channel counts of
10 (A), 100 (B), 1000 (C) and 10000 (D).]
Figure [6.3.8]. Simultaneous rate constants and channel count estimation. 10 data sets optimized together. Sampling 50 kHz, 4000 points per trace. A
Channel count 10. B Channel count 100. C Channel count 1000. D Channel count 10000.
[A: estimates vs assumed baseline noise std (0.0-1.0 pA). B: estimates vs assumed open class
noise std (0.20-0.50 pA).]
Figure [6.3.9]. Effect of conductance parameters. Different values of the baseline and of the open
class noise standard deviation. Simultaneous rate constants and channel count estimation. 10 data sets
optimized together. Sampling 50kHz, channel count 1000, 4000 points per trace. A Logarithmic plot
of estimates vs baseline noise. B Logarithmic plot of estimates vs open class noise. The open class
noise is defined as baseline plus open class excess noise. The dotted vertical lines mark the true
values.
[A: logarithmic plot of rate estimates vs assumed channel count (600-1600). B: log-likelihood vs
assumed channel count, a smooth curve with a single, shallow maximum (LL range approx.
-192000 to -180000).]
Figure [6.3.10]. Effect of channel count parameter on rate constants estimation. 10 data sets
optimized together. Sampling 50kHz, channel count 1000, 4000 points per trace. A Logarithmic plot
of rate estimates vs channel count. B Plot of the log-likelihood vs channel count.
A. 10000 channels (true value)

Q10000 = | -495.262    495.262      0       |
         |  193.385   -294.521    101.136   |
         |    0         49.9512   -49.9512  |

eigenvalues: 0.0, -116.88, -722.85
eigenvectors: (-0.6710, -0.6710, -0.6710), (-0.7951, -0.6074, 0.4533), (0.9082, -0.4174, 0.3098)

B. 8000 channels

Q8000 = | -619.078    619.078      0       |
        |   91.5716  -170.705     79.1337  |
        |    0         49.9512   -49.9512  |

eigenvalues: 0.0, -116.88, -722.85
eigenvectors: (-0.6878, -0.6878, -0.6878), (-0.8733, -0.7084, 0.5287), (0.9862, -0.1653, 0.0123)
Figure [6.3.11]. Identical trajectories can be accounted for by different Q matrices with different
channel counts. 10 traces of stochastically simulated data, with 10000 channels, were generated and
optimized together for rate constants, using the sum of squares cost function. Channel count was
fixed at either 10000 (panel A), or 8000 (panel B). The two obtained Q matrices have identical
eigenvalues but different eigenvectors. When coupled with the corresponding channel count (10000
or 8000), the two Q matrices give identical fit lines, thus identical sum of squares.
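That the two Q matrices in Figure [6.3.11] share eigenvalues can be verified without an eigensolver: two 3×3 matrices have the same eigenvalues exactly when the coefficients of their characteristic polynomials (trace, sum of principal 2×2 minors, determinant) agree. A sketch, with tolerances that allow for the rounding of the printed entries:

```python
Q10000 = [[-495.262,   495.262,    0.0],
          [ 193.385,  -294.521,  101.136],
          [   0.0,      49.9512, -49.9512]]

Q8000 = [[-619.078,   619.078,    0.0],
         [  91.5716, -170.705,   79.1337],
         [   0.0,      49.9512, -49.9512]]

def char_poly_invariants(Q):
    """Characteristic polynomial coefficients of a 3x3 matrix:
    trace, sum of principal 2x2 minors, and determinant."""
    (a, b, c), (d, e, f), (g, h, i) = Q
    tr = a + e + i
    m2 = (a * e - b * d) + (a * i - c * g) + (e * i - f * h)
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return tr, m2, det

t1, m1, d1 = char_poly_invariants(Q10000)
t2, m2, d2 = char_poly_invariants(Q8000)
print(t1, t2)   # equal traces (= sum of eigenvalues)
assert abs(t1 - t2) < 1e-6
assert abs(m1 - m2) < 1e-3 * abs(m1)
# Zero row sums force a zero eigenvalue, so both determinants are ~0.
assert abs(d1) < 1.0 and abs(d2) < 1e3
```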
[A: conditioning model, [Ligand] = 0.1 (k12 = 50.0, k21 = 200.0, k23 = 100.0, k32 = 50.0);
equilibrium state probabilities 0.571, 0.143, 0.286.
B: test model, [Ligand] = 1.0 (k12 = 500.0, k21 = 200.0, k23 = 100.0, k32 = 50.0); equilibrium
state probabilities 0.118, 0.294, 0.588.
C: stimulus waveform (conditioning pulse, then test pulse, over the current baseline) and 10
overlapped responses; scale bars 50 pA, 5 ms.
D: model with ligand-dependent rate k12 = k120 · [Ligand], k120 = 500.]
Figure [6.3.12]. Conditioning-test experiment. The test initial state probabilities are calculated as the equilibrium probabilities of the
conditioning experimental conditions. 10 data sets optimized together. Sampling 50kHz, channel count 1000, 4000 points per trace. A
Conditioning model and equilibrium probabilities. B Test model and equilibrium probabilities. C Stimulus waveform and response. 10 data
sets overlapped. D Macroscopic rate k12 is a function of the microscopic rate k120 and the ligand concentration. k120 is optimized.
[A: logarithmic scatter plot of channel count and rate estimates across 100 groups. Histograms:
B channel count, Avg 982.4, Std 48.88. C k12, Avg 522.9, Std 44.51. D k21, Avg 193, Std 17.72.
E k23, Avg 94.53, Std 13.49. F k32, Avg 48.24, Std 3.883.]
Figure [6.3.13]. Conditioning-Test experiment. The test initial state probabilities are calculated as
the equilibrium probabilities of the conditioning experimental conditions. 10 data sets optimized
together. Sampling 50kHz, channel count 1000, 4000 points per trace. A Logarithmic scatter plot of
estimates. B Channel count. C k12. D k21. E k23. F k32.
[A, C: scatter plots of channel count estimates across 100 groups of data sets. Histograms:
B channel count, Avg 941.2, Std 151.3 (both rates free); D channel count, Avg 943.3, Std 142.6
(k12 fixed). E: two-state model 1 <-> 2 with rates 100.0 and 50.0. F: example trace with skipped
(black) and fitted (red) sections; scale bars 100 pA, 10 ms.]
Figure [6.3.14]. Channel count estimation with a simplified model. 10 data sets optimized together. Sampling 50kHz, channel count 1000,
10000 points per trace. A, B Scatter plot and statistics of channel count estimated with model in E. C, D Scatter plot and statistics of
channel count estimated with model in E, with k12 fixed. E Simplified two-state model, starting in the open state. F The optimizer skips the
first 50 ms in each trace (black section) and fits the remainder (red section).
6.4 Discussion
The method proposed here has several advantages over existing methods, the most
important being that the algorithm directly estimates rate constants rather than time constants.
Furthermore, the algorithm can simultaneously estimate the kinetics and the number of channels,
which is a novel procedure. If linear relations between rate constants are known to exist, then
linear constraints can be imposed on estimation, and the number of free parameters is thus
reduced.
The kinetics estimation alone is not biased. However, when the channel count is
optimized too, the bias, although small, depends on how many records from the same patch are
available. A single record can result in up to 30% bias in the channel count, and similar bias in the rates.
With ten records, the bias is down to about 5%. Also, the bias depends on how many channels are
in the patch. With 10-100 channels, the bias is relatively large, whereas with channel count above
1000 the bias is very small. Alternatively, the channel count alone can be estimated with a
simplified model from equilibrium data.
The algorithm directly estimates rate constants. Their identifiability is, however,
determined by the time constants of the multi-exponential relaxation. If a component of the
multi-exponential is very fast relative to the sampling time, it will likely not be detected.
Similarly, if a component is too slow relative to the data length, it will not be detected either.
Furthermore, the weight of a component must be significant. Further work is required to clarify
these matters.
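As a concrete check for the simulated model (assuming the eigenvalues quoted in Figure [6.3.11]), the relaxation time constants τ = −1/λ can be compared with the 20 µs sampling interval and the 80 ms record (4000 points at 50 kHz):

```python
# Nonzero eigenvalues of the simulated model's Q matrix, in s^-1.
eigenvalues = [-116.88, -722.85]
dt_ms = 1000.0 / 50000          # sampling interval: 0.02 ms (20 us)
record_ms = 4000 * dt_ms        # record length: 80 ms
for lam in eigenvalues:
    tau_ms = -1000.0 / lam      # relaxation time constant in ms
    # Report tau, tau in sampling intervals, and time constants per record.
    print(round(tau_ms, 2), round(tau_ms / dt_ms, 1), round(record_ms / tau_ms, 1))
```

Both components here are tens to hundreds of sampling intervals long and complete several time constants within each record, which is consistent with the accurate rate estimates reported above.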
I have successfully tested Mac with other models (results not shown). Also, the
optimization was initialized with parameters other than the true values (results not shown). The
algorithm is relatively robust to initial values, but less so than single channel algorithms. Further
work is necessary to establish the behavior of the algorithm with experimental data, which was
not available to me at this time.
References
Aitkin, M. and I. Aitkin. 1996. A hybrid EM/Gauss-Newton algorithm for maximum likelihood
in mixture distributions. Statistics and Computing. 6:127-130.
Albertsen, A. and U. P. Hansen. 1994. Estimation of kinetic rate constants from multi-channel
recordings by a direct fit of the time series. Biophys. J. 67:1393-1403.
Balakrishnan, A. V. 1984. Kalman Filtering Theory. University Series in Modern Engineering.
Optimization Software, New York.
Ball, F. G., and J. A. Rice. 1989. A note on single-channel autocorrelation functions. Math.
Biosci. 97:17-26.
Ball, F. G., and J. A. Rice. 1992. Stochastic models for ion channels: introduction and
bibliography. Math. Biosci. 112:189-206.
Ball, F. G., and M. S. P. Sansom. 1988. Aggregated Markov processes incorporating time interval
omission. Adv. Appl. Prob. 20:546 –572.
Ball, F. G., and M. S. P. Sansom. 1989. Ion-channel gating mechanisms: model identification and
parameter estimation from single channel recordings. Proc. R. Soc. Lond. B. 236:385– 416.
Ball, F. G., G. F. Yeo, R. K. Milne, R. O. Edeson, B. W. Madsen, M. S. P. Sansom. 1993. Single
ion channel models incorporating aggregation and time interval omission. Biophys. J. 64:357-374.
Ball, F. G., Y. Cai, J. B. Kadane and A. O’Hagan. 1999. Bayesian inference for ion channel
gating mechanisms directly from single channel recordings, using Markov chain Monte Carlo.
Proc. R. Soc. Lond. A. 455:2879-2932.
Baum, L. E., T. Petrie, G. Soules, and N. Weiss. 1970. A maximization technique occurring in the
statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41:164 –171.
Baumgartner, W., H. Schindler and C. Romanin. 1998. Removing non-random artifacts from
patch clamp traces. J. Neurosci. Meth. 82:175-186.
Bayes, T. 1763. An Essay Towards Solving a Problem in the Doctrine of Chances. Phil. Trans.
Roy. Soc. 53:370-418.
Berger, J. 1985. Statistical Decision Theory and Bayesian Analysis (2nd edition), New York:
Springer-Verlag.
Billio, M., A. Monfort and C. P. Robert. 1998. Bayesian estimation of switching ARMA models.
Technical report, CREST, INSEE, Paris.
Blatz, A. L. and K. L. Magleby. 1986. Correcting single channel data for missed events. Biophys.
J. 49:967-980.
Block, S. M., L. S. B. Goldstein and B. J. Schnapp. 1990. Bead movement by single kinesin
molecules studied with optical tweezers. Nature 348:348-352.
Box, G and G. Tiao. 1973. Bayesian Inference in Statistical Analysis, Reading: Addison-Wesley.
Boyen, X. and D. Koller. 1998. Tractable inference for complex stochastic processes. Proc. of the
14th Ann. Conf. on Uncertainty in AI (UAI), Madison, Wisconsin, pp33-42.
Carter, C. K. and R. Kohn. 1996. Markov chain Monte Carlo in conditionally Gaussian state
space models. Australian Graduate School of Management, University of New South Wales.
Chung, S. H., J. B. Moore, L. G. Xia, L. S. Premkumar, and P. W. Gage. 1990. Characterization
of single channel currents using digital signal processing techniques based on hidden Markov
models. Proc. R. Soc. Lond. B. 329:265–285.
Chung, S. H., V. Krishnamurthy, and J. B. Moore. 1991. Adaptive processing techniques based
on hidden Markov models for characterizing very small channel currents buried in noise and
deterministic interferences. Proc. R. Soc. Lond. B. 334:357–384.
Colquhoun, D., A. G. Hawkes and K. Skrodzinski. 1996. Joint distributions of apparent open and
shut times of single-ion channels and maximum likelihood fitting of mechanisms. Phil. Trans. R.
Soc. Lond. A. 354:2555-2590.
Colquhoun, D., and A. G. Hawkes. 1977. Relaxation and fluctuations of membrane currents that
flow through drug-operated channels. Proc. R. Soc. Lond. B.199:231-262.
Colquhoun, D., and A. G. Hawkes. 1981. On the stochastic properties of single ion channels.
Proc. R. Soc. Lond. B. 211:205–235.
Colquhoun, D., and A. G. Hawkes. 1987. A note on correlations in single channel records. Proc.
R. Soc. Lond. B. 230:15–52.
Colquhoun, D., and A. G. Hawkes. 1995. The principles of the stochastic interpretation of ion
channel mechanisms. In Single channel recording (2nd ed). B. Sakmann and E. Neher, editors.
Plenum Press, New York. ???-???.
Colquhoun, D., and F. J. Sigworth. 1995a. A Q-Matrix cookbook. In Single channel recording
(2nd ed). B. Sakmann and E. Neher, editors. Plenum Press, New York. 589–633.
Colquhoun, D., and F. J. Sigworth. 1995b. Fitting and statistical analysis of single-channel
records. In Single channel recording (2nd ed). B. Sakmann and E. Neher, editors. Plenum Press,
New York. 483-585.
Cox, D. R., and H. D. Miller. 1965. The Theory of Stochastic Processes. Methuen, London.
Crouzy, S. C. and F. J. Sigworth. 1990. Yet another approach to the dwell-time omission problem
of single-channel analysis. Biophys. J. 58:731-743.
De Gunst, M. C. M., H. R. Künsch and J. G. Schouten. 2001. Statistical analysis of ion channel
data using hidden Markov models with correlated state-dependent noise and filtering. J. Amer.
Stat. Assoc. 96:805-815.
Dempster, A. P., N. M. Laird and D. B. Rubin. 1977. Maximum likelihood from incomplete data
via the EM algorithm (with discussion). J. R. Stat. Soc. B. 39:1-38.
Doucet, A. and C. Andrieu. 1999. Iterative Algorithms for Optimal State Estimation of Jump
Markov Linear Systems. Proc. Conf. IEEE ICASSP.
Draber, S. and R. Schultze. 1994. Detection of jumps in single-channel data containing
subconductance levels. Biophys. J. 67:1404-1413.
Elliott, R. J., L. Aggoun and J. B. Moore. 1995. Hidden Markov models: Estimation and control.
New York: Springer-Verlag.
Fessler, J. A. and A. O. Hero. 1994. Space-Alternating Generalized Expectation-Maximization
Algorithm. IEEE Trans. Sign. Proc. 42(10):2664-2677.
Fletcher, R. 1981. Practical Methods of Optimization. John Wiley & Sons, Chichester.
Fletcher, R. and M. J. D. Powell. 1963. A rapidly convergent descent method for minimization.
Computing Journal. 2:163–168.
Forney, G. D. 1973. The Viterbi algorithm. Proc. IEEE. 61:268-278.
Fredkin, R. F. and J. A. Rice. 1986. On aggregated Markov processes. J. Appl. Prob. 23:208-214.
Frey, B. J. and D. J. MacKay. 1997. A revolution: Belief propagation in graphs with cycles. NIPS
10.
Gauvain, J.-L. and C.-H. Lee. 1992. MAP Estimation of Continuous Density HMM: Theory and
Applications. Proc. DARPA Speech & Nat. Lang., Morgan Kaufmann.
Gelman, A., J. B. Carlin, H. S. Stern and D. B. Rubin. 1995. Bayesian Data Analysis, London:
Chapman and Hall.
Ghahramani, Z. 2001. An Introduction to Hidden Markov Models and Bayesian Networks.
International Journal of Pattern Recognition and Artificial Intelligence. 15(1):9-42.
Ghahramani, Z. and G. Hinton. 1996. Switching state-space models. Technical Report CRG-TR-96-3, Dept. Comp. Sci., Univ. Toronto.
Goldstein, L. S. B. and A. V. Philp. 1999. The Road less traveled: Emerging principles of kinesin
motor utilization. Ann. Rev. Cell Dev. Biol. 15:141-183.
Hawkes, A. G., A. Jalali and D. Colquhoun. 1990. The distribution of the apparent open times
and shut times in a single channel record when brief events cannot be detected. Phil. Trans. R.
Soc. Lond. A. 332:511–538.
Hille, B. 1992. Ionic Channels of Excitable Membranes. Sinauer Associates, Inc., Sunderland, 2nd
edition.
Horn, R., and K. Lange. 1983. Estimating kinetic constants from single channel data. Biophys. J.
43:207–223.
Howard, J., A. J. Hudspeth and R. D. Vale. 1989. Movement of microtubules by single kinesin
molecules. Nature. 342:154-158.
Jackson, M. B. 1997. Inversion of Markov processes to determine rate constants from single-channel data. Biophys. J. 73:1382-1394.
Jamshidian, M. and R. I. Jennrich. 1993. Conjugate gradient acceleration of the EM algorithm. J.
Am. Stat. Assoc. 88:221-228.
Jeney, S., F. Sachs, A. Pralle, E. L. Florin and H. J. K. Horber. 2000. Statistical analysis of
kinesin kinetics by applying methods from single channel recordings. Biophys. J. 78:717Pos.
Jensen, F. V. 1996. Introduction to Bayesian Networks. Springer-Verlag, New York.
Jordan, M. I. (ed.) 1998. Learning in Graphical Models, Cambridge: MIT Press.
Juang, B. H. and L. R. Rabiner. 1991. Hidden Markov models for speech recognition.
Technometrics. 33:251-272.
Kalman, R. E. 1960. A new approach to linear filtering and prediction problems. Trans. Am. Soc.
Mech. Eng., Series D, Journal of Basic Engineering. 82:35–45.
Kalman, R. E. and R. S. Bucy. 1961. New results in linear filtering and prediction theory. Trans.
Am. Soc. Mech. Eng., Series D, Journal of Basic Engineering. 83:95–108.
Keleshian, A. M., G. F. Yeo, R. O. Edeson and B. W. Madsen. 1994. Superposition properties of
interacting ion channels. Biophy. J. 67:634-640.
Kemeny, J. G., J. L. Snell and A. W. Knapp. 1966. Denumerable Markov Chains. D. Van Nostrand
Company.
Kim, C.-J. 1994. Dynamic linear models with Markov-switching. J. of Econometrics. 60:1-22.
Klein, S., J. Timmer and J Honerkamp. 1997. Analysis of multichannel patch clamp recordings
by hidden Markov models. Biometrics. 53:870-884.
Koller, D., U. Lerner and D. Angelov. 1999. A general algorithm for approximate inference and
its application to hybrid Bayes nets. Uncertainty in AI. pp324-333.
Lange, K. 1995. A quasi-Newton acceleration of the EM algorithm. Statistica Sinica. 5:1-18.
Lauritzen, S. L. 1992. Propagation of probabilities, means and variances in mixed graphical
association models. J American Statistical Association. 87:1098–1108.
Lauritzen, S. L. 1996. Graphical models, London: Oxford University Press.
Little, R. J. A. and D. B. Rubin. 1987. Statistical analysis with missing data. NewYork: Wiley.
Liu, C. and D. B. Rubin. 1994. The ECME algorithm: A simple extension of EM and ECM with
faster monotone convergence. Biometrika. 81:633-648.
Liu, C., D. B. Rubin and Y. N. Wu. 1998. Parameter expansion to accelerate EM: The PX-EM
algorithm. Biometrika. 85:755-70.
Logothetis, A. and V. Krishnamurthy. 1999. Expectation Maximization Algorithms for MAP
Estimation of Jump Markov Linear Systems. IEEE Trans. Sign. Proc. 47(8):2139-2156.
Marschner, I. C. 2001. On stochastic versions of the EM algorithm. Biometrika. 88:281-286.
McLachlan, G. J. and T. Krishnan. 1997. The EM Algorithm and Extensions. Wiley series in
probability and statistics. John Wiley & Sons, Inc.
Meilijson, I. 1989. A fast improvement of the EM algorithm on its own terms. J. Royal Stat. Soc.
B. 51(1):127-138.
Michalek, S. and J. Timmer. 1999. Estimating rate constants in hidden Markov models by the
EM-algorithm. IEEE Trans. Sign. Proc. 47:226-228.
Minka, T. P. 2001. A family of algorithms for approximate Bayesian inference. PhD thesis.
Massachusetts Institute of Technology. www.media.mit.edu/~tpminka/.
Moler, C. B. and C.F. Van Loan. 1978. Nineteen dubious ways to compute the exponential of a
matrix. SIAM Rev. 20:801-836.
Murphy, K., Y. Weiss and M. Jordan. 1999. Loopy-belief propagation for approximate inference:
An empirical study. Uncertainty in AI.
Najfeld, I. and T. F. Havel. 1995. Derivatives of the matrix exponential and their computation.
Advances in Applied Mathematics. 16:321-375.
Okada, Y. and N. Hirokawa. 1999. A Processive Single-Headed Motor: Kinesin Superfamily
Protein KIF1A. Science. 283:1152-1157.
Opper, M. and O. Winther. 1999. A Bayesian approach to on-line learning. On-Line Learning in
Neural Networks. Cambridge University Press.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 1992. Numerical recipes in
C. Cambridge University Press, Cambridge.
Qin, F., A. Auerbach, and F. Sachs. 1996. Estimating single channel kinetic parameters from
idealized patch-clamp data containing missed events. Biophys. J. 70:264 –280.
Qin, F., A. Auerbach, and F. Sachs. 1997. Maximum likelihood estimation of aggregated Markov
processes. Proc. R. Soc. Lond. B. 264:375–383.
Qin, F., A. Auerbach, and F. Sachs. 2000a. A direct optimization approach to hidden Markov
modeling for signal channel kinetics. Biophys. J. 79:1915–1927.
Qin, F., A. Auerbach, and F. Sachs. 2000b. Hidden Markov Modeling for Single Channel
Kinetics with Filtering and Correlated Noise. Biophys. J. 79:1928–1944.
Rabiner, L. R. 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition. Proceedings of the IEEE. 77(2):257-286.
Rabiner, L. R. and B. H. Juang. 1986. An introduction to hidden Markov models. IEEE Ac. Sp.
Sign. Proc. January, 4-16.
Rabiner, L. R., B. H. Juang, S. E. Levinson and M. M. Sondhi. 1986. Recognition of isolated
digits using hidden Markov models with continuous mixture densities. AT&T Tech. J.
64(6):1211–1222.
Rauch, H. E. 1963. Solutions to the linear smoothing problem. IEEE Transactions on Automatic
Control. 8:371–372.
Rauch, H. E., F. Tung and C. T. Striebel. 1965. Maximum likelihood estimates of linear dynamic
systems. AIAA Journal. 3(8):1445–1450.
Rice, S., A. W. Lin, D. Safer, C. L. Hart, N. Naber, B. O. Carragher, S. M. Cain, E. Pechatnikova,
E. M. Wilson-Kubalek, M. Whittaker, E. Pate, R. Cooke, E. W. Taylor, R. A. Milligan and R. D.
Vale. 1999. A structural change in the kinesin motor protein that drives motility. Nature.
402(6763):778-784.
Rosales, R., J. A. Stark, W. J. Fitzgerald and S. B. Hladky. 2001. Bayesian Restoration of Ion
Channel Records using Hidden Markov Models. Biophys. J. 80:1088-1103.
Roux, B., and R. Sauve. 1985. A general solution to the time interval omission problem applied to
single channel analysis. Biophys. J. 48:149 –158.
Roweis, S and Z. Ghahramani. 1999. A Unifying Review of Linear Gaussian Models. Neural
Computation. 11:305–345.
Sachs, F., J. Neil and N. Barkakati. 1982. The automated analysis of data from single ionic
channels. Pflügers Arch. 395:331-340.
Sakmann, B. and E. Neher. 1992. The patch clamp technique. Scientific American, March.
Sakmann, B. and E. Neher. 1995. Single channel recording (2nd ed), Plenum Press, New York.
Shachter, R. 1990. A linear approximation method for probabilistic inference. Uncertainty in AI.
Shumway, R. and D. Stoffer. 1991. Dynamic linear models with switching. J. of the Am. Stat.
Assoc. 86:763-769.
Silberberg, S. D., A. Lagrutta, J. P. Adelman and K. L. Magleby. 1996. Wanderlust kinetics and
variable Ca2+ sensitivity of dSlo, a large conductance Ca2+-activated K+ channel, expressed in
oocytes. Biophys. J. 70:2640-2651.
Sorensen, H. W. (ed). 1985. Kalman filtering: theory and application. IEEE.
Svoboda, K., C. F. Schmidt, B. J. Schnapp and S. M. Block. 1993. Direct observation of kinesin
stepping by optical trapping interferometry. Nature 365:721-727.
Swope, S. L., S. I. Moss, L. A. Raymond and R. L. Huganir. 1999. Regulation of ligand-gated ion
channels by protein phosphorylation. Adv. in Second Messenger & Phosphoprotein Res. 33:49-78.
Tugnait, J. K. 1982. Adaptive estimation and identification for discrete systems with Markov
jump parameters. IEEE Trans. Automat. Contr. AC-27:1054–1065.
Vale, R. D. and R. A. Milligan. 2000. The Way Things Move: Looking Under the Hood of
Molecular Motor Proteins. Science. 288:88-95.
Venkataramanan, L., J. L. Walsh, R. Kuc, and F. J. Sigworth. 1998. Identification of hidden
Markov models for ion channel currents. I. Colored background noise. IEEE Trans. Signal
Processing. 46:1901–1915.
Venkataramanan, L., R. Kuc, and F. J. Sigworth. 1998. Identification of hidden Markov models
for ion channel currents. II. State-dependent excess noise. IEEE Trans. Signal Processing.
46:1916 –1929.
Venkataramanan, L., R. Kuc, and F. J. Sigworth. 2000. Identification of Hidden Markov Models
for Ion Channel Currents. III. Bandlimited, Sampled Data. IEEE Trans. Signal Processing.
49:376-385.
Visscher, K., M. J. Schnitzer and S. M. Block. 1999. Single kinesin molecules studied with a
molecular force clamp. Nature 400:184-189.
Viterbi, A. J. 1967. Error bounds for convolutional codes and an asymptotically optimal decoding
algorithm. IEEE Trans. Informat. Theory. IT-13:260-269.
Wagner, M., S. Michalek and J. Timmer. 1999. Estimating transition rates in aggregated Markov
models of ion-channel gating with loops and with nearly equal dwell times. Proc. R. Soc. Lond.
B. 266:1919-1926.
West, M. and J. Harrison. 1997. Bayesian Forecasting and Dynamic Models (2nd Edn), New York:
Springer-Verlag.
Whittaker, J. 1990. Graphical models in applied multivariate statistics. Chichester, England:
Wiley.
Wu, C. F. J. 1983. On the convergence properties of the EM algorithm. The Annals of Statistics.
11(1):95-103.
Yedidia, J. S., W. T. Freeman and Y. Weiss. 2000. Generalized belief propagation (Technical
Report TR2000-26). MERL. www.merl.com/reports/TR2000-26/index.html.
Yeo, G. F., R. K. Milne, R. O. Edeson and B. W. Madsen. 1988. Superposition properties of
independent ion channels. Proc. R. Soc. Lond. B. 238:255-270.