Quantum Parameter Estimation∗ Emili Bagan Grup d’Informació Quàntica (GIQ) Departament de Fı́sica Universitat Autònoma de Barcelona November 2021 1 Classical Estimation 1.1 Introduction We will assume that the probability mass function PMF (probability density function PDF) of a discrete (continuous) random variable, X, depends on a parameter, θ, or a set of parameters (parameter vector), θ = (θ1 , θ2 , . . . , θp ). These notes are mainly devoted to the single parameter case. Recall: • PMF: pX (xk ) such that pX (xk ) := Pr(X = xk ) = pk , for all values of k. Z x • PDF: fX (x) such that fX (x0 )dx0 = Pr(X ≤ x) := FX (x) −∞ The function FX (x) is called the distribution function or the cumulative distribution function CDF of the random variable X. Note that FX0 (x) = fX (x). Unless it is necessary for clarity, we will often drop the subscript X and simply write f (x), F (x), and so on. In these notes we (mostly) focus on the continuous case, but one can check that the results we will derive hold also random variables by replacing the PDFs, f (x; θ), by the PMF, R for discrete P p(x, θ) and · dx → x . ∗ Master in Quantum Science and Technology. Quantum Theory: Quantum Statistical Inference. 1 1 CLASSICAL ESTIMATION 1.1 Introduction Example 1.1. We say that X is normally distributed, X ∼ N(µ, σ 2 ), if its PDF is ( 2 ) 1 x−µ 1 exp − , f (x; θ) = f (x; µ, σ 2 ) = √ 2 2 σ 2πσ where µ = E(X) is the mean and σ 2 = var(X) = E[(X − µ)2 ] = E(X 2 ) − [E(X)]2 is the variance. We may view the mean and the variance as parameters: θ = (µ, σ 2 ). Recall that E stands for expectation value: Z E[g(X)] = ∞ g(x)fX (x)dx. −∞ Often throughout these notes, we will use boldface to denote a collection or a vector of random variables, X = (X1 , . . . , Xn ). Likewise x will be also used to denote the corresponding vector of outcomes/observation, x = (x1 , . . . , xn ). The joint PDF will be denoted by fX (x). Hence, e.g., the expectation value of g(X) will be Z E[g(X)] = g(x)fX (x)dn x, where dn x = dx1 dx2 · · · dxn . The aim of (parameter) estimation is to accurately determine the value of θ from observations, i.e., from a set x of outcomes or realizations of the random variable X. We will refer to this set as sample. The relevance of estimation for quantum information should be obvious. A quantum state ρ is just a collection of parameters (its independent entries ρab , for instance) that describe our acknowledge about a system. To emphasize this fact, we could write ρθ instead of just ρ. In order to have a precise mathematical description of the state, an accurate estimation of these parameters θ is required. We can only perform measurements on the system to reveal this information. According to quantum mechanics, their outcomes are random variables, whose probability distributions are given by the Born rule, p(x; θ) = tr(ρθ Ex ), where {Ex } is a collection of operators defining a positive operator-valued measure (POVM) and characterizing the measurement. The classical estimation toolbox, which we are about to introduce, provides us with the means to optimally extract θ from our measurement data. Note that the distribution p(x; θ) also depends on our choice of measurement. But what is the best measurement we can perform on a system to estimate θ? The aim of quantum estimation is to provide means to answer this question. In a more complex scenario, we may wish to characterize the action of a channel Cθ . 
To do so, we may feed the channel with a system prepared in a fiducial or reference state, ρ0 , and perform a measurement on the output state ρθ = Cθ (ρ0 ). For a fixed ρ0 and a fixed measurement, classical estimation will provide us with the tools to obtain the most precise determination of the unknown θ. E. Bagan 2 1 CLASSICAL ESTIMATION 1.2 Frequentist approach There are several approaches to classical estimation. We will focus on two, which we will refer to as frequentist approach and Bayesian approach. 1.2 Frequentist approach Within the frequentist approach the estimated parameter is assumed to be a deterministic variable with a fixed value. Definition 1.1 Given a sample of random variables (possible outcomes) X = (X1 , . . . , Xn ), a statistic Y is a known function of the sample Y = f (X) . When the statistic is used to estimate the value of a parameter (vector) θ then it is also called a point estimate, or estimator and it is usually denoted by θ̂. Note that a statistic is a random variable itself. In these notes we will always assume that Xi are independent and identically distributed (commonly abbreviated i.i.d.). Example 1.2. It is well known that the sample mean (average), n X̄ := 1X Xi , n i=1 is a “good” estimator of µ. Likewise, n 2 1 X S := Xi − X̄ n − 1 i=1 2 is a “good” estimator of the variance. Both X̄ and S 2 are statistics, since they just depend on X1 , X2 , . . . , Xn . To give a precise meaning to “good” above, we need to discuss some properties of the estimators. Definition 1.2 (Bias) The bias of an estimator θ̂ of a parameter θ is defined as Bias(θ̂) = E(θ̂ − θ). If Bias(θ̂) = 0 then we say that the estimator is unbiased. So, if an estimator is unbiased, in average, it does give the right estimate, which, of course is a desirable property. E. Bagan 3 1 CLASSICAL ESTIMATION Example 1.3. since 1.2 Frequentist approach The sample mean X̄ is an unbiased estimator of the distribution mean µ, n E(X̄) = 1X E(Xi ) = E(X) = µ n i=1 by (obvious) linearity of E. Likewise, S 2 is unbiased. Exercise 1.1. Show that S 2 is an unbiased estimator of var(X). We must show that E(S 2 ) = var(X). n X n n n X X X 2 2 (Xi − X̄) = (Xi − µ) − (X̄ − µ) − 2 (Xi − X̄)(X̄ − µ) 2 i=1 = = i=1 n X i=1 i=1 (Xi − µ)2 − n(X̄ − µ)2 − 2(X̄ − µ) i=1 n X n X ! Xi − nX̄ i=1 (Xi − µ)2 − n(X̄ − µ)2 . i=1 We next take expectation values and recall that µ = E(X) = E(X̄): (n − 1) E(S 2 ) = n var(X) − n var(X̄) = (n − 1) var(X), where we have used that n X 1 Xi var(X̄) = 2 var n i=1 ! = 1 var(X). n The estimates obtained from our samples will be always subject to errors, so we need to quantify them in a suitable way. Definition 1.3 (Mean square error) The mean square error of an estimator θ̂ is h i MSE(θ̂) = E (θ̂ − θ)2 . One can immediately check that MSE(θ̂) = var(θ̂) + Bias(θ̂)2 . (1.1) Exercise 1.2. Check that Eq. (1.1) holds. E. Bagan 4 1 CLASSICAL ESTIMATION 1.2 Frequentist approach h i 2 2 E (θ̂ − θ) = E θ̂ − E(θ̂) + E(θ̂) − θ h i2 h i h i = var(θ̂) + E(θ̂) − θ + 2 E(θ̂) − θ E θ̂ − E(θ̂) h i2 = var(θ̂) + E(θ̂ − θ) + 2 E(θ̂ − θ) × 0 = var(θ̂) + Bias(θ̂)2 . A good estimator is one that has small MSE. If it is unbiased, this is tantamount to having small variance. Notice that var(θ̂) can be determined from the data, wheres MSE(θ̂) cannot, since in practical applications θ is, of course, unknown. Often the goodness of an estimator (or rather of a sequence of estimators) improves as the sample size, n, increases. The next definition captures this idea. 
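Before stating it, the unbiasedness claims of Example 1.3 and Exercise 1.1, as well as the decomposition (1.1), are easy to confirm by simulation. The following is a minimal sketch in Python with numpy (the normal model and all numerical values are illustrative choices, not part of the notes); it also applies Eq. (1.1) to the deliberately biased estimator (n − 1)S²/n.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 2.0, 4.0, 10, 200_000      # illustrative values

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
xbar = x.mean(axis=1)                  # sample mean X-bar
s2 = x.var(axis=1, ddof=1)             # S^2, with the 1/(n-1) normalization

print("E[X-bar] ~", xbar.mean(), " (mu =", mu, ")")           # unbiased
print("E[S^2]   ~", s2.mean(), " (sigma^2 =", sigma2, ")")    # unbiased

# Eq. (1.1) applied to the biased estimator (n-1)/n * S^2:
biased = x.var(axis=1, ddof=0)
mse = np.mean((biased - sigma2) ** 2)
print("MSE ~", mse, "  var + Bias^2 ~", biased.var() + (biased.mean() - sigma2) ** 2)
```

Both estimators average to the true values, and the two numbers in the last line coincide, as Eq. (1.1) requires.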
Definition 1.4 (Consistency) A sequence of statistics (Yn , n ∈ N) is said to be a consistent estimate of a parameter θ if for every > 0 lim Pr (|Yn − θ| ≤ ) = 1. n→∞ Equivalently, we may write the condition as lim Yn = θ n→∞ (in probability). The sequence {Yn ; n ∈ N} could, according to our notation, be denoted by θ̂n , and we will often (but not always) do so, particularly if we want to emphasize that each yn is an estimate of θ and also indicate that it is based on a sample of size n. Exercise 1.3. Show that if Yn is a consistent estimator of θ with E(Yn2 ) < ∞, then lim E(Yn − θ) = 0. n→∞ We first note that there exists a finite C such that for all n p E (Yn − θ)2 ≤ E(Yn2 ) + θ2 + 2|θ|| E(Yn )| ≤ E(Yn2 ) + θ2 + 2|θ| E(Yn2 ) h i2 p = |θ| + E(Yn2 ) ≤ C, where in the second inequality we have used that p E(|Yn |) ≤ E(Yn2 ) E. Bagan 5 1 CLASSICAL ESTIMATION 1.2 Frequentist approach (as follows immediately from Jensen inequality). We next use Cauchy-Schwarz inequality, [E(|XY |)]2 ≤ E X 2 E Y 2 , to get that, for any > 0, [E(|Yn − θ|)1{|Yn − θ| ≥ /2}]2 ≤ E |Yn − θ|2 E (1{|Yn − θ| ≥ /2}) = E (Yn − θ)2 Pr (|Yn − θ| ≥ /2) ≤ C Pr (|Yn − θ| ≥ /2) , where 1{· · · } is the indicator function. With this, E (|Yn − θ|) = E [|Yn − θ| 1 {|Yn − θ| < /2}] + E [|Yn − θ| 1 {|Yn − θ| ≥ /2}] ≤ /2 + C Pr (|Yn − θ| ≥ /2) . (1.2) But lim Yn = θ n→∞ (in probability) ⇒ lim Pr (|Yn − θ| ≥ /2) = 0. n→∞ This implies that there exists N ∈ N such that for any > 0, Pr (|Yn − θ/2| ≥ ) < /(2C) provided n > N . Hence, from Eq. (1.2) we have |E(Yn − θ)| ≤ E (|Yn − θ|) < +C = , 2 2C which means that lim E(Yn − θ) = 0. n→∞ It is interesting to note that the claim of the exercise ceases to be true if we drop the condition E(Yn2 ) < ∞. Exercise 1.4. Consider the sequence of random variables Yn with probability distribution n−1 if y = θ n 1 pYn (y; θ) = if y = θ + n n 0 otherwise 1. Show that limn→∞ Pr (|Yn − θ| ≤ ) = 1, but limn→∞ E(Yn − θ) 6= 0. 2. Modify the distribution slightly to show that limn→∞ Pr (|Yn − θ| ≤ ) = 1 does not necessarily imply limn→∞ var(Yn ) = 0, even if limn→∞ E(Yn − θ) = 0 and E(Yn2 ) < ∞. E. Bagan 6 1 CLASSICAL ESTIMATION 1.2 Frequentist approach (a) If 0 < < 1 Pr (|Yn − θ| ≤ ) = Pr (Yn = θ) = pYn (θ; θ) = n−1 , n hence limn→∞ Pr (|Yn − θ| ≤ ) = 1. However E (Yn − θ) = (θ − θ) pYn (θ; θ) + (θ + n − θ) pYn (θ + n; θ) 1 = n · = 1 6= 0. n (b) Consider 2 n −1 if y = θ 2 n 1 pYn (y; θ) = if y = θ + n n2 0 otherwise Then E (Yn − θ) = (θ − θ) pYn (θ; θ) + (θ + n − θ) pYn (θ + n; θ) 1 = n · 2 → 0. n 2 θ n −1 2 1 2 E Yn2 = θ2 + (θ + n) = θ + 2 + 1 → θ2 + 1 2 2 n n n Note that as a consequence of the result of Exercise 1.3 any consistent estimator is asymptotically unbiased, in the sense that limn→∞ E(θ̂n ) = θ. Hence, although the biased estimators may lead to improved precision, they may be ignored in the n → ∞ limit for which the frequentist approach is really designed. In dealing with consistency it might be useful to introduce the famous law of large numbers as follows. Theorem 1.5 (Law of Large Numbers) Suppose that {Xi , i ≥ 0} are i.i.d. with finite mean µ and variance σ 2 . Then n 1X µ̂n := Xi n i=1 is a consistent estimator of the mean. In other words for all > 0 Pr (|µ̂n − µ| > ) → 0 as n → ∞. The Law of Large Numbers follows from Chebyshev’s Inequality. E. Bagan 7 1 CLASSICAL ESTIMATION 1.2 Frequentist approach Proposition 1.6 (Chebyshev’s Inequality) Let Y be a random variable with finite mean µ and variance σ 2 . Then for any k > 0, 1 Pr (|Y − µ| ≥ kσ) ≤ 2 . 
k This proposition in turn follows from Markov’s Inequality: Theorem 1.7 (Markov’s Inequality) Let X > 0 be a random variable, such that E(X) < ∞ and c > 0 a constant. Then E(X) . Pr(X > c) ≤ c Exercise 1.5. Prove Markov’s Inequality and the following slightly more general statement: If X > 0, k ≥ 1 and E(X k ) < ∞, then E(X k ) P(X > c) = P X k > ck ≤ . ck Z ∞ Z ∞ fX (x)dx ≤ Pr(X > c) = c c xk fX (x)dx ≤ ck Z 0 ∞ E(X k ) xk fX (x)dx = ck ck Exercise 1.6. Prove Chebyshev’s Inequality. Exercise 1.7. By using Chebyshev’s inequality, show that if limn→∞ var(θ̂n ) = 0 then asymptotic unbiasedness implies consistency. Exercise 1.8. Prove the Law of Large Numbers. In addition to consistency and unbiasedness, a good estimator should have a small mean square error which for unbiased estimators is just the variance. This motivates the following Definition 1.8 (Relative Efficiency) Given two estimators θ̂1 and θ̂2 of a parameter θ, the relative efficiency of θ̂1 relative to θ̂2 , is denoted by eff(θ̂1 , θ̂2 ) and is defined as MSE θ̂2 . eff θ̂1 , θ̂2 = MSE θ̂1 For unbiased estimators it is equivalent to eff θ̂1 , θ̂2 E. Bagan var θ̂2 . = var θ̂1 8 1 CLASSICAL ESTIMATION 1.2 Frequentist approach Exercise 1.9. Let X1 , . . . , Xn ∼ U(0, M ) the uniform distribution on (0, M ). Consider he estimators n 1X Mn θ̂1 := X̄ = Xi , θ̂2 := cn , n i=1 2 where Mn := max{Xi }ni=1 . and cn is some judiciously chosen n-dependent normalization coefficient. Give cn so that θ̂2 is unbiased. Then compute eff(θ̂1 , θ̂2 ). At this point, the question arises as to how to construct consistent unbiased and effective estimators. Before we attempt to answer this question we still need to give a few definitions. Definition 1.9 (Likelihood function for discrete PD) Suppose X = (X1 , . . . , Xn ) are discrete random variables whose distribution depends on a parameter(vector) θ, and have probability mass function p (x; θ) := Pr (X = x; θ) , where x = (x1 , . . . , xn ) are sample observations. The likelihood of the parameter(vector) θ given the observations x is denoted by L (θ | x) is defined to be L (θ | x) := p (x; θ) that is the joint probability mass function for the parameter(vector) θ. Definition 1.10 (Likelihood function for continuous PD) Suppose X = (X1 , . . . , Xn ) are jointly continuous random variables whose distribution depends on a parameter(vector) θ, and have PDF f (y; θ). The likelihood of the parameter(vector) θ given the observations x is denoted by L (θ | x) is defined to be L (θ | x) := f (x; θ) that is the joint density for the parameter(vector) θ evaluated at the observations. Definition 1.11 (Log-likelihood function) The log-likelihood function of the parameter (vector) θ given the observations x is defined as l(θ) = l (θ | x) = log L (θ | x) , where L (θ | x) is the corresponding likelihood function. If X1 , . . . , Xn are i.i.d. and Xi ∼ f (· ; θ), then L (θ | x) = n Y f (xi ; θ) ; i=1 l (θ | x) = n X log [f (xi ; θ)] . i=1 and similarly for discrete random variables. We think of the likelihood function as a function of θ, and we treat the observations as fixed. Sometimes we will drop the observations and simply write L(θ) and l(θ). E. Bagan 9 1 CLASSICAL ESTIMATION 1.2 Frequentist approach Definition 1.12 (Maximum Likelihood Estimator) Suppose that a sample x = (x1 , . . . , xn ) has likelihood function L(θ) = L (θ | x) depending on a parameter(vector) θ. Then a maximum likelihood estimator (MLE) θ̂ MLE is the value of the parameters that maximizes L(θ), if a maximum exists. 
In other words θ̂ MLE = arg max L(θ | x) = arg max l(θ | x) θ θ The maximum of L(θ) may not exist, in which case the MLE cannot be constructed. The maximum, if it exists, may not be unique, in which case we will obtain several MLEs. Note that these are not the values of the parameters that are most likely, given the data. To start with, θ in not a random variable in the frequentist approach we are discussing! Theorem 1.13 (Invariance of MLE) Suppose that θ̂ is the MLE for a parameter θ and let t(·) be a strictly monotone function of θ. Then (t(θ))MLE = t(θ̂), i.e., the MLE of t(θ) is t(θ̂). Exercise 1.10. Consider the exponential distribution f (x | λ) = λe−λx . Suppose we take a sample of size n. Show that the MLE of λ is λ̂ = 1/X̄. The likelihood, is L (λ | x1 , . . . , xn ) = n Y n X −λxi n λe = λ exp −λ xi i=1 ! = λn exp(−nλx̄) i=1 Then l = n log λ − nλx̄ and so n d l = − nx̄ dλ λ Thus L has a unique maximum at λ̂ = 1/x̄ and this is therefore the maximum likelihood estimator of λ. Exercise 1.11. Find the maximum likelihood estimates of the parameters of the normal distribution. Note that the MLE of σ 2 is a biased estimator. n n 1 X l = − log(2π) − n log σ − 2 (xi − µ)2 2 2σ i=1 E. Bagan 10 1 CLASSICAL ESTIMATION 1.2 Frequentist approach Differentiating with respect to each parameter and setting equal to zero: n 1 X (xi − µ) = 0; σ 2 i=1 n 1 X n (xi − µ)2 = 0. − + 3 σ σ i=1 It follows that n µ̂ = 1X xi n i σ̂ 2 = We note that σ̂ 2 = 1X (xi − x̄)2 . n n−1 2 S . n Since S 2 is unbiased, σ̂ 2 must be biased. Proposition 1.14 (the MLE of an i.i.d observation is consistent) Let {X1 , . . . , Xn } be a sequence of i.i.d. observations where i.i.d. Xk ∼ f (x; θ). Then the MLE of θ is consistent Let x1 , . . . , xn be a sample drawn from a population with PDF fθ (x). When the sample is used to estimate the parameter θ, an obvious question arises: What is the lowest variance we can achieve? Definition 1.15 (Fisher Information) Let X ∼ f (· ; θ). Then the Fisher Information is given by " 2 # ∂l(θ | x) In (θ) := n E , ∂θ where l(θ | x) is the log-likelihood. It can be shown that if the second partial derivative exists then we also have that 2 ∂ l(θ | x) . In (θ) = −n E ∂θ2 Exercise 1.12. Prove this last statement. Z Z ∂2 ∂θ f (x; θ) 2 E l(θ | x) = f (x; θ)∂θ log [f (x; θ)] dx = f (x; θ)∂θ dx ∂θ2 f (x; θ) ( ) Z ∂θ2 f (x; θ) [∂θ f (x; θ)]2 − dx, = f (x; θ) f (x; θ) f 2 (x; θ) E. Bagan 11 1 CLASSICAL ESTIMATION 1.2 Frequentist approach where we have used the obvious notation ∂θ := ∂/∂θ. The first term in the integral vanishes, since Z Z Z ∂θ2 f (x; θ) 2 2 f (x; θ) dx = ∂θ f (x; θ)dx = ∂θ f (x; θ)dx = ∂θ 1 = 0. f (x; θ) Finaly, the second term can be written as Z − ∂θ f (x; θ) f (x; θ) f (x; θ) 2 Z dx = − f (x; θ) [∂θ l(θ | x)]2 dx = − E " ∂l(θ | x) ∂θ 2 # . Theorem 1.16 (Cramer-Rao bound) Let X = (X1 , . . . , Xn ) be i.i.d. with probability density function f (x; θ). Let θ̂n = g (X) be an unbiased estimator of θ, such that the support of g(X) (the region for which the probability is not zero) does not depend on θ. Then under mild conditions we have that 1 var θ̂n ≥ . In (θ) The proof relies on the following very general theorem/definition Theorem 1.17 Let X and Y be two random variables. Then, their correlation coefficient, defined as cov(X, Y ) , corr(X, Y ) := p var(X) var(Y ) satisfies 1 ≤ corr(X, Y ) ≤ 1. We recall that cov(X, Y ) = E(XY ) − E(X) E(Y ), cov(X, X) = var(X). The content of this Theorem is that cov actually obeys the Cauchy-Schwarz inequality. Exercise 1.13. Prove Theorem 1.17. 
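Before the hint and proof of Theorem 1.17, which follow next, a short aside: the equivalence of the two expressions for the Fisher information (Definition 1.15 and Exercise 1.12) can be checked numerically for the exponential model of Exercise 1.10. The sketch below uses Python with numpy; the value λ = 2 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, reps = 2.0, 1_000_000                     # illustrative parameter value and sample size

x = rng.exponential(1 / lam, size=reps)        # draws from f(x; lambda) = lambda * exp(-lambda x)
score = 1 / lam - x                            # d/d lambda of log f(x; lambda)
hess = -np.ones_like(x) / lam**2               # d^2/d lambda^2 of log f(x; lambda)

print("E[(d l/d lambda)^2]      ~", np.mean(score**2))   # first form of I_1(lambda)
print("-E[d^2 l/d lambda^2]     ~", -np.mean(hess))      # second form of I_1(lambda)
print("analytic value 1/lambda^2 =", 1 / lam**2)
```

All three numbers agree (up to Monte Carlo noise in the first one).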
Hint. First check that linearity of the expectation, E, implies corr(aX + c, bY + c0 ) = corr(X, Y ), hence, it suffices to prove the theorem assuming E(X) = E(Y ) = 0 and var(X) = var(Y ) = 1. Next, consider the trivial inequality 0 ≤ E [(X − λY )2 ], where λ ∈ R. One can immediately check that from the very definition of E one has E(aX +c) = a E(X)+c. Then, var(aX + c) = E{[(aX + c − E(aX + c)]2 } = a2 E{[(X − E(X)]2 } = a2 var(X), and E[(aX + c)(bY + c0 )] = E[abXY + cbY + c0 aX + cc0 ] = ab E(XY ) + cb E(Y ) + c0 a E(X) + cc0 = ab[E(XY ) − E(X) E(Y )] + [a E(X) + c][b E(Y ) + c0 ], E. Bagan 12 1 CLASSICAL ESTIMATION 1.2 Frequentist approach which proves that cov(aX + c, bY + c0 ) = ab cov(X, Y ). Hence, corr(aX + c, bY + c0 ) = corr(X, Y ). We see that X − E(X) X0 = p ; var(X) Y − E(Y ) Y0 = p var(Y ) ⇒ corr(X 0 , Y 0 ) = corr(X, Y ) and E(X 0 ) = E(Y 0 ) = 0 and var(X 0 ) = var(Y 0 ) = 1. This proves the first statement in the exercise. Next, we have 0 ≤ E (X − λY )2 = λ2 − 2λ E(XY ) + 1. For this to hold, the polynomial in λ on the right hand side can have at most one root, which implies that the discriminant must be non-positive: 1 ≥ [E(X, Y )]2 = [corr(X, Y )]2 . Proof of the Cramer-Rao Bound (CRB). Consider the random variable W defined by ∂θ f (X; θ) , f (X; θ) Q where X = (X1 , . . . , Xn ), f (x; θ) is the joint PDF, f (x; θ) = ni=1 f (xi ; θ), and ∂θ := ∂/∂θ. Hence Z Z Z ∂θ f (x; θ) d n n E(W ) = f (x; θ)d x = ∂θ f (x; θ)d x = f (x; θ)dn x = 0 f (x; θ) dθ W = ∂θ log f (X; θ) = under fairly general conditions that guarantee we differentiation and integra canexchange tion. Since E(W ) = 0, we have that cov W, θ̂n = E W θ̂n , thus cov W, θ̂n Z ∂θ f (x; θ) = g (x) f (x; θ)dn x = f (x; θ) d dθ = E(θ̂n ) = = 1. dθ dθ Z d g (x) ∂θ f (x; θ)d x = dθ n Z g(x)f (x; θ)dn x From Theorem 1.17 we have 1 ≥ [corr(W, θ̂n )]2 = cov2 (W, θ̂n ) var(W ) var(θ̂n ) , which, since cov(W, θ̂n ) = 1, implies that var(θ̂n ) ≥ E. Bagan 1 . var(W ) 13 1 CLASSICAL ESTIMATION 1.2 Frequentist approach Note that up to this point we have not used the i.i.d. condition, hence, the last bound holds in the general situation. Assuming Qn now that X1 , . . . , Xn are i.i.d., we know that the joint distribution is simply f (x; θ) = i=1 f (xi ; θ), therefore W = ∂θ n X log f (Xi , θ) = i=1 n X ∂θ log f (Xi , θ). =: i=1 n X Wi i=1 Using again the independency condition, we have var(W ) = n X var(Wi ) = n E [∂θ log f (x, θ)]2 = n E [∂θ l(θ | x)]2 = In (θ). i=1 This completes the proof. In particular, the first equality in the last line of the proof states that (additivity) (1,2) I1 (1) (2) (θ) = I1 (θ) + I1 (θ) for independent random variables with joint distribution fX1X2(x1 ,x2 ;θ) = fX1 (x1 ;θ)fX2 (x2 ;θ). Being non-negative [as follows from (its very) Definition 1.15] and additive the FI has the interpretation of an information measure. Its increase indicates that a higher precision is potentially achievable in parameter estimation. In particular, at a given θ0 , In (θ0 ) = 0 proves that one cannot extract any information about the parameter from a sample, whereas divergent In (θ0 ) = ∞ implies that the true value θ0 can in principle be perfectly determined. Definition 1.18 (Efficiency) The efficiency of an unbiased estimator θ̂n of a parameter θ is defined as the ratio of the Cramer-Rao bound to the variance of θ̂n , that is 1 . eff θ̂n = In (θ) var θ̂n An estimator which has unit efficiency [the maximum value eff(θ̂n ) can take] is called efficient. Exercise 1.14. Let X1 , . . . , Xn be i.i.d. 
with PDF f (x; λ) = λe−λx . In Exercise 1.10P it was shown that λ̂MLE = 1/X̄. Show now that λ̂MLE is not unbiased, whereas λ̂n = (n − 1)/( ni=1 Xi ) is. Show that 2 eff(λ̂n ) = 1 − . n This is less than unity, and hence, it is not efficient. However, the efficiency approaches unity as n → ∞. In such cases we say that λ̂n is an asymptotically efficient estimator. Let us compute E(λ̂MLE ). We will do it brute force: Z Pn n Pn E(λ̂MLE ) = λn e−λ i=1 xi dn x. Rn i=1 xi + E. Bagan 14 1 CLASSICAL ESTIMATION 1.2 Frequentist approach Insert the identity ∞ Z δ( Pn i=1 xi − w) dw = 1, 0 where δ(x) is the Dirac delta function (distribution). We have Z ∞ −λw Z P e n δ ( ni=1 xi − w) dn x. E(λ̂MLE ) = nλ dw w 0 Rn + Scale xi as xi = wyi , then dn x = wn dn y, and Z ∞ Z n n−1 −λw E(λ̂MLE ) = nλ w e dw Rn + 0 = nλn Z ∞ wn−2 e−λw dw Z δ( Rn + 0 P δ [w ( ni=1 yi − 1)] dn y Pn i=1 yi − 1) dn y = n(n − 2)!λ vol(∆n ), P where vol(∆n ) is the volume of the simplex ∆n = {(y1 , . . . , yn ) | nk=1 yk = 1}. Using the same trick with E(1) = 1 we have Z Z Z ∞ P P n−1 −λw n −λ n n xi n i=1 w e dw δ ( ni=1 yi − 1) dn y = (n − 1)! vol(∆n ). 1= λ e d x=λ Rn + Rn + 0 Hence vol(∆n ) = 1/(n − 1)! and E(λ̂MLE ) = n n(n − 2)! λ= λ. (n − 1)! n−1 We see that λ̂MLE is not unbiased (though it is asymptotically unbiased ). Thus λ̂n = n−1 n−1 λ̂MLE = Pn n i=1 Xi is unbiased. Next, let us compute the efficiency of this estimator. We first need to compute the variance, which we do applying once again the very same trick as before 2 Z Pn n−1 Pn var(λ̂n ) = − λ λn e−λ i=1 xi dn x Rn i=1 xi + 2 Z ∞ n−1 n n n−1 w = vol(∆ )λ − λ e−λw dw w Z ∞0 λn = (n − 1)2 wn−3 − 2(n − 1)λwn−2 + λ2 wn−1 e−λw dw (n − 1)! 0 E. Bagan 15 1 CLASSICAL ESTIMATION 1.2 Frequentist approach (n − 1)2 (n − 3)! − 2(n − 1)(n − 2)! + (n − 1)! 2 λ (n − 1)! n−1 − 1 λ2 = n−2 λ2 = . n−2 = We also need the Fisher information: ( 2 ) ∂ In (λ) = n E (log λ − λx) ∂λ " 2 # 1 = nE −x λ Z ∞ x 1 2 − 2 + x e−λx = nλ 2 λ λ 0 n = 2. λ Combining these last two results we get 1 1 2 = eff λ̂n = =1− . 2 λ n n In (θ) var θ̂n · 2 λ n−2 Alternatively, we could have computed vol(∆n ) by a change of variables. For instance, if we had to compute the integral of some function g(y1 , y2 , . . . , yn ) over the simplex ∆n , i.e., Z P δ ( ni=1 yi − 1) g(y1 , y2 , . . . , yn )dn y, Ig = Rn + we could define, e.g., y1 = u1 , y2 = (1 − u1 )u2 , y3 = (1 − u1 )(1 − u2 )u3 , .. .. . . yn−1 = (1 − u1 )(1 − u2 ) · · · un−1 , yn = (1 − u1 )(1 − u2 ) · · · (1 − un−1 ). P (Note that the variable yn is not independent!) So that ni=1 yi = 1. Note that 0 ≤ ui ≤ 1, for i = 1, 2, . . . , n − 1. The Jacobian of the change is very easy to compute because the E. Bagan 16 1 CLASSICAL ESTIMATION 1.2 Frequentist approach Jacobian matrix is lower triangular 1 0 0 ∗ 1 − u1 0 ∂(y1 , . . . , yn−1 ) ∗ ∗ (1 − u )(1 − u2 ) 1 = ∂(u1 , . . . , un−1 ) .. .. .. . . . ∗ ∗ ∗ Hence ··· ··· ··· .. . ··· 0 0 0 .. . . Qn−2 i=1 (1 − xi ) ∂(y1 , . . . , yn−1 ) = (1 − un−2 )(1 − un−3 )2 (1 − un−4 )3 · · · (1 − u1 )n−2 , ∂(u1 , . . . , un−1 ) and we have Z 1 Z 1 Z 1 Q dun−1 (1 − un−2 )dun−2 · · · (1 − u1 )n−2 g u1 , (1 − u1 )u2 , . . . , ni=1 (1 − ui ) du1 . Ig = 0 0 0 In the particular case g ≡ 1 we obtain vol(∆n ) = I1 = 1 · 1 1 1 1 · ··· = . 2 3 n−1 (n − 1)! Proposition 1.19 (Asymptotic normality) Let {X1 , · · · , Xn } be a sequence of i.i.d. observations where i.i.d. Xk ∼ f (x; θ). Let θ̂ be a MLE of θ, then √ d n(θ̂n − θ) → N 0, 1 I1 (θ) . (See Lehmann, Elements of Large Sample Theory, Springer, 1999 for a proof.) 
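The numbers appearing in Exercise 1.14 are easy to confirm by simulation. The following sketch (Python with numpy; the values of λ and n are arbitrary illustrative choices) estimates E(λ̂MLE), checks that λ̂n = (n − 1)/Σi Xi is unbiased, and compares its efficiency with the value 1 − 2/n derived above.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n, reps = 1.5, 10, 500_000          # illustrative values

x = rng.exponential(1 / lam, size=(reps, n))
lam_mle = 1 / x.mean(axis=1)             # lambda-hat_MLE = 1 / X-bar
lam_unb = (n - 1) / x.sum(axis=1)        # lambda-hat_n = (n - 1) / sum_i X_i

print("E[lambda_MLE] ~", lam_mle.mean(), "  theory: n lambda/(n-1) =", n * lam / (n - 1))
print("E[lambda_n]   ~", lam_unb.mean(), "  (true lambda =", lam, ")")

crb = lam**2 / n                         # Cramer-Rao bound 1/I_n(lambda), with I_n = n/lambda^2
print("eff(lambda_n) ~", crb / lam_unb.var(), "   theory: 1 - 2/n =", 1 - 2 / n)
```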
The meaning of convergence in distribution is given in this Definition 1.20 (Convergence in Distribution) A sequence of random variables {X1 , X2 , X3 , . . . } d converges in distribution to a random variable X, shown by Xn → X, if lim FXn (x) = FX (x) n→∞ for all x at which the CDF, FX (x), is continuous. At this point it is also useful to recall the central limit theorem, which we quote without proof Theorem 1.21 (Central Limit Theorem). Let X1 , X2 , . . . be i.i.d. random variables with E (Xi ) = µ and var (Xi ) = σ 2 < ∞. Define Pn Xi − nµ X̄ − µ √ . Zn := i=1 √ = σ n σ/ n Then the distribution function of Zn converges to the distribution function of a standard normal random variable as n → ∞. I.e., Zn converges in distribution to a normally distributed random variable X ∼ N(0, 1). E. Bagan 17 1 CLASSICAL ESTIMATION 1.2 Frequentist approach Exercise 1.15. Show that a MLE is asymptotically efficient. √ d d If θ̂n is a MLE, n(θ̂n − θ) → N(0, I1−1 (θ)), which implies θ̂n → N(θ, In−1 (θ)). Hence var(θ̂n ) → In−1 and eff(θ̂n ) → 1. We next wonder if a given estimator extracts all the information about the unknown parameter θ that is available in our samples. Assume we have observed a particular value of θ̂. In general, there would be various outcomes x = (x1 , . . . , xn ) that would lead to this particular estimate. If their distribution does not depend on θ, knowing which of them has specifically happened does not provide further information about the value of θ. This motivates the following definition. Definition 1.22 (Sufficient Statistic). Let X = (X1 , . . . , Xn ) be i.i.d. from a probability distribution with parameter θ. Then the statistic T (X) is called a sufficient statistic for θ if the conditional distribution of X1 , . . . , Xn given the value of T does not depend on θ. There is no need to compute conditioned PDFs to check whether some statistic is sufficient thanks to the following Theorem 1.23 A statistic T (X) is a sufficient statistic for θ if and only if the joint probability density of X can be factorised into two factors, one of which depends only on T and the parameters while the other is independent of the parameters: f (x; θ) = g(t; θ)h (x) . We do not prove this theorem here. The second factor may be written in terms of t, since it is a functions of the outcomes, but it cannot depend on θ. Theorem 1.24 Efficient estimators are sufficient. The converse is not true; there exist sufficient estimators/statistics that are not efficient. Proof Theorem 1.23. From the proof of Theorem 1.16 we know that if θ̂ is unbiased cov(W, θ̂) = 1. If, moreover, θ̂ is efficient, var(θ̂) = 1/ var(W ), hence h i2 E W − var(W )(θ̂ − θ) = var(W ) + var(W )2 var(θ̂) − 2 var(W ) E[W (θ̂ − θ)] = 2 var(W ) − 2 var(W ) cov(W, θ̂) = 0. Since E(X 2 ) = 0 ⇒ Pr(X = 0) = 1, it must be the case that W = var(W )(θ̂ − θ) := a(θ)θ̂ + b(θ). E. Bagan 18 1 CLASSICAL ESTIMATION 1.2 Frequentist approach But, since W = ∂ log f (X; θ), ∂θ we see that h i h i f (X; θ) = exp A(θ)θ̂ + B(θ) + C(X) = exp A(θ)θ̂ + B(θ) K(X). So, f (x; θ) = exp[A(θ)θ̂ + B(θ)]K(x), and since according to Theorem 1.23 this is the required factorization for sufficientcy, θ̂ is sufficient. Example 1.4. Suppose we use x̄ to estimate λ, the parameter of the Poisson distribution p(k; λ) = λk −λ e . k! For a sample of size n we have p (x1 , x2 , . . . , xn ; λ) = n Y e−λ λxi xi ! i=1 Pn = e−nλ λ n Y 1 = e−nλ λnx̄ x! i=1 i i=1 xi n Y 1 x! i=1 i ! , which is the required factorization according to Theorem 1.23. Hence x̄ is sufficient. 
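Sufficiency in the sense of Definition 1.22 can also be seen directly in a simulation: once we condition on the value of T = Σi Xi, the distribution of the Poisson sample no longer depends on λ (in fact, X1 | T = t is Binomial(t, 1/n), a standard property of i.i.d. Poisson variables). The sketch below checks this in Python with numpy; the sample size n, the conditioning value t and the λ values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, t, reps = 4, 8, 2_000_000             # sample size, conditioning value of T, Monte Carlo size

for lam in (0.5, 2.0, 5.0):              # different true parameter values
    x = rng.poisson(lam, size=(reps, n))
    keep = x[x.sum(axis=1) == t]         # condition on the sufficient statistic T = sum_i X_i
    # The conditional distribution of X_1 given T = t should be Binomial(t, 1/n) for every lambda.
    print(f"lambda={lam}:  Pr(X_1 = 0 | T = {t}) ~ {np.mean(keep[:, 0] == 0):.3f}"
          f"   (Binomial value: {(1 - 1/n)**t:.3f})")
```

The estimated conditional probability is the same for all three values of λ, which is precisely the statement of Definition 1.22.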
Exercise 1.16. Show that (X̄, S 2 ), defined in Example 1.2, is sufficient statistic to estimate the parameters µ and σ 2 of a normal distribution, Example 1.1. Theorem 1.25 From any unbiased estimator that is not based on a sufficient statistic, an improved estimate can be obtained which is based on the sufficient statistic. It is unbiased and it has smaller variance, and is obtained by averaging with respect to the conditional distribution given the sufficient statistic. So, if R (X) is an unbiased estimate of the parameter θ and T (X) is a sufficient statistic for θ. The conditional distribution of R given T is fR | T (r | t) = fRT (r, t; θ) , fT (t; θ) where fRT (r, t; θ) is the joint probability density function of R and T , and Z ∞ fT (t; θ) = f (r, t; θ)dr −∞ E. Bagan 19 1 CLASSICAL ESTIMATION 1.2 Frequentist approach is the marginal distribution for T . Because T is a sufficient statistic fR | T (r | t) does not depend on θ. The improved estimate of θ, S(T ), is the function of T that is obtained by averaging R with respect to its conditional distribution given T . Z ∞ rfR | T (r | T )dr. S(T ) := E[R | T ] = −∞ Exercise 1.17. Prove Theorem 1.25. Since R is an unbiased estimator of θ it satisfies Z ∞Z ∞ rfRT (r, t; θ)dr dt = θ. E[R] = −∞ −∞ Let us check that S is also unbiased: Z ∞Z ∞ Z ∞ rfR | T (r, t)fT (t; θ)dr dt s(t)fT (t; θ) dt = E(S) = −∞ ∞ −∞ Z ∞Z ∞ = rfRT (r, t; θ)dr dt = E(R) = θ. −∞ ∞ It only remains to check that var(S) ≤ var(R): var(R) = E (R − θ)2 = E [(S − θ) + (R − S)]2 = var(S) + E (R − S)2 + 2E[(R − S)(S − θ)]. (1.3) Let us check that the last term is identically zero: Z ∞Z ∞ [r − s(t)][s(t) − θ]fR,T (r, t; θ)dr dt E[(R − S)(S − θ)] = −∞ −∞ Z ∞ Z ∞ [r − s(t)]fR | T (r | t)dr [s(t) − θ]fT (t; θ)dt. = −∞ −∞ But the inner integral is zero. Since E[(R−S)2 ] ≥ 0 we see from Eq. (1.3) that var(S) ≤ var(R). Exercise 1.18. We toss a coin n times [head = 1, tail = 0]. We decide to use X1 (the result of the first toss, ignoring the rest) as estimate of the probability p of a head, i.e., θ̂ = X1 . Check θ̂ is unbiased. Compute its variance. Next, check that in n trials the proportion of heads is a sufficient statistic. Construct an improved estimate from θ̂ based on the proportion of heads in n trials using Theorem 1.25 and check explicitly that it has smaller variance. E. Bagan 20 1 CLASSICAL ESTIMATION 1.3 1.3 Bayesian approach Bayesian approach Within the Bayesian approach, the estimated parameter θ is assumed to be a random variable, Θ, that is distributed according to a prior PDF, f (θ), representing the knowledge about θ one possesses before performing the estimation. Therefore, in contrast to the frequentist philosophy, where the estimated parameter was assumed to have a fixed, well defined value, it is a particular realization of the parameter that is really estimated in a real-life experiment. As a consequence, an optimal estimator must not only be global and minimize the MSE, but also has to take into account which values of Θ are more probable according to f (θ). Hence, such an estimator must minimize the Average Mean Squared Error (MSE): Z Z MSE(θ̂) = f (θ)dθ (θ̂ − θ)2 f (x | θ)dn x, where we recall that the estimator θ̂ is a function of the sample x = (x1 , . . . , xn ) and where f (x | θ) is the PDF previously labelled as f (x; θ) within the frequentist approach, which due to stochastic character of the parameter now represents a conditional density. 
The last definition can also be written as Z MES(θ̂) = (θ̂ − θ)2 f (x, θ)dn x dθ, where the joint PDF, f (x, θ), is defined via Bayes’ theorem –hence the name of the approach– in two equivalent ways: f (x, θ) = f (x | θ)f (θ) = f (θ | x)f (x) [we abuse notation here by using the same letter f to denote all PDFs. In a more precise notation one should write fXΘ (x, θ) = fX|Θ (x|θ)fΘ (θ) = fΘ|X (θ|x)fX (x), but the R we drop n Rsubscripts to ease the notation] In general, the conditional PDFs satisfy f (x | θ)d x = f (θ | x)dθ = 1 and the probability of a particular sample corresponds to the marginal R f (x) = f (x, θ)dθ. We, then, can also write Z Z n 2 MSE(θ̂) = f (x)d x (θ̂ − θ) f (θ | x)dθ . (1.4) The minimum of this expression is attained by minimizing the square bracket for each outcome x: Z Z 2 ∂θ̂ (θ̂ − θ) f (θ | x)dθ = 0 ⇒ θ̂ = θf (θ | x)dθ = EΘ|X (Θ). (1.5) The optimal Minimum Mean Squared Error (MMSE) estimator simply corresponds to the average parameter value computed with respect to the posterior PDF, f (θ | x), that in principle may always be computed using Bayes’ theorem: f (θ | x) = R E. Bagan f (x | θ)f (θ) . f (x | θ)f (θ)dθ (1.6) 21 1 CLASSICAL ESTIMATION 1.3 Bayesian approach Within the Bayesian framework, one should view the process of data inference as a procedure in which the effective PDF of the estimated parameter θ becomes updated. Hence, the posterior PDF f (θ | x) represents the prior f (θ) that has been reshaped and narrowed-down after learning the sample x: observe x f (θ) −→ f (θ | x) whereas the MMSE estimator (1.5) just outputs the mean of such an effective distribution. Moreover, the minimal MSE (1.4) then reads Z Z Z 2 n MSE(θ̂) = f (x)d x f (θ | x) θ − EΘ|X (Θ) = f (x) var Θ|X (Θ)dn x, so that it represents the variance of the parameter Θ computed also with respect to f (θ | x) and averaged over all the possible outcomes. It is really important within the Bayesian approach to choose an appropriate f (θ) such that, on one hand, it adequately represents the knowledge about the parameter before the estimation, but, on the other, it does not significantly overshadow the information obtained from the data collected. Exercise 1.19. Consider the extremal case where the prior PDF is the Dirac delta distribution, fδ (θ) = δ (θ − θ0 ), which represents the case when we perfectly know the estimated parameter before performing the estimation. Compute the MMSE θ̂ and discuss the role of the observations. What is MSE(θ̂)? Note that so far we did not require at any stage the sampled data to be independently distributed. Such property, which previously was heavily used within the frequentist approach, is not necessary in the derivation of the optimal Bayesian estimator, which relies only on the form of the posterior PDF (1.6). In fact, as independently distributed data may be interpreted as if it was collected carrying out consecutive repetitions of the estimation protocol, the Bayesian results in such a case may be understood as a progressive updating of the knowledge we possess about the parameter, where at each step the posterior is calculated based only on the outcomes xi but for the prior already updated with the results xi−1 . observe x1 observe x2 observe x3 f (θ) −→ f (θ | x1 ) −→ f (θ | x1 , x2 ) −→ f (θ | x1 , x2 , x3 ) · · · Exercise 1.20. Show that the interpretation of (1.6) for independent samples as progressive updating of the prior PDFs is correct. Consider the obvious relations [We use independency in the very first line: f (x1 , . . . 
, xn | θ) = Q n i=1 f (xi | θ) = f (xn | θ)f (x1 , . . . , xn−1 | θ)]: E. Bagan 22 1 CLASSICAL ESTIMATION 1.3 Bayesian approach f (θ | x1 , x2 , . . . , xn ) = R f (xn | θ)f (x1 , . . . , xn−1 | θ)f (θ) f (xn | θ)f (x1 , . . . , xn−1 | θ)f (θ)dθ =R f (xn | θ)f (θ | x1 , . . . , xn−1 )f (x1 , . . . , xn ) f (xn | θ)f (θ | x1 , . . . , xn−1 )f (x1 , . . . , xn )dθ =R f (xn | θ)f (θ | x1 , . . . , xn−1 ) . f (xn | θ)f (θ | x1 , . . . , xn−1 )dθ So, the updating in the last step from the prior PDF f (θ | x1 , . . . , xn−1 ) is based entirely on the observation xn . We can, obviously, repeat the procedure with f (θ | x1 , . . . , xn−1 ), and so on. The MMSE estimator plays and special role because of the following result. For any regular prior f (θ) one has 1 1 n→∞ , (1.7) MSE(θ̂) = EΘ ≥ In (Θ) EΘ [In (Θ)] where the last expression follows from the Jensen inequality stating that for any convex function f (X) one has E[f (X)] ≥ f [E(X)]. If the Fisher information In is independent of θ we can, of course, drop the expectations and the bound saturates. This relation enables us to establish a connexion between the Bayesian and frequentist approaches. Alternatively, one could prove the last inequality in (1.7) by invoking the Hölder inequality: Z Z |g(x)h(x)|dx ≤ For g(x) = p p |g(x)| dx xf (x), h(x) = f (x)dx ≤ q 1/q |h(x)| dx , for all p, q such that 1 1 + = 1. p q p f (x)/x, p = q = 2, we have Z Z 1= 1/p Z 1/2 Z 1/2 f (x) |xf (x)dx dx , x Therefore 1 ≤ E(X) E(X −1 ) ⇒ E(X −1 ) ≥ assuming x > 0. 1 . E(X) Within the Bayesian framework, nothing prevents us to consider other figures of merit, i.e., cost functions C(θ̂, θ), in order to generalize the MSE, and define the average cost, EΘ [C(θ̂)]: Z Z C(θ̂) = f (θ)dθ C(θ̂, θ)f (x | θ)dx. The MSE is the special case C(θ̂, θ) = (θ̂ − θ)2 . E. Bagan 23 2 QUANTUM ESTIMATION Example 1.5. To deal with a circularly symmetric parameter, we can consider the simplest cost function introduced by Holevo: ! θ̂ − θ . (1.8) CH (θ̂, θ) = CH (θ̂ − θ) = 4 sin2 2 It is periodic (as it should if one has circular symmetry) and CH (θ̂, θ) ∼ (θ̂ − θ)2 (i.e., approaches the squared error) as θ̂ → θ. 2 2.1 2.1.1 Quantum Estimation Frequentist (pointwise) approach The quantum Cramer-Rao bound As we mentioned in the Rintroduction, in quantum mechanics we have (Born rule) f (x; θ) = tr (Ex ρθ ), where {Ex } , dxEx = 1, are the elements of a POVM and ρx is the density operator parametrized by the quantity we want to estimate. Let us introduce the Definition 2.1 [Symmetric Logarithmic Derivative (SLD)] The SLD, Lθ is the self-adjoint operator satisfying the equation ∂ρθ Lθ ρθ + ρθ Lθ = = ∂θ ρθ . 2 ∂θ Note that ∂θ f (x | θ) = ∂θ tr [Ex ρθ ] = tr [Ex ∂θ ρθ ] Lθ ρθ + ρθ Lθ = tr Ex 2 1 1 = tr [Ex Lθ ρθ ] + tr [Ex ρθ Lθ ] 2 2 i∗ 1 1h = tr [Ex Lθ ρθ ] + tr (Ex ρθ Lθ )† 2 2 1 1 = tr [Ex Lθ ρθ ] + [tr Lθ ρθ Ex ]∗ , 2 2 where we have used the cyclic property of the trace. We can then write ∂θ f (x | θ) = < [tr (ρθ Ex Lθ )] . We can use this result to express the Fisher information as Z {< [tr (ρθ Ex Lθ )]}2 I1 (θ) = dx . tr(ρθ Ex ) E. Bagan (2.9) 24 2 QUANTUM ESTIMATION 2.1 Frequentist (pointwise) approach The numerator of the integrant can be bounded as √ p p √ 2 2 2 {< [tr (ρθ Ex Lθ )]} ≤ |tr (ρθ Ex Lθ )| = tr ρθ Ex Ex Lθ ρθ √ p p √ p p √ √ ρθ Ex Ex ρθ tr Ex Lθ ρθ ρθ Lθ Ex ≤ tr (2.10) (2.11) = tr(ρθ Ex ) tr(Lθ Ex Lθ ρθ ), where we have used the Schwartz inequality: 2 tr A† B ≤ tr A† A tr B † B . (2.12) We have also used that ρθ , Ex ≥ 0 and Lθ is self-adjoint. 
Substituting this bound in Eq. (2.9) we have Z Z dxEx Lθ ρθ = tr(L2θ ρθ ), I1 (θ) ≤ dx tr(Lθ Ex Lθ ρθ ) = tr Lθ (2.13) which states that the Fisher information I1 (θ) of any quantum measurement is bounded by the so-called Definition 2.2 (Quantum Fisher Information QFI). The QFI is defined as H(θ) := tr(L2θ ρθ ) = tr [Lθ (∂θ ρθ )] . The second form of the definition follows from tr(L2θ ρθ ) = ρθ Lθ + Lθ ρθ tr(Lθ ρθ Lθ ) + tr(L2θ ρθ ) = tr{Lθ } = tr {Lθ (∂θ ρθ )} . 2 2 Eq. (2.13), through Theorem 1.16, leads to Theorem 2.3 (Quantum Cramér-Rao bound) The variance of any estimator θ̂n of the parameter θ characterizing the family of states ρθ is bounded by var(θ̂n ) ≥ 1 . nH(θ) This is the quantum version of the Cramér-Rao theorem and provides an ultimate bound the the sensitivity that can be achieved in parameter estimation in the quantum mechanical framework. The quantum Fisher Information is an upper bound for the Fisher Information as it embodies the optimization of the Fisher Information over any possible measurement. Optimal quantum measurements for the estimation of θ thus correspond to POVM with Fisher information equal to the quantum Fisher information, i.e., those saturating both inequalities Eq. (2.10) and Eq. (2.11). The first one is saturated when tr [ρθ Ex Lθ ] is a real number. The second one is based on the Schwartz inequality, Eq. (2.12), which is saturated when matrices A and B are proportional. Hence, we must have p √ p √ Ex ρθ = cx Ex Lθ ρθ E. Bagan 25 2 QUANTUM ESTIMATION 2.1 Frequentist (pointwise) approach for all x. This condition can always be met by choosing the operators Ex to be assembled from one-dimensional projectors onto a complete set of orthonormal eigenstates of Lθ . Exercise 2.1. Consider the set of qubit pure states that lie in the equator of the Bloch sphere (θ = π/2). They are parametrized by the azimuthal angle φ. Compute the SLD, Lφ , for this one-parameter family. Compute the QFI, H(φ). Show that by measuring n copies of these equatorial states on the orthogonal bases {|+i, |−i} and using the MLE estimator to process the classical data obtained from the measurement the QCR bound is saturated asymptotically. You should attempt to solve this exercise by yourself, but because of its relevance to a discussion about the Heisenberg limit below, we provide a solution here. Solution. Let us compute the SLD brute force (we will learn about other methods below). Define the unit vectors r̂ = (cos φ, sin φ, 0) and φ̂ = (− sin φ, cos φ, 0) = ∂φ r̂, we have ρφ = Write 1 + r̂ · σ 2 ⇒ ∂φ ρφ = φ̂ · σ . 2 Lφ = a1 + b · σ. Then, using the definition of the SLD, we must have 2φ̂ · σ = (a1 + b · σ)(1 + r̂ · σ) + (1 + r̂ · σ)(a1 + b · σ) X = 2a1 + 2(b + ar̂) · σ + (bi r̂j + bj r̂i )σi σj ij = 2(a + b · r̂)1 + 2(b + ar̂) · σ. It follows that a + b · r̂ = 0 and b + ar̂ = φ̂. Since r̂ · φ̂ = 0, the second condition implies the first one and b = φ̂ − ar̂ for any a. Hence, the SLD is not uniquely defined: Lφ = a1 + (φ̂ − ar̂) · σ for any a ∈ R. From the definition of QFI, we readily see that ) ( h i φ̂ · σ = (φ̂ − ar̂) · φ̂ = 1. H(φ) = tr a1 + (φ̂ − ar̂) · σ 2 E. Bagan 26 2 QUANTUM ESTIMATION 2.1 Frequentist (pointwise) approach The PMF of the measurement {|+i, |−i} is 1 + eiφ p(+; φ) = tr (|+ih+|ρφ ) = |h+|ψφ i| = 2 φ , = cos2 2 φ 2 p(−; φ) = sin . 2 2 2 So, we see that the outcome of each individual measurement is a Bernoulli random variable. By assigning 1 to outcome + and 0 to outcome −, the PMF of such variable is φ φ 2x 2(1−x) X ∼ p(x; φ) = cos sin . 
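The solution obtained so far can be checked numerically. The sketch below (Python with numpy; φ = 0.8 is an arbitrary illustrative value) verifies that Lφ = φ̂ · σ (the choice a = 0) satisfies the defining equation of the SLD, that H(φ) = 1, and that the classical Fisher information of the {|+⟩, |−⟩} measurement equals the QFI, as claimed.

```python
import numpy as np

# Pauli matrices and the identity
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

phi = 0.8                                        # illustrative value of the phase
r = np.array([np.cos(phi), np.sin(phi), 0.0])    # r-hat
t = np.array([-np.sin(phi), np.cos(phi), 0.0])   # phi-hat = d(r-hat)/d(phi)

rho = 0.5 * (I2 + r[0] * sx + r[1] * sy + r[2] * sz)    # equatorial qubit state
drho = 0.5 * (t[0] * sx + t[1] * sy + t[2] * sz)        # d(rho)/d(phi)
L = t[0] * sx + t[1] * sy + t[2] * sz                   # SLD with the choice a = 0

# Defining equation of the SLD and the QFI H(phi) = tr(rho L^2)
print("SLD equation residual:", np.abs(0.5 * (L @ rho + rho @ L) - drho).max())
print("QFI H(phi)           :", np.real(np.trace(rho @ L @ L)))

# Classical Fisher information of the {|+>, |->} measurement (two outcomes, p and 1 - p)
plus = np.array([1, 1], dtype=complex) / np.sqrt(2)
p = np.real(plus.conj() @ rho @ plus)                   # p(+; phi) = cos^2(phi/2)
dp = np.real(plus.conj() @ drho @ plus)                 # d p(+; phi)/d phi
print("Fisher information I_1:", dp**2 / p + dp**2 / (1 - p))
```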
2 2 The (log-)likelihood function is 1 ± cos φ L(φ | ±) = ; 2 l(φ | ±) = log Hence ∂φ l(φ | ±) = 1 ± cos φ 2 . ∓ sin φ , 1 ± cos φ and " I1 (φ) = E ∓ sin φ 1 ± cos φ 2 # sin2 φ 1 − cos φ sin2 φ 1 + cos φ + = 1. = (1 + cos φ)2 2 (1 − cos φ)2 2 From which In (φ) = n, and we know that the MLE of this measurement will saturate the bound asymptotically. Measuring each individual copy of the given n states on the {|+i, |−i} basis we have nx̄ n(1−x̄) φ φ 2 p(x1 , . . . xn ; φ) = cos sin . 2 2 2 Checking the right hand side of this expression we clearly see that X̄ is a sufficient statistic for φ. We also see that X̄ is binomially distributed: φ X̄ ∼ Bin n, cos 2 2 ⇔ pX̄ (x̄; φ) = n nx̄ nx̄ n(1−x̄) φ φ 2 2 cos sin . 2 2 Hence, L(φ | x̄) = pX̄ (x̄; φ). A straightforward derivation leads to the expression of the MLE φ̂ = arccos (2x̄ − 1) . E. Bagan 27 2 QUANTUM ESTIMATION 2.1 Frequentist (pointwise) approach This completes the solution of the exercise. Going back to our general discussion, one can find a closed form for the SLD in terms of the spectral representation of ρθ , X ρθ = λa |ψa ihψa |. a It is given by Lθ = X 2hψa |∂θ ρθ |ψb i λa + λb a,b |ψa ihψb |, (2.14) where the sum extends to all a and b such that λa + λb 6= 0. Exercise 2.2. Prove Eq. (2.14). 2.1.2 The pure state model A much simpler expression can be written for pure states. A straightforward calculation gives Lθ = 2∂θ ρθ = 2 (|∂θ ψθ i hψθ | + |ψθ i h∂θ ψθ |) . (2.15) From this result one can easily obtain the QFI of this so called pure state model: H(θ) = 4 h∂θ ψθ |ψθ i2 + h∂θ ψθ |∂θ ψθ i . (2.16) Exercise 2.3. Prove that for pure states one can choose the SLD to be given by Eq. (2.15). By using this result, prove Eq. (2.16). Pure states satisfy ρ2θ = ρθ , hence ∂θ ρθ = ∂θ ρ2θ = (∂θ ρθ ) ρθ + ρθ (∂θ ρθ ) . By comparing with the definition of Lθ , ∂ρθ = Lθ ρθ + ρθ Lθ , 2 we see that we can choose Lθ /2 = ∂θ ρθ , and Lθ = 2∂θ ρθ = 2∂θ (|ψθ ihψθ |) = 2 (|∂θ ψθ ihψθ | + |ψθ ih∂θ ψθ |) , where, assuming {|αi} is a fixed (θ-independent) basis of the Hilbert space, ! X X α |∂θ ψθ i = ∂θ ψθ |αi = (∂θ ψθα ) |αi α E. Bagan α 28 2 QUANTUM ESTIMATION 2.1 Frequentist (pointwise) approach and ! h∂θ ψθ | = ∂θ X (ψθα )∗ hα| X = (∂θ ψθα )∗ hα|. α α The QFI is easily derived noticing that for pure states one can write H(θ) = tr [Lθ (∂θ ρθ )] = 1 tr L2θ . 2 Then, H(θ) = 2 tr [(|∂θ ψθ ihψθ | + |ψθ ih∂θ ψθ |) (|∂θ ψθ ihψθ | + |ψθ ih∂θ ψθ |)] = 2 hψθ |∂θ ψθ i2 + h∂θ ψθ |ψθ i2 + 2h∂θ ψθ |∂θ ψθ i = 4 h∂θ ψθ |ψθ i2 + h∂θ ψθ |∂θ ψθ i , where we have used that hψθ |∂θ ψθ i = −h∂θ ψθ |ψθ i, which follows from taking derivative of hψθ |ψθ i = 1. Let us consider the case where the parameter of interest, θ, is the amplitude of a unitary perturbation imprinted to a given initial pure state |ψ0 i. The family of quantum states we are dealing with may be expressed as |ψθ i = Uθ |ψ0 i, where Uθ = exp{−iθH} is a unitary operator and H is the corresponding Hermitian generator (we may think of it as the “Hamiltonian” of the system). This example is of particular interest in metrology. From Eq. (2.16) the QFI can be easily computed to be H(θ) = 4(∆H)2ψθ = 4(∆H)2ψ0 , (2.17) where (∆H)ψ is the standard deviation (uncertainly) of the hermitian operator H in the state |ψi, defined through (∆H)2ψ = hψ|H 2 |ψi − hψ|H|ψi2 (it is just the variance in the quantum mechanical sense). The QCR bound is then var(θ̂) ≥ 1 , 4n(∆H)2ψ0 and we note that it is independent of θ, hence providing a global bound. Exercise 2.4. Derive Eq. (2.17). E. 
Bagan 29 2 QUANTUM ESTIMATION 2.1 Frequentist (pointwise) approach |∂θ ψθ i = ∂θ Uθ |ψ0 i = −iHUθ |ψ0 i = −iH|ψθ i. h∂θ ψθ |ψθ i = ihψθ |H|ψθ i = ihψ0 |H|ψ0 i, h∂θ ψθ |∂θ ψθ i = hψθ |H 2 |ψθ i = hψ0 |H 2 |ψ0 i. Substituting these expressions in Eq. (2.16) we obtain the desired result H(θ) = 4 (ihψθ |H|ψθ i)2 + hψθ |H 2 |ψθ i = 4 (∆H)2ψθ = 4 (∆H)2ψ0 . The lower bound we have obtained is a function of the reference state. We should now find what is the best state |ψ0 i to estimate θ. We will prove that Claim 2.4 The maximum value that (∆H)ψ can achieve is half of the so called spread of H, namely, |λmax − λmin | (∆H)ψ ≤ . 2 This value is attainable with the choice |ψ0 i = |λmin i + |λmax i √ , 2 (2.18) where |λmin i (|λmax i) is the eigenstate of the minimum (maximum) eigenvalue, λmin , (λmax ) of H. Then, the QCR bound for this family of states reads var(θ̂) ≥ 1 . n(λmax − λmin )2 (2.19) Proof of the claim. Let the spectral decomposition of H be given by X H= λa |λa ihλa |. a A generic state |ψi can be written in this eigenbasis as X |ψi = ψa |λa i. a Then, !2 (∆H)2ψ = X a=0 E. Bagan λ2a pa − X λa pa , (2.20) a 30 2 QUANTUM ESTIMATION 2.1 Frequentist (pointwise) approach P where pa := |ψa |2 ≥ 0 are (of course!) probabilities, a pa = 1. Instead H let us consider the operator H − λmin 1 , H̃ = λmax − λmin whose minim eigenvalue, λ̃min , is zero and maximum eigenvalue, λ̃max , is one. We obviously have 2 (∆H)2ψ = (λmax − λmin )2 ∆H̃ . (2.21) ψ Let us now show that (∆H̃)2ψ = 1/4. Eq. (2.20) is completely general, so it also holds for H̃, P but λ̃2a ≤ λ̃a . Hence, if we define u := a λ̃a pa ≥ 0, we have ∆H̃ 2 ≤ u − u2 , for all u ≥ 0. (2.22) ψ The maximum value of the right hand side is 1/4 (for u = 1/2). Substituting in (2.21) we obtained the desired bound: 2 λmax − λmin 2 (∆H)ψ = . 2 The bound (2.22) is attained iff pa = 0 for all λ̃a with the exception of λ̃b = 1 = λ̃max and λ̃c = 0 = λ̃min (since it follows from the inequality λ̃2a ≤ λ̃a ). Thus, any state of the form |ψi = |λmin i + eiw |λmax i |λ̃min i + eiw |λ̃max i √ √ = . 2 2 attains the maximum. In particular we can choose w = 0. 2.1.3 The Heisenberg limit Now we are going to show something remarkable. Let us go back to Exercise 2.1. The state ρφ is pure, so ρφ = |ψφ ihψφ |, and |ψφ i = |0i + e−iφ |1i √ = Uφ |ψ0 i, 2 where −iφH Uφ = e = |0ih0| + e −iφ |1ih1|, |ψ0 i = |+i, H = |1ih1| ⇒ λmin = 0 . λmax = 1 Using the QCR bound derived in Eq. (2.19) we recover the result of the exercise, namely var(φ̂) ≥ E. Bagan 1 1 = . 2 n(1 − 0) n 31 2 QUANTUM ESTIMATION 2.2 Bayesian approach and this bound is attainable. If we repeat the same experiment (measuring n copies to estimate φ) N times the variance would, of course, be bounded as var(φ̂) ≥ 1 . Nn (2.23) Suppose, however, that we proceed in a different way. Instead of preparing a product state, |ψ0 i⊗n , and measuringP each copy separately, we view the n copies as a whole system, S, with “Hamiltonian” HS = nk=1 Hk , where Hk = |1ik h1| ⊗ 1S−{k} is the “Hamiltonian” of the k-th qubit, and choose the fiducial state as in (2.18). In this case |0i⊗n + |1i⊗n |λmin i + |λmax i λmin = 0 √ √ = ⇒ |ψ0 i = λmax = n 2 2 (note it is a highly entangled –hence very fragile– state!). Then, according to (2.19), if we repeat the experiment N times we will get a much enhanced sensitivity, with a variance scaling quadratically with the inverse of the size of the system var(θ̂) ≥ 1 1 = . 
2 N (n − 0) N n2 We refer to this behavior, var(θ̂) ∼ 1/n2 , as the Heisenberg limit, in contrast to the standard quantum (or shot-noise) limit, where var(θ̂) ∼ 1/n [as in Eq. (2.23)]. In quantum metrology based on interferometry, the Heisenberg limit can be achieved using squeezed states (instead of the “classical” coherent states that have shot-noise limited sensitivity). This falls beyond the scope of this course and will not be discussed here. 2.2 Bayesian approach To give a flavor of what the Bayesian approach is about, let us look at the pure state model of Section 2.1.2 from this point of view. We will use the cost function (1.8) introduced in Example 1.5. The averaged cost function is ! Z 2π Z dφ φ̂ − φ x CH (φ̂) = 4 sin2 tr Ex Uφ |ψ0 ihψ0 |Uφ† dm x, 2π 2 0 where |ψ0 i ∈ (C2 )⊗n and we emphasize that the estimate φ̂ depends on the outcomes x by writing φ̂x . We assume, without loss of generality, that the outcomes, x, of the measurement are continuous random variables (vectors of some dimension m; the “volume element” dm x R m is normalized such that d x = 1). This can be witten as Z 2π Z dφ CH (φ̂) = 2 1 − cos(φ̂x − φ) hψ0 |Uφ† Ex Uφ |ψ0 idm x 2π 0 Z 2π Z dφ † iφ̂x −iφ m =2−< e e hψ0 |Uφ Ex Uφ |ψ0 id x . 2π 0 E. Bagan 32 2 QUANTUM ESTIMATION 2.2 Bayesian approach The unitary matrix Uφ acts on (C2 )⊗n , so it can be written as Uφ = n X e−ikφ |kihk|, (2.24) k=0 where each |ki spans the (one-dimentional) irreducible representations of the unitary group U (1). Explicitly they are −1/2 n |ki = eiϕk |0i ⊗ · · · ⊗ |0i ⊗ |1i ⊗ · · · ⊗ |1i +permutations . | {z } k (2.25) k The normalization coefficient comes about because there are nk different orthogonal terms on the right hand side and the phases ϕk are arbitrary; we can choose as we wish. We now note that Z 2π n X dφ −iφ † e Uφ ⊗ Uφ = |k + 1ihk + 1| ⊗ |kihk|. (2.26) 2π 0 k=0 R 2π This is so because 0 dφ/(2π) exp(isφ) = 0 for any s ∈ Z, s 6= 0. Using this property, we readily see that ( n Z ) X CH (φ̂) = 2 − 2< eiφ̂x ck+1 ck (Ex )k+1,k dm x . (2.27) k=0 Here we have introduced the definitions (Ex )k+1,k := hk + 1|Ex |ki and ck := hk|ψ0 i, where the arbitrary phases ϕk in (2.25) have been chosen so that ck ≥ 0. We know that these phases have no physical relevance, so this choice cannot affect our result. Now, note the following chain of inequalities: CH (φ̂) ≥ 2 − 2 n Z X eiφ̂x ck+1 ck (Ex )k+1,k dm x k=0 ≥2−2 ≥2−2 n Z X k=0 n X ck+1 ck |(Ex )k+1,k | dm x Z q q ck+1 ck (Ex )k,k (Ex )k+1,k+1 dm x k=0 ≥2−2 n X sZ ck+1 ck Z (Ex )k,k dm x (Ex )k+1,k+1 dm x k=0 =2−2 n X ck+1 ck . (2.28) k=0 In the first line we have used that |<(z)| ≤ |z| for any z ∈ C. The triangle inequality led to the second line. The positivity condition Ex ≥ 0 implies |(Ex )k+1,k |2 ≤ (Ex )k,k (Ex )k+1,k+1 , (Ex )k,k ≥ 0, (Ex )k+1,k+1 ≥ 0, which enabled us to write the third inequality. Schwarz E. Bagan 33 2 QUANTUM ESTIMATION 2.2 Bayesian approach R inequality led to the forth line, and finally, the POVM condition Ex dm x = 1 enabled us to get rid of the POVM operators in the last line and got an absolute bound. Attainability is shown by providing an explicit measurement that saturates the bound, e.g., Eφ̂ = Uφ̂ |Φ0 ihΦ0 |Uφ̂† , where n X |Φ0 i = |ki k=0 (note |Φ0 i is not normalized). Exercise 2.5. Check that the set of operators {Eφ̂ | φ̂ ∈ [0, 2π)} defines a (continuous) POVM. Show that it saturates the bound in Eq. (2.28). 
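As a numerical companion to Exercise 2.5 (whose analytic solution follows), the sketch below works directly in the (n + 1)-dimensional span of the states |k⟩ of Eq. (2.25). It checks that the operators Eφ̂ integrate to the identity and that, for a real reference state with coefficients ck, the average Holevo cost of this covariant measurement equals 2 − 2 Σk ck+1 ck, i.e., it saturates the bound (2.28). Python with numpy is assumed; n = 4 and the uniform choice of ck are arbitrary illustrative values.

```python
import numpy as np

n = 4                                        # number of qubits (illustrative)
d = n + 1                                    # dimension of span{|k>}, k = 0, ..., n
c = np.ones(d) / np.sqrt(d)                  # a real reference state, sum_k c_k^2 = 1

grid = np.arange(4096) * 2 * np.pi / 4096    # uniform grid on [0, 2 pi)
k = np.arange(d)
U = np.exp(-1j * np.outer(grid, k))          # row phi -> diagonal entries of U_phi, Eq. (2.24)

# POVM completeness: (1/2 pi) * integral d(phi-hat) U |Phi_0><Phi_0| U^dagger = identity
E = np.einsum('pa,pb->ab', U, U.conj()) / grid.size
print("completeness residual :", np.abs(E - np.eye(d)).max())

# Average Holevo cost: p(delta) = |sum_k c_k e^{-i k delta}|^2 depends only on delta = phi - phi-hat,
# so C_H = (1/2 pi) * integral d(delta) 4 sin^2(delta/2) p(delta).
amp = U @ c
cost = np.mean(4 * np.sin(grid / 2) ** 2 * np.abs(amp) ** 2)
print("average cost          :", cost)
print("2 - 2 sum c_{k+1} c_k :", 2 - 2 * np.sum(c[1:] * c[:-1]))
```

The last two numbers coincide, in agreement with the saturation argument worked out analytically below.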
The operators Eφ̂ are manifestly positive, so we only need to check that they add up to the identity operator: Z Z dφ̂ dφ̂ Eφ̂ = Uφ̂ |Φ0 ihΦ0 |Uφ̂† . 2π 2π From Eq. (2.24) we have Z 0 2π n n XX dφ̂ |kihk| ⊗ |lihl| Uφ̂ ⊗ Uφ̂† = 2π k=0 l=0 e−i(k−l)φ̂ 0 hence Z dφ̂ E = 2π φ̂ n 2π Z dφ̂ X |kihk| ⊗ |kihk|, = 2π k=0 n Z X dφ̂ Uφ̂ |Φ0 ihΦ0 |Uφ̂† = |hk|Φ0 i|2 |kihk|. 2π k=0 But hk|Φ0 i = hk| n X ! |li = hk|ki = 1. l=0 Substituting this in the previous equation we get Z n X dφ̂ Eφ̂ = |kihk| = 1. 2π k=0 So the set {Eφ̂ | φ̂ ∈ [0, 2π)} defines a proper POVM. Let us now check that it saturates de bound. For our POVM Eq. (2.27) can be written as ( n ) Z 2π X dφ̂ iφ̂ CH (φ̂) = 2 − 2< ck+1 ck e (Eφ̂ )k+1,k . 2π 0 k=0 E. Bagan 34 2 QUANTUM ESTIMATION 2.2 Bayesian approach Let us compute the integral: ! Z 2π Z 2π dφ̂ iφ̂ dφ̂ iφ̂ † e (Eφ̂ )k+1,k = hk + 1| e Uφ̂ |Φ0 ihΦ0 |Uφ̂ |ki 2π 2π 0 0 ! n X X = hk + 1| |l + 1ihl + 1|Φ0 ihΦ0 |lihl| |ki l l=0 = hk + 1|Φ0 ihΦ0 |ki = 1 [we have used the hermitian conjugate of Eq. (2.26) in the second line]. Hence ( n ) n X X CH (φ̂) = 2 − 2< ck+1 ck = 2 − 2 ck+1 ck . k=0 k=0 Our last task is to minimize the cost function CH (φ̂). Since ck are the components of the referenceP state |ψ0 i in the basis {|ki}nk=0 and they are real (because of our choice of phases) we have nk=0 c2k = 1. So, 2 −1 0 0 ··· 0 0 c0 −1 2 −1 0 ··· 0 0 0 −1 2 −1 . . . 0 0 c1 . (2.29) CH (φ̂) ≥ c0 c1 . . . cn .. . . . . . .. .. .. .. .. .. . . 0 0 0 · · · −1 2 −1 cn 0 0 0 ··· 0 −1 2 The minimum of this quadratic form is given by the minimum eigenvalue of the symmetric matrix on the right hand side. Exercise 2.6. Let Mij and {ck }nk=1 real PnM be2 a n × n symmetric matrix with entries P n coefficients so that k=1 ck = 1. Show the the minimum of C = i,j=1 cj Mjk ck is the minimum eigenvalue of M . Diagonalizing the matrix in (2.29) is an easy task that surely enough you have accomplished in your classical mechanics courses when dealing with coupled oscillators, as it shows up when a chain of n + 1 equal masses are connected with springs of equal strength (the loaded string). The eigenvalues are there computed to be1 jπ 2 λj = 4 sin , j = 1, 2, . . . , n + 1. 2(n + 2) By borrowing this result we obtain that the minimum cost is π min 2 CH (φ̂) = 4 sin . 2(n + 2) 1 See, e.g., Classical Dynamics of particles and systems (fifth edition), Stephen T. Thornton and Jerry B. Marion, Brooks/Cole (2004) E. Bagan 35 2 QUANTUM ESTIMATION 2.2 Bayesian approach Exercise 2.7. Find the optimal reference state |ψ0 i. Namely, compute the coefficients ck for which the minimum cost is attained. Quite remarkably this average cost is exact for all n ! One can easily verify that min CH (φ̂) π2 ∼ 2 n as n → ∞. Since in average φ̂ is very close to φ asymptotically, we have min MSE(θ̂) ≈ CH (φ̂) ∼ π2 n2 as n → ∞. This tell as that the protocol consisting in preparing a system of n qubits in the state |ψ0 i and performing the measurement defined by {Eφ̂ | 0 ≤ φ̂ < 2π}, whose outcome is the estimate, φ̂, has Heisenberg-limited sensitivity. The factor π 2 is the price we pay for using a global estimator. Bibliography [1] Probability and Statistics II. Notes by G. Deligiannidis. [2] Quantum estimation for quantum technology, M. G. A. Paris, arXiv:0804.2981v3 [3] Precision bounds in noisy quantum metrology, J. Kolodynski, arXiv:1409.0535v2 [4] Nonlinear quantum metrology, S. Boixo, PhD thesis, University of New Mexico (2008). E. Bagan 36