Quantum Parameter Estimation∗
Emili Bagan
Grup d’Informació Quàntica (GIQ)
Departament de Fı́sica
Universitat Autònoma de Barcelona
November 2021
1 Classical Estimation

1.1 Introduction
We will assume that the probability mass function, PMF (respectively, the probability density function, PDF), of a discrete (continuous) random variable, X, depends on a parameter, θ, or on a set of parameters (a parameter vector), θ = (θ1, θ2, . . . , θp). These notes are mainly devoted to the single-parameter case.
Recall:
• PMF: pX (xk ) such that pX (xk ) := Pr(X = xk ) = pk , for all values of k.
• PDF: $f_X(x)$ such that $\int_{-\infty}^{x} f_X(x')\,dx' = \Pr(X \le x) =: F_X(x)$.
The function $F_X(x)$ is called the distribution function or the cumulative distribution function (CDF) of the random variable X. Note that $F_X'(x) = f_X(x)$. Unless it is necessary for clarity, we will often drop the subscript X and simply write f(x), F(x), and so on. In these notes we (mostly) focus on the continuous case, but one can check that the results we will derive hold also for discrete random variables by replacing the PDFs, $f(x;\theta)$, by the PMFs, $p(x;\theta)$, and $\int \cdot\,dx \to \sum_x$.
∗ Master in Quantum Science and Technology. Quantum Theory: Quantum Statistical Inference.
Example 1.1. We say that X is normally distributed, X ∼ N(µ, σ 2 ), if its PDF is
$$f(x;\theta) = f(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left\{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\},$$
where µ = E(X) is the mean and σ 2 = var(X) = E[(X − µ)2 ] = E(X 2 ) − [E(X)]2 is the
variance. We may view the mean and the variance as parameters: θ = (µ, σ 2 ).
Recall that E stands for expectation value:
$$E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx.$$
Often throughout these notes we will use boldface to denote a collection or vector of random variables, X = (X1, . . . , Xn). Likewise, x will also be used to denote the corresponding vector of outcomes/observations, x = (x1, . . . , xn). The joint PDF will be denoted by fX(x). Hence, e.g., the expectation value of g(X) will be
$$E[g(\mathbf{X})] = \int g(\mathbf{x})\,f_{\mathbf{X}}(\mathbf{x})\,d^n x,$$
where $d^n x = dx_1\,dx_2\cdots dx_n$.
The aim of (parameter) estimation is to accurately determine the value of θ from observations, i.e., from a set x of outcomes or realizations of the random variable X. We will refer to this set as the sample.
The relevance of estimation for quantum information should be obvious. A quantum state ρ is just a collection of parameters (its independent entries ρab, for instance) that describe our knowledge about a system. To emphasize this fact, we could write ρθ instead of just ρ. In
order to have a precise mathematical description of the state, an accurate estimation of these
parameters θ is required. We can only perform measurements on the system to reveal this
information. According to quantum mechanics, their outcomes are random variables, whose
probability distributions are given by the Born rule, p(x; θ) = tr(ρθ Ex ), where {Ex } is a
collection of operators defining a positive operator-valued measure (POVM) and characterizing the measurement. The classical estimation toolbox, which we are about to introduce,
provides us with the means to optimally extract θ from our measurement data. Note that
the distribution p(x; θ) also depends on our choice of measurement. But what is the best
measurement we can perform on a system to estimate θ? The aim of quantum estimation
is to provide means to answer this question.
In a more complex scenario, we may wish to characterize the action of a channel Cθ . To
do so, we may feed the channel with a system prepared in a fiducial or reference state, ρ0 ,
and perform a measurement on the output state ρθ = Cθ (ρ0 ). For a fixed ρ0 and a fixed
measurement, classical estimation will provide us with the tools to obtain the most precise
determination of the unknown θ.
There are several approaches to classical estimation. We will focus on two, which we will
refer to as frequentist approach and Bayesian approach.
1.2 Frequentist approach
Within the frequentist approach the estimated parameter is assumed to be a deterministic
variable with a fixed value.
Definition 1.1 Given a sample of random variables (possible outcomes) X = (X1 , . . . , Xn ),
a statistic Y is a known function of the sample
Y = f (X) .
When the statistic is used to estimate the value of a parameter (vector) θ then it is also
called a point estimate, or estimator and it is usually denoted by θ̂.
Note that a statistic is a random variable itself.
In these notes we will always assume that Xi are independent and identically distributed
(commonly abbreviated i.i.d.).
Example 1.2. It is well known that the sample mean (average),
$$\bar X := \frac{1}{n}\sum_{i=1}^{n}X_i,$$
is a “good” estimator of µ. Likewise,
$$S^2 := \frac{1}{n-1}\sum_{i=1}^{n}\big(X_i-\bar X\big)^2$$
is a “good” estimator of the variance. Both X̄ and S² are statistics, since they just depend on X1, X2, . . . , Xn.
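To make Example 1.2 concrete, here is a minimal numerical sketch (Python with NumPy; the values µ = 1.5, σ = 2 and n = 1000 are arbitrary choices for illustration, not taken from the notes) that computes X̄ and S² on simulated data.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 2.0, 1000          # illustrative values, not from the notes

x = rng.normal(mu, sigma, size=n)      # sample x_1, ..., x_n of X ~ N(mu, sigma^2)

x_bar = x.mean()                       # sample mean, estimator of mu
s2 = x.var(ddof=1)                     # sample variance with the 1/(n-1) factor

print(f"sample mean = {x_bar:.3f}  (true mu      = {mu})")
print(f"sample var  = {s2:.3f}  (true sigma^2 = {sigma**2})")
```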
To give a precise meaning to “good” above, we need to discuss some properties of the
estimators.
Definition 1.2 (Bias) The bias of an estimator θ̂ of a parameter θ is defined as
Bias(θ̂) = E(θ̂ − θ).
If Bias(θ̂) = 0 then we say that the estimator is unbiased.
So, if an estimator is unbiased, on average it gives the right estimate, which is, of course, a desirable property.
Example 1.3. The sample mean X̄ is an unbiased estimator of the distribution mean µ, since
$$E(\bar X) = \frac{1}{n}\sum_{i=1}^{n}E(X_i) = E(X) = \mu$$
by (obvious) linearity of E. Likewise, S 2 is unbiased.
Exercise 1.1. Show that S 2 is an unbiased estimator of var(X).
We must show that E(S 2 ) = var(X).
$$\sum_{i=1}^{n}(X_i-\bar X)^2 = \sum_{i=1}^{n}(X_i-\mu)^2 - \sum_{i=1}^{n}(\bar X-\mu)^2 - 2\sum_{i=1}^{n}(X_i-\bar X)(\bar X-\mu)$$
$$= \sum_{i=1}^{n}(X_i-\mu)^2 - n(\bar X-\mu)^2 - 2(\bar X-\mu)\Big(\sum_{i=1}^{n}X_i - n\bar X\Big) = \sum_{i=1}^{n}(X_i-\mu)^2 - n(\bar X-\mu)^2.$$
We next take expectation values and recall that µ = E(X) = E(X̄):
$$(n-1)\,E(S^2) = n\,\mathrm{var}(X) - n\,\mathrm{var}(\bar X) = (n-1)\,\mathrm{var}(X),$$
where we have used that
$$\mathrm{var}(\bar X) = \frac{1}{n^2}\,\mathrm{var}\Big(\sum_{i=1}^{n}X_i\Big) = \frac{1}{n}\,\mathrm{var}(X).$$
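A quick Monte Carlo check of this result (a sketch with arbitrary illustrative parameters): the 1/(n−1) estimator S² averages to var(X), while the 1/n version underestimates it by the factor (n−1)/n.

```python
import numpy as np

# Monte Carlo check that S^2 (1/(n-1) normalization) is unbiased for var(X),
# while the 1/n version underestimates it by a factor (n-1)/n.
rng = np.random.default_rng(1)
mu, sigma, n, reps = 0.0, 1.0, 5, 200_000   # small n makes the bias visible

samples = rng.normal(mu, sigma, size=(reps, n))
s2_unbiased = samples.var(axis=1, ddof=1)   # S^2
s2_mle = samples.var(axis=1, ddof=0)        # 1/n version (the MLE for a normal)

print("E[S^2]       ~", s2_unbiased.mean())   # ~ 1.0 = sigma^2
print("E[(1/n) sum] ~", s2_mle.mean())        # ~ (n-1)/n = 0.8
```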
The estimates obtained from our samples will always be subject to errors, so we need to quantify them in a suitable way.
Definition 1.3 (Mean square error) The mean square error of an estimator θ̂ is
$$\mathrm{MSE}(\hat\theta) = E\big[(\hat\theta-\theta)^2\big].$$
One can immediately check that
$$\mathrm{MSE}(\hat\theta) = \mathrm{var}(\hat\theta) + \mathrm{Bias}(\hat\theta)^2. \qquad (1.1)$$
Exercise 1.2. Check that Eq. (1.1) holds.
$$E\big[(\hat\theta-\theta)^2\big] = E\Big[\big(\hat\theta - E(\hat\theta) + E(\hat\theta) - \theta\big)^2\Big]$$
$$= \mathrm{var}(\hat\theta) + \big[E(\hat\theta)-\theta\big]^2 + 2\big[E(\hat\theta)-\theta\big]\,E\big[\hat\theta - E(\hat\theta)\big]$$
$$= \mathrm{var}(\hat\theta) + \big[E(\hat\theta-\theta)\big]^2 + 2\,E(\hat\theta-\theta)\times 0$$
$$= \mathrm{var}(\hat\theta) + \mathrm{Bias}(\hat\theta)^2.$$
A good estimator is one that has a small MSE. If it is unbiased, this is tantamount to having a small variance. Notice that var(θ̂) can be determined from the data, whereas MSE(θ̂) cannot, since in practical applications θ is, of course, unknown.
Often the goodness of an estimator (or rather of a sequence of estimators) improves as the
sample size, n, increases. The next definition captures this idea.
Definition 1.4 (Consistency) A sequence of statistics (Yn, n ∈ N) is said to be a consistent estimate of a parameter θ if for every ε > 0
$$\lim_{n\to\infty}\Pr\big(|Y_n-\theta|\le\varepsilon\big) = 1.$$
Equivalently, we may write the condition as
$$\lim_{n\to\infty}Y_n = \theta \quad\text{(in probability)}.$$
The sequence {Yn ; n ∈ N} could, according to our notation, be denoted by θ̂n , and we will
often (but not always) do so, particularly if we want to emphasize that each yn is an estimate
of θ and also indicate that it is based on a sample of size n.
Exercise 1.3. Show that if Yn is a consistent estimator of θ with E(Yn²) < ∞, then
$$\lim_{n\to\infty}E(Y_n-\theta) = 0.$$
We first note that there exists a finite C such that for all n
$$E\big[(Y_n-\theta)^2\big] \le E(Y_n^2) + \theta^2 + 2|\theta|\,|E(Y_n)| \le E(Y_n^2) + \theta^2 + 2|\theta|\sqrt{E(Y_n^2)} = \Big[|\theta| + \sqrt{E(Y_n^2)}\Big]^2 \le C,$$
where in the second inequality we have used that
$$E(|Y_n|) \le \sqrt{E(Y_n^2)}$$
(as follows immediately from Jensen's inequality). We next use the Cauchy–Schwarz inequality,
$$\big[E(|XY|)\big]^2 \le E(X^2)\,E(Y^2),$$
to get that, for any ε > 0,
$$\big[E\big(|Y_n-\theta|\,\mathbf{1}\{|Y_n-\theta|\ge\varepsilon/2\}\big)\big]^2 \le E\big(|Y_n-\theta|^2\big)\,E\big(\mathbf{1}\{|Y_n-\theta|\ge\varepsilon/2\}\big) = E\big[(Y_n-\theta)^2\big]\,\Pr\big(|Y_n-\theta|\ge\varepsilon/2\big) \le C\,\Pr\big(|Y_n-\theta|\ge\varepsilon/2\big),$$
where 1{· · · } is the indicator function. With this,
$$E\big(|Y_n-\theta|\big) = E\big[|Y_n-\theta|\,\mathbf{1}\{|Y_n-\theta|<\varepsilon/2\}\big] + E\big[|Y_n-\theta|\,\mathbf{1}\{|Y_n-\theta|\ge\varepsilon/2\}\big] \le \frac{\varepsilon}{2} + \sqrt{C\,\Pr\big(|Y_n-\theta|\ge\varepsilon/2\big)}. \qquad (1.2)$$
But
$$\lim_{n\to\infty}Y_n = \theta \ \text{(in probability)} \quad\Rightarrow\quad \lim_{n\to\infty}\Pr\big(|Y_n-\theta|\ge\varepsilon/2\big) = 0.$$
This implies that for any ε > 0 there exists N ∈ N such that Pr(|Yn − θ| ≥ ε/2) < ε²/(4C) provided n > N. Hence, from Eq. (1.2) we have
$$\big|E(Y_n-\theta)\big| \le E\big(|Y_n-\theta|\big) < \frac{\varepsilon}{2} + \sqrt{C\,\frac{\varepsilon^2}{4C}} = \varepsilon,$$
which means that
$$\lim_{n\to\infty}E(Y_n-\theta) = 0.$$
It is interesting to note that the claim of the exercise ceases to be true if we drop the
condition E(Yn2 ) < ∞.
Exercise 1.4. Consider the sequence of random variables Yn with probability distribution
$$p_{Y_n}(y;\theta) = \begin{cases}\dfrac{n-1}{n} & \text{if } y=\theta,\\[4pt] \dfrac{1}{n} & \text{if } y=\theta+n,\\[4pt] 0 & \text{otherwise.}\end{cases}$$
1. Show that $\lim_{n\to\infty}\Pr(|Y_n-\theta|\le\varepsilon)=1$, but $\lim_{n\to\infty}E(Y_n-\theta)\ne 0$.
2. Modify the distribution slightly to show that $\lim_{n\to\infty}\Pr(|Y_n-\theta|\le\varepsilon)=1$ does not necessarily imply $\lim_{n\to\infty}\mathrm{var}(Y_n)=0$, even if $\lim_{n\to\infty}E(Y_n-\theta)=0$ and $E(Y_n^2)<\infty$.
(a) If 0 < ε < 1,
$$\Pr(|Y_n-\theta|\le\varepsilon) = \Pr(Y_n=\theta) = p_{Y_n}(\theta;\theta) = \frac{n-1}{n},$$
hence $\lim_{n\to\infty}\Pr(|Y_n-\theta|\le\varepsilon)=1$. However,
$$E(Y_n-\theta) = (\theta-\theta)\,p_{Y_n}(\theta;\theta) + (\theta+n-\theta)\,p_{Y_n}(\theta+n;\theta) = n\cdot\frac{1}{n} = 1 \ne 0.$$
(b) Consider
$$p_{Y_n}(y;\theta) = \begin{cases}\dfrac{n^2-1}{n^2} & \text{if } y=\theta,\\[4pt] \dfrac{1}{n^2} & \text{if } y=\theta+n,\\[4pt] 0 & \text{otherwise.}\end{cases}$$
Then
$$E(Y_n-\theta) = (\theta-\theta)\,p_{Y_n}(\theta;\theta) + (\theta+n-\theta)\,p_{Y_n}(\theta+n;\theta) = n\cdot\frac{1}{n^2} \to 0,$$
$$E(Y_n^2) = \theta^2\,\frac{n^2-1}{n^2} + (\theta+n)^2\,\frac{1}{n^2} = \theta^2 + \frac{2\theta}{n} + 1 \to \theta^2+1,$$
so that var(Yn) = E(Yn²) − [E(Yn)]² → 1 ≠ 0 even though E(Yn − θ) → 0.
Note that, as a consequence of the result of Exercise 1.3, any consistent estimator is asymptotically unbiased, in the sense that limn→∞ E(θ̂n) = θ. Hence, although biased estimators may lead to improved precision, they may be ignored in the n → ∞ limit, for which the frequentist approach is really designed.
In dealing with consistency it might be useful to introduce the famous law of large numbers
as follows.
Theorem 1.5 (Law of Large Numbers) Suppose that {Xi, i ≥ 0} are i.i.d. with finite mean µ and variance σ². Then
$$\hat\mu_n := \frac{1}{n}\sum_{i=1}^{n}X_i$$
is a consistent estimator of the mean. In other words, for all ε > 0,
$$\Pr\big(|\hat\mu_n-\mu|>\varepsilon\big) \to 0 \quad\text{as } n\to\infty.$$
The Law of Large Numbers follows from Chebyshev’s Inequality.
Proposition 1.6 (Chebyshev’s Inequality) Let Y be a random variable with finite mean µ and variance σ². Then for any k > 0,
$$\Pr\big(|Y-\mu|\ge k\sigma\big) \le \frac{1}{k^2}.$$
This proposition in turn follows from Markov’s Inequality:
Theorem 1.7 (Markov’s Inequality) Let X > 0 be a random variable such that E(X) < ∞, and let c > 0 be a constant. Then
$$\Pr(X>c) \le \frac{E(X)}{c}.$$
Exercise 1.5. Prove Markov’s Inequality and the following slightly more general statement: if X > 0, k ≥ 1 and E(X^k) < ∞, then
$$\Pr(X>c) = \Pr\big(X^k>c^k\big) \le \frac{E(X^k)}{c^k}.$$
$$\Pr(X>c) = \int_c^\infty f_X(x)\,dx \le \int_c^\infty \frac{x^k}{c^k}\,f_X(x)\,dx \le \int_0^\infty \frac{x^k}{c^k}\,f_X(x)\,dx = \frac{E(X^k)}{c^k}.$$
Exercise 1.6. Prove Chebyshev’s Inequality.
Exercise 1.7.
By using Chebyshev’s inequality, show that if limn→∞ var(θ̂n ) = 0 then
asymptotic unbiasedness implies consistency.
Exercise 1.8. Prove the Law of Large Numbers.
In addition to consistency and unbiasedness, a good estimator should have a small mean
square error which for unbiased estimators is just the variance. This motivates the following
Definition 1.8 (Relative Efficiency) Given two estimators θ̂1 and θ̂2 of a parameter θ, the efficiency of θ̂1 relative to θ̂2 is denoted by eff(θ̂1, θ̂2) and is defined as
$$\mathrm{eff}\big(\hat\theta_1,\hat\theta_2\big) = \frac{\mathrm{MSE}\big(\hat\theta_2\big)}{\mathrm{MSE}\big(\hat\theta_1\big)}.$$
For unbiased estimators it is equivalent to
$$\mathrm{eff}\big(\hat\theta_1,\hat\theta_2\big) = \frac{\mathrm{var}\big(\hat\theta_2\big)}{\mathrm{var}\big(\hat\theta_1\big)}.$$
Exercise 1.9. Let X1, . . . , Xn ∼ U(0, M), the uniform distribution on (0, M). Consider the estimators
$$\hat\theta_1 := \bar X = \frac{1}{n}\sum_{i=1}^{n}X_i, \qquad \hat\theta_2 := c_n\,\frac{M_n}{2},$$
where
$$M_n := \max\{X_i\}_{i=1}^{n}$$
and cn is some judiciously chosen n-dependent normalization coefficient. Give cn so that θ̂2 is unbiased. Then compute eff(θ̂1, θ̂2).
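A Monte Carlo sketch related to this exercise (it does not replace the analytic computation): below we assume that the quantity being estimated is the distribution mean M/2, which is what θ̂1 targets, and we take cn = (n+1)/n as the normalization, since E[Mn] = nM/(n+1) for the maximum of n draws from U(0, M); the printed ratio should come out close to 3/(n+2).

```python
import numpy as np

# Monte Carlo sketch for Exercise 1.9 (illustrative only).
# Assumption: the target is the mean M/2 (what theta_1 = X-bar estimates);
# c_n = (n+1)/n then makes theta_2 unbiased, since E[M_n] = n M / (n+1).
rng = np.random.default_rng(2)
M, n, reps = 3.0, 20, 100_000

x = rng.uniform(0.0, M, size=(reps, n))
theta1 = x.mean(axis=1)                          # sample mean
theta2 = (n + 1) / n * x.max(axis=1) / 2.0       # rescaled maximum

print("bias theta1 :", theta1.mean() - M / 2)
print("bias theta2 :", theta2.mean() - M / 2)
print("eff(theta1, theta2) = var(theta2)/var(theta1) ~",
      theta2.var() / theta1.var())
```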
At this point, the question arises as to how to construct consistent, unbiased and efficient estimators. Before we attempt to answer this question we still need to give a few definitions.
Definition 1.9 (Likelihood function for discrete PD) Suppose X = (X1, . . . , Xn) are discrete random variables whose distribution depends on a parameter (vector) θ and which have probability mass function
$$p(\mathbf{x};\theta) := \Pr(\mathbf{X}=\mathbf{x};\theta),$$
where x = (x1, . . . , xn) are the sample observations. The likelihood of the parameter (vector) θ given the observations x is denoted by L(θ | x) and is defined to be
$$L(\theta\,|\,\mathbf{x}) := p(\mathbf{x};\theta),$$
that is, the joint probability mass function viewed as a function of the parameter (vector) θ.
Definition 1.10 (Likelihood function for continuous PD) Suppose X = (X1, . . . , Xn) are jointly continuous random variables whose distribution depends on a parameter (vector) θ and which have PDF f(x; θ). The likelihood of the parameter (vector) θ given the observations x is denoted by L(θ | x) and is defined to be
$$L(\theta\,|\,\mathbf{x}) := f(\mathbf{x};\theta),$$
that is, the joint density evaluated at the observations, viewed as a function of the parameter (vector) θ.
Definition 1.11 (Log-likelihood function) The log-likelihood function of the parameter (vector) θ given the observations x is defined as
l(θ) = l (θ | x) = log L (θ | x) ,
where L (θ | x) is the corresponding likelihood function.
If X1, . . . , Xn are i.i.d. and Xi ∼ f(· ; θ), then
$$L(\theta\,|\,\mathbf{x}) = \prod_{i=1}^{n}f(x_i;\theta); \qquad l(\theta\,|\,\mathbf{x}) = \sum_{i=1}^{n}\log\big[f(x_i;\theta)\big],$$
and similarly for discrete random variables.
We think of the likelihood function as a function of θ, and we treat the observations as fixed.
Sometimes we will drop the observations and simply write L(θ) and l(θ).
Definition 1.12 (Maximum Likelihood Estimator) Suppose that a sample x = (x1, . . . , xn) has likelihood function L(θ) = L(θ | x), depending on a parameter (vector) θ. Then a maximum likelihood estimator (MLE) θ̂MLE is the value of the parameters that maximizes L(θ), if a maximum exists. In other words,
$$\hat\theta_{\mathrm{MLE}} = \arg\max_{\theta} L(\theta\,|\,\mathbf{x}) = \arg\max_{\theta} l(\theta\,|\,\mathbf{x}).$$
The maximum of L(θ) may not exist, in which case the MLE cannot be constructed. The
maximum, if it exists, may not be unique, in which case we will obtain several MLEs. Note
that these are not the values of the parameters that are most likely, given the data. To start
with, θ is not a random variable in the frequentist approach we are discussing!
Theorem 1.13 (Invariance of MLE) Suppose that θ̂ is the MLE for a parameter θ and let
t(·) be a strictly monotone function of θ. Then
(t(θ))MLE = t(θ̂),
i.e., the MLE of t(θ) is t(θ̂).
Exercise 1.10. Consider the exponential distribution f (x | λ) = λe−λx . Suppose we take a
sample of size n. Show that the MLE of λ is λ̂ = 1/X̄.
The likelihood is
$$L(\lambda\,|\,x_1,\ldots,x_n) = \prod_{i=1}^{n}\lambda e^{-\lambda x_i} = \lambda^n\exp\Big(-\lambda\sum_{i=1}^{n}x_i\Big) = \lambda^n\exp(-n\lambda\bar x).$$
Then
$$l = n\log\lambda - n\lambda\bar x,$$
and so
$$\frac{d\,l}{d\lambda} = \frac{n}{\lambda} - n\bar x.$$
Thus L has a unique maximum at λ̂ = 1/x̄, and this is therefore the maximum likelihood estimator of λ.
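As a numerical sanity check (a sketch with an arbitrary true rate λ = 2.5), one can maximize the exponential log-likelihood directly and compare with the closed form 1/x̄:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Maximize l(lambda) = n log(lambda) - lambda * sum(x) numerically and compare
# with the closed-form MLE 1/x-bar of Exercise 1.10.
rng = np.random.default_rng(3)
lam_true, n = 2.5, 500
x = rng.exponential(1.0 / lam_true, size=n)

def neg_log_lik(lam):
    return -(n * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
print("numerical MLE :", res.x)
print("1 / x-bar     :", 1.0 / x.mean())
```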
Exercise 1.11. Find the maximum likelihood estimates of the parameters of the normal
distribution. Note that the MLE of σ 2 is a biased estimator.
$$l = -\frac{n}{2}\log(2\pi) - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2.$$
Differentiating with respect to each parameter and setting equal to zero:
$$\frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0; \qquad -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i-\mu)^2 = 0.$$
It follows that
$$\hat\mu = \frac{1}{n}\sum_{i}x_i, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i}(x_i-\bar x)^2.$$
We note that
$$\hat\sigma^2 = \frac{n-1}{n}\,S^2.$$
Since S 2 is unbiased, σ̂ 2 must be biased.
Proposition 1.14 (the MLE of i.i.d. observations is consistent) Let {X1, . . . , Xn} be a sequence of i.i.d. observations, Xk ∼ f(x; θ). Then the MLE of θ is consistent.
Let x1 , . . . , xn be a sample drawn from a population with PDF fθ (x). When the sample is
used to estimate the parameter θ, an obvious question arises: What is the lowest variance
we can achieve?
Definition 1.15 (Fisher Information) Let X ∼ f(· ; θ). Then the Fisher information is given by
$$I_n(\theta) := n\,E\!\left[\left(\frac{\partial\,l(\theta\,|\,x)}{\partial\theta}\right)^{2}\right],$$
where l(θ | x) is the log-likelihood.
It can be shown that, if the second partial derivative exists, then we also have
$$I_n(\theta) = -n\,E\!\left[\frac{\partial^2\,l(\theta\,|\,x)}{\partial\theta^2}\right].$$
Exercise 1.12. Prove this last statement.
$$E\!\left[\frac{\partial^2 l(\theta\,|\,x)}{\partial\theta^2}\right] = \int f(x;\theta)\,\partial_\theta^2\log f(x;\theta)\,dx = \int f(x;\theta)\,\partial_\theta\!\left[\frac{\partial_\theta f(x;\theta)}{f(x;\theta)}\right]dx = \int f(x;\theta)\left\{\frac{\partial_\theta^2 f(x;\theta)}{f(x;\theta)} - \frac{[\partial_\theta f(x;\theta)]^2}{f^2(x;\theta)}\right\}dx,$$
where we have used the obvious notation ∂θ := ∂/∂θ. The first term in the integral vanishes,
since
$$\int f(x;\theta)\,\frac{\partial_\theta^2 f(x;\theta)}{f(x;\theta)}\,dx = \int\partial_\theta^2 f(x;\theta)\,dx = \partial_\theta^2\!\int f(x;\theta)\,dx = \partial_\theta^2\,1 = 0.$$
Finally, the second term can be written as
$$-\int f(x;\theta)\left[\frac{\partial_\theta f(x;\theta)}{f(x;\theta)}\right]^2 dx = -\int f(x;\theta)\,\big[\partial_\theta l(\theta\,|\,x)\big]^2\,dx = -\,E\!\left[\left(\frac{\partial l(\theta\,|\,x)}{\partial\theta}\right)^{2}\right].$$
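A tiny Monte Carlo illustration of the two equivalent forms of the Fisher information, using the exponential model of Exercise 1.10 (the value λ = 2 is an arbitrary choice; for a single observation both expressions equal 1/λ²):

```python
import numpy as np

# Check E[(d l/d lambda)^2] = -E[d^2 l/d lambda^2] = 1/lambda^2 for a single
# exponential observation, where l(lambda|x) = log(lambda) - lambda x.
rng = np.random.default_rng(4)
lam, reps = 2.0, 1_000_000
x = rng.exponential(1.0 / lam, size=reps)

score = 1.0 / lam - x                    # d l / d lambda
second = -1.0 / lam**2 * np.ones(reps)   # d^2 l / d lambda^2 (constant here)

print("E[score^2]     ~", np.mean(score**2))
print("-E[d2l/dlam2]  ~", -np.mean(second))
print("1 / lambda^2   =", 1.0 / lam**2)
```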
Theorem 1.16 (Cramér–Rao bound) Let X = (X1, . . . , Xn) be i.i.d. with probability density function f(x; θ). Let θ̂n = g(X) be an unbiased estimator of θ, such that the support of g(X) (the region for which the probability is not zero) does not depend on θ. Then under mild conditions we have that
$$\mathrm{var}\big(\hat\theta_n\big) \ge \frac{1}{I_n(\theta)}.$$
The proof relies on the following very general theorem/definition.
Theorem 1.17 Let X and Y be two random variables. Then their correlation coefficient, defined as
$$\mathrm{corr}(X,Y) := \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}},$$
satisfies
$$-1 \le \mathrm{corr}(X,Y) \le 1.$$
We recall that cov(X, Y) = E(XY) − E(X)E(Y) and cov(X, X) = var(X). The content of this theorem is that cov actually obeys the Cauchy–Schwarz inequality.
Exercise 1.13. Prove Theorem 1.17. Hint: first check that linearity of the expectation, E, implies corr(aX + c, bY + c′) = corr(X, Y); hence it suffices to prove the theorem assuming E(X) = E(Y) = 0 and var(X) = var(Y) = 1. Next, consider the trivial inequality 0 ≤ E[(X − λY)²], where λ ∈ R.
One can immediately check that from the very definition of E one has E(aX + c) = aE(X) + c. Then,
$$\mathrm{var}(aX+c) = E\big\{[aX+c-E(aX+c)]^2\big\} = a^2\,E\big\{[X-E(X)]^2\big\} = a^2\,\mathrm{var}(X),$$
and
$$E[(aX+c)(bY+c')] = E[ab\,XY + cb\,Y + c'a\,X + cc'] = ab\,E(XY) + cb\,E(Y) + c'a\,E(X) + cc' = ab\,[E(XY)-E(X)E(Y)] + [a\,E(X)+c]\,[b\,E(Y)+c'],$$
which proves that
$$\mathrm{cov}(aX+c,\,bY+c') = ab\,\mathrm{cov}(X,Y).$$
Hence, corr(aX + c, bY + c′) = corr(X, Y). We see that
$$X' = \frac{X-E(X)}{\sqrt{\mathrm{var}(X)}}, \qquad Y' = \frac{Y-E(Y)}{\sqrt{\mathrm{var}(Y)}} \quad\Rightarrow\quad \mathrm{corr}(X',Y') = \mathrm{corr}(X,Y),$$
with E(X′) = E(Y′) = 0 and var(X′) = var(Y′) = 1. This proves the first statement in the exercise.
Next, we have
$$0 \le E\big[(X-\lambda Y)^2\big] = \lambda^2 - 2\lambda\,E(XY) + 1.$$
For this to hold, the polynomial in λ on the right hand side can have at most one root, which implies that the discriminant must be non-positive: 1 ≥ [E(XY)]² = [corr(X, Y)]².
Proof of the Cramér–Rao bound (CRB). Consider the random variable W defined by
$$W = \partial_\theta\log f(\mathbf{X};\theta) = \frac{\partial_\theta f(\mathbf{X};\theta)}{f(\mathbf{X};\theta)},$$
where X = (X1, . . . , Xn), f(x; θ) is the joint PDF, $f(\mathbf{x};\theta)=\prod_{i=1}^{n}f(x_i;\theta)$, and ∂θ := ∂/∂θ. Hence
$$E(W) = \int\frac{\partial_\theta f(\mathbf{x};\theta)}{f(\mathbf{x};\theta)}\,f(\mathbf{x};\theta)\,d^n x = \int\partial_\theta f(\mathbf{x};\theta)\,d^n x = \frac{d}{d\theta}\int f(\mathbf{x};\theta)\,d^n x = 0,$$
under fairly general conditions that guarantee we can exchange differentiation and integration. Since E(W) = 0, we have that $\mathrm{cov}(W,\hat\theta_n) = E(W\hat\theta_n)$, thus
$$\mathrm{cov}\big(W,\hat\theta_n\big) = \int g(\mathbf{x})\,\frac{\partial_\theta f(\mathbf{x};\theta)}{f(\mathbf{x};\theta)}\,f(\mathbf{x};\theta)\,d^n x = \int g(\mathbf{x})\,\partial_\theta f(\mathbf{x};\theta)\,d^n x = \frac{d}{d\theta}\int g(\mathbf{x})\,f(\mathbf{x};\theta)\,d^n x = \frac{d\,E(\hat\theta_n)}{d\theta} = \frac{d\theta}{d\theta} = 1.$$
From Theorem 1.17 we have
$$1 \ge \big[\mathrm{corr}(W,\hat\theta_n)\big]^2 = \frac{\mathrm{cov}^2(W,\hat\theta_n)}{\mathrm{var}(W)\,\mathrm{var}(\hat\theta_n)},$$
which, since cov(W, θ̂n) = 1, implies that
$$\mathrm{var}\big(\hat\theta_n\big) \ge \frac{1}{\mathrm{var}(W)}.$$
Note that up to this point we have not used the i.i.d. condition; hence, the last bound holds in the general situation. Assuming now that X1, . . . , Xn are i.i.d., we know that the joint distribution is simply $f(\mathbf{x};\theta)=\prod_{i=1}^{n}f(x_i;\theta)$, therefore
$$W = \partial_\theta\sum_{i=1}^{n}\log f(X_i;\theta) = \sum_{i=1}^{n}\partial_\theta\log f(X_i;\theta) =: \sum_{i=1}^{n}W_i.$$
Using again the independence condition, we have
$$\mathrm{var}(W) = \sum_{i=1}^{n}\mathrm{var}(W_i) = n\,E\big\{[\partial_\theta\log f(X;\theta)]^2\big\} = n\,E\big\{[\partial_\theta l(\theta\,|\,x)]^2\big\} = I_n(\theta).$$
This completes the proof.
In particular, the first equality in the last line of the proof states that the FI is additive,
$$I_1^{(1,2)}(\theta) = I_1^{(1)}(\theta) + I_1^{(2)}(\theta),$$
for independent random variables with joint distribution $f_{X_1X_2}(x_1,x_2;\theta) = f_{X_1}(x_1;\theta)\,f_{X_2}(x_2;\theta)$. Being non-negative [as follows from (its very) Definition 1.15] and additive, the FI has the interpretation of an information measure. Its increase indicates that a higher precision is potentially achievable in parameter estimation. In particular, at a given θ0, In(θ0) = 0 proves that one cannot extract any information about the parameter from a sample, whereas divergent In(θ0) = ∞ implies that the true value θ0 can in principle be perfectly determined.
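As a quick illustration of the Cramér–Rao bound (a sketch with arbitrary parameter values): for a Bernoulli(p) sample the unbiased estimator p̂ = X̄ has variance p(1 − p)/n, which saturates 1/In(p) with In(p) = n/[p(1 − p)].

```python
import numpy as np

# Monte Carlo illustration of the Cramer-Rao bound for a Bernoulli(p) sample.
rng = np.random.default_rng(5)
p, n, reps = 0.3, 100, 200_000

x = rng.binomial(1, p, size=(reps, n))
p_hat = x.mean(axis=1)                 # unbiased estimator of p

print("var(p_hat) ~", p_hat.var())
print("1 / I_n(p) =", p * (1 - p) / n)
```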
Definition 1.18 (Efficiency) The efficiency of an unbiased estimator θ̂n of a parameter θ is defined as the ratio of the Cramér–Rao bound to the variance of θ̂n, that is,
$$\mathrm{eff}\big(\hat\theta_n\big) = \frac{1}{I_n(\theta)\,\mathrm{var}\big(\hat\theta_n\big)}.$$
An estimator which has unit efficiency [the maximum value eff(θ̂n) can take] is called efficient.
Exercise 1.14. Let X1, . . . , Xn be i.i.d. with PDF f(x; λ) = λe^{−λx}. In Exercise 1.10 it was shown that λ̂MLE = 1/X̄. Show now that λ̂MLE is not unbiased, whereas $\hat\lambda_n = (n-1)/\big(\sum_{i=1}^{n}X_i\big)$ is. Show that
$$\mathrm{eff}\big(\hat\lambda_n\big) = 1 - \frac{2}{n}.$$
This is less than unity, and hence it is not efficient. However, the efficiency approaches unity as n → ∞. In such cases we say that λ̂n is an asymptotically efficient estimator.
Let us compute E(λ̂MLE). We will do it by brute force:
$$E\big(\hat\lambda_{\mathrm{MLE}}\big) = \int_{\mathbb{R}_+^n}\frac{n}{\sum_{i=1}^{n}x_i}\,\lambda^n e^{-\lambda\sum_{i=1}^{n}x_i}\,d^n x.$$
Insert the identity
$$\int_0^\infty\delta\Big(\sum_{i=1}^{n}x_i - w\Big)\,dw = 1,$$
where δ(x) is the Dirac delta function (distribution). We have
$$E\big(\hat\lambda_{\mathrm{MLE}}\big) = n\lambda^n\int_0^\infty dw\,\frac{e^{-\lambda w}}{w}\int_{\mathbb{R}_+^n}\delta\Big(\sum_{i=1}^{n}x_i - w\Big)\,d^n x.$$
Scale xi as xi = w yi; then d^n x = w^n d^n y, and
$$E\big(\hat\lambda_{\mathrm{MLE}}\big) = n\lambda^n\int_0^\infty w^{n-1}e^{-\lambda w}\,dw\int_{\mathbb{R}_+^n}\delta\Big[w\Big(\sum_{i=1}^{n}y_i - 1\Big)\Big]\,d^n y = n\lambda^n\int_0^\infty w^{n-2}e^{-\lambda w}\,dw\int_{\mathbb{R}_+^n}\delta\Big(\sum_{i=1}^{n}y_i - 1\Big)\,d^n y = n\,(n-2)!\,\lambda\,\mathrm{vol}(\Delta^n),$$
where vol(Δⁿ) is the volume of the simplex $\Delta^n = \{(y_1,\ldots,y_n)\,|\,\sum_{k=1}^{n}y_k = 1\}$. Using the same trick with E(1) = 1 we have
$$1 = \int_{\mathbb{R}_+^n}\lambda^n e^{-\lambda\sum_{i=1}^{n}x_i}\,d^n x = \lambda^n\int_0^\infty w^{n-1}e^{-\lambda w}\,dw\int_{\mathbb{R}_+^n}\delta\Big(\sum_{i=1}^{n}y_i - 1\Big)\,d^n y = (n-1)!\,\mathrm{vol}(\Delta^n).$$
Hence vol(Δⁿ) = 1/(n − 1)! and
$$E\big(\hat\lambda_{\mathrm{MLE}}\big) = \frac{n\,(n-2)!}{(n-1)!}\,\lambda = \frac{n}{n-1}\,\lambda.$$
We see that λ̂MLE is not unbiased (though it is asymptotically unbiased). Thus
$$\hat\lambda_n = \frac{n-1}{n}\,\hat\lambda_{\mathrm{MLE}} = \frac{n-1}{\sum_{i=1}^{n}X_i}$$
is unbiased.
Next, let us compute the efficiency of this estimator. We first need to compute the variance, which we do by applying once again the very same trick as before:
$$\mathrm{var}\big(\hat\lambda_n\big) = \int_{\mathbb{R}_+^n}\Big(\frac{n-1}{\sum_{i=1}^{n}x_i} - \lambda\Big)^2\lambda^n e^{-\lambda\sum_{i=1}^{n}x_i}\,d^n x = \mathrm{vol}(\Delta^n)\,\lambda^n\int_0^\infty w^{n-1}\Big(\frac{n-1}{w}-\lambda\Big)^2 e^{-\lambda w}\,dw$$
$$= \frac{\lambda^n}{(n-1)!}\int_0^\infty\big[(n-1)^2 w^{n-3} - 2(n-1)\lambda\,w^{n-2} + \lambda^2 w^{n-1}\big]e^{-\lambda w}\,dw = \frac{(n-1)^2(n-3)! - 2(n-1)(n-2)! + (n-1)!}{(n-1)!}\,\lambda^2 = \Big(\frac{n-1}{n-2}-1\Big)\lambda^2 = \frac{\lambda^2}{n-2}.$$
We also need the Fisher information:
$$I_n(\lambda) = n\,E\!\left[\left(\frac{\partial}{\partial\lambda}(\log\lambda - \lambda x)\right)^{2}\right] = n\,E\!\left[\left(\frac{1}{\lambda}-x\right)^{2}\right] = n\lambda\int_0^\infty\Big(\frac{1}{\lambda^2} - \frac{2x}{\lambda} + x^2\Big)e^{-\lambda x}\,dx = \frac{n}{\lambda^2}.$$
Combining these last two results we get
$$\mathrm{eff}\big(\hat\lambda_n\big) = \frac{1}{I_n(\lambda)\,\mathrm{var}\big(\hat\lambda_n\big)} = \frac{1}{\dfrac{n}{\lambda^2}\cdot\dfrac{\lambda^2}{n-2}} = 1 - \frac{2}{n}.$$
Alternatively, we could have computed vol(Δⁿ) by a change of variables. For instance, if we had to compute the integral of some function g(y1, y2, . . . , yn) over the simplex Δⁿ, i.e.,
$$I_g = \int_{\mathbb{R}_+^n}\delta\Big(\sum_{i=1}^{n}y_i - 1\Big)\,g(y_1,y_2,\ldots,y_n)\,d^n y,$$
we could define, e.g.,
$$y_1 = u_1,\quad y_2 = (1-u_1)u_2,\quad y_3 = (1-u_1)(1-u_2)u_3,\quad\ldots,\quad y_{n-1} = (1-u_1)(1-u_2)\cdots u_{n-1},\quad y_n = (1-u_1)(1-u_2)\cdots(1-u_{n-1}).$$
(Note that the variable yn is not independent!) With this choice $\sum_{i=1}^{n}y_i = 1$, and 0 ≤ ui ≤ 1 for i = 1, 2, . . . , n − 1. The Jacobian of the change is very easy to compute because the Jacobian matrix is lower triangular,
$$\frac{\partial(y_1,\ldots,y_{n-1})}{\partial(u_1,\ldots,u_{n-1})} = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ * & 1-u_1 & 0 & \cdots & 0 \\ * & * & (1-u_1)(1-u_2) & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ * & * & * & \cdots & \prod_{i=1}^{n-2}(1-u_i) \end{pmatrix}.$$
Hence
$$\det\frac{\partial(y_1,\ldots,y_{n-1})}{\partial(u_1,\ldots,u_{n-1})} = (1-u_{n-2})(1-u_{n-3})^2(1-u_{n-4})^3\cdots(1-u_1)^{n-2},$$
and we have
$$I_g = \int_0^1 du_{n-1}\int_0^1(1-u_{n-2})\,du_{n-2}\cdots\int_0^1(1-u_1)^{n-2}\,g\big(u_1,(1-u_1)u_2,\ldots,\textstyle\prod_{i=1}^{n-1}(1-u_i)\big)\,du_1.$$
In the particular case g ≡ 1 we obtain
$$\mathrm{vol}(\Delta^n) = I_1 = 1\cdot\frac{1}{2}\cdot\frac{1}{3}\cdots\frac{1}{n-1} = \frac{1}{(n-1)!}.$$
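A Monte Carlo sketch of Exercise 1.14 (arbitrary illustrative values of λ and n), checking the bias of λ̂MLE, the unbiasedness of λ̂n, and the efficiency 1 − 2/n:

```python
import numpy as np

# lambda_MLE = 1/X-bar is biased, lambda_n = (n-1)/sum(X_i) is unbiased,
# and eff(lambda_n) should be close to 1 - 2/n.
rng = np.random.default_rng(6)
lam, n, reps = 2.0, 10, 400_000

x = rng.exponential(1.0 / lam, size=(reps, n))
lam_mle = 1.0 / x.mean(axis=1)               # n / sum(x)
lam_n = (n - 1) / x.sum(axis=1)              # bias-corrected version

I_n = n / lam**2                             # Fisher information of the sample
print("E[lambda_MLE] ~", lam_mle.mean(), " (n/(n-1)*lambda =", n / (n - 1) * lam, ")")
print("E[lambda_n]   ~", lam_n.mean(), "   (lambda =", lam, ")")
print("efficiency    ~", 1.0 / (I_n * lam_n.var()), " (1 - 2/n =", 1 - 2 / n, ")")
```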
Proposition 1.19 (Asymptotic normality) Let {X1, · · · , Xn} be a sequence of i.i.d. observations, Xk ∼ f(x; θ). Let θ̂n be an MLE of θ; then
$$\sqrt{n}\,\big(\hat\theta_n-\theta\big) \xrightarrow{\ d\ } N\!\left(0,\frac{1}{I_1(\theta)}\right).$$
(See Lehmann, Elements of Large Sample Theory, Springer, 1999, for a proof.) The meaning of convergence in distribution is given in this
Definition 1.20 (Convergence in Distribution) A sequence of random variables {X1, X2, X3, . . . } converges in distribution to a random variable X, written $X_n \xrightarrow{\ d\ } X$, if
$$\lim_{n\to\infty}F_{X_n}(x) = F_X(x)$$
for all x at which the CDF, FX(x), is continuous.
At this point it is also useful to recall the central limit theorem, which we quote without proof.
Theorem 1.21 (Central Limit Theorem) Let X1, X2, . . . be i.i.d. random variables with E(Xi) = µ and var(Xi) = σ² < ∞. Define
$$Z_n := \frac{\sum_{i=1}^{n}X_i - n\mu}{\sigma\sqrt{n}} = \frac{\bar X-\mu}{\sigma/\sqrt{n}}.$$
Then the distribution function of Zn converges to the distribution function of a standard normal random variable as n → ∞, i.e., Zn converges in distribution to a normally distributed random variable X ∼ N(0, 1).
Exercise 1.15. Show that an MLE is asymptotically efficient.
If θ̂n is an MLE, $\sqrt{n}(\hat\theta_n-\theta)\xrightarrow{\ d\ } N(0, I_1^{-1}(\theta))$, which implies $\hat\theta_n \xrightarrow{\ d\ } N(\theta, I_n^{-1}(\theta))$. Hence var(θ̂n) → I_n^{-1}(θ) and eff(θ̂n) → 1.
We next wonder if a given estimator extracts all the information about the unknown parameter θ that is available in our samples. Assume we have observed a particular value of θ̂. In
general, there would be various outcomes x = (x1 , . . . , xn ) that would lead to this particular
estimate. If their distribution does not depend on θ, knowing which of them has specifically
happened does not provide further information about the value of θ. This motivates the
following definition.
Definition 1.22 (Sufficient Statistic). Let X = (X1 , . . . , Xn ) be i.i.d. from a probability
distribution with parameter θ. Then the statistic T (X) is called a sufficient statistic for θ
if the conditional distribution of X1 , . . . , Xn given the value of T does not depend on θ.
There is no need to compute conditional PDFs to check whether some statistic is sufficient, thanks to the following
Theorem 1.23 A statistic T (X) is a sufficient statistic for θ if and only if the joint probability density of X can be factorised into two factors, one of which depends only on T and
the parameters while the other is independent of the parameters:
f (x; θ) = g(t; θ)h (x) .
We do not prove this theorem here. The second factor may be written in terms of t, since it is a function of the outcomes, but it cannot depend on θ.
Theorem 1.24 Efficient estimators are sufficient.
The converse is not true; there exist sufficient estimators/statistics that are not efficient.
Proof of Theorem 1.24. From the proof of Theorem 1.16 we know that if θ̂ is unbiased,
$$\mathrm{cov}(W,\hat\theta) = 1.$$
If, moreover, θ̂ is efficient, var(θ̂) = 1/var(W); hence
$$E\Big\{\big[W - \mathrm{var}(W)(\hat\theta-\theta)\big]^2\Big\} = \mathrm{var}(W) + \mathrm{var}(W)^2\,\mathrm{var}(\hat\theta) - 2\,\mathrm{var}(W)\,E\big[W(\hat\theta-\theta)\big] = 2\,\mathrm{var}(W) - 2\,\mathrm{var}(W)\,\mathrm{cov}(W,\hat\theta) = 0.$$
Since E(X²) = 0 ⇒ Pr(X = 0) = 1, it must be the case that
$$W = \mathrm{var}(W)\,(\hat\theta-\theta) =: a(\theta)\,\hat\theta + b(\theta).$$
But, since
$$W = \frac{\partial}{\partial\theta}\log f(\mathbf{X};\theta),$$
we see that
$$f(\mathbf{X};\theta) = \exp\big[A(\theta)\hat\theta + B(\theta) + C(\mathbf{X})\big] = \exp\big[A(\theta)\hat\theta + B(\theta)\big]\,K(\mathbf{X}).$$
So, f(x; θ) = exp[A(θ)θ̂ + B(θ)]K(x), and since, according to Theorem 1.23, this is the required factorization for sufficiency, θ̂ is sufficient.
Example 1.4. Suppose we use x̄ to estimate λ, the parameter of the Poisson distribution
$$p(k;\lambda) = \frac{\lambda^k}{k!}\,e^{-\lambda}.$$
For a sample of size n we have
$$p(x_1,x_2,\ldots,x_n;\lambda) = \prod_{i=1}^{n}\frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = e^{-n\lambda}\,\lambda^{\sum_{i=1}^{n}x_i}\prod_{i=1}^{n}\frac{1}{x_i!} = e^{-n\lambda}\,\lambda^{n\bar x}\left(\prod_{i=1}^{n}\frac{1}{x_i!}\right),$$
which is the required factorization according to Theorem 1.23. Hence x̄ is sufficient.
Exercise 1.16. Show that (X̄, S²), defined in Example 1.2, is a sufficient statistic to estimate the parameters µ and σ² of a normal distribution, Example 1.1.
Theorem 1.25 From any unbiased estimator that is not based on a sufficient statistic, an
improved estimate can be obtained which is based on the sufficient statistic. It is unbiased
and it has smaller variance, and is obtained by averaging with respect to the conditional
distribution given the sufficient statistic.
So, let R(X) be an unbiased estimate of the parameter θ and T(X) a sufficient statistic for θ. The conditional distribution of R given T is
$$f_{R\,|\,T}(r\,|\,t) = \frac{f_{RT}(r,t;\theta)}{f_T(t;\theta)},$$
where fRT(r, t; θ) is the joint probability density function of R and T, and
$$f_T(t;\theta) = \int_{-\infty}^{\infty}f_{RT}(r,t;\theta)\,dr$$
is the marginal distribution for T. Because T is a sufficient statistic, fR|T(r | t) does not depend on θ. The improved estimate of θ, S(T), is the function of T that is obtained by averaging R with respect to its conditional distribution given T:
$$S(T) := E[R\,|\,T] = \int_{-\infty}^{\infty}r\,f_{R\,|\,T}(r\,|\,T)\,dr.$$
Exercise 1.17. Prove Theorem 1.25.
Since R is an unbiased estimator of θ it satisfies
$$E[R] = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}r\,f_{RT}(r,t;\theta)\,dr\,dt = \theta.$$
Let us check that S is also unbiased:
$$E(S) = \int_{-\infty}^{\infty}s(t)\,f_T(t;\theta)\,dt = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}r\,f_{R|T}(r\,|\,t)\,f_T(t;\theta)\,dr\,dt = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}r\,f_{RT}(r,t;\theta)\,dr\,dt = E(R) = \theta.$$
It only remains to check that var(S) ≤ var(R):
$$\mathrm{var}(R) = E\big[(R-\theta)^2\big] = E\big\{[(S-\theta)+(R-S)]^2\big\} = \mathrm{var}(S) + E\big[(R-S)^2\big] + 2\,E[(R-S)(S-\theta)]. \qquad (1.3)$$
Let us check that the last term is identically zero:
$$E[(R-S)(S-\theta)] = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}[r-s(t)][s(t)-\theta]\,f_{R,T}(r,t;\theta)\,dr\,dt = \int_{-\infty}^{\infty}\left\{\int_{-\infty}^{\infty}[r-s(t)]\,f_{R|T}(r\,|\,t)\,dr\right\}[s(t)-\theta]\,f_T(t;\theta)\,dt.$$
But the inner integral is zero. Since E[(R − S)²] ≥ 0, we see from Eq. (1.3) that var(S) ≤ var(R).
Exercise 1.18. We toss a coin n times [head = 1, tail = 0]. We decide to use X1 (the result of the first toss, ignoring the rest) as an estimate of the probability p of a head, i.e., θ̂ = X1. Check that θ̂ is unbiased. Compute its variance. Next, check that in n trials the proportion of heads is a sufficient statistic. Construct an improved estimate from θ̂ based on the proportion of heads in n trials using Theorem 1.25 and check explicitly that it has smaller variance.
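A numerical sketch related to this exercise (it does not replace the analytic solution asked for): Rao–Blackwellizing θ̂ = X1 with the sufficient statistic T = ΣXi gives E[X1 | T] = T/n = X̄, whose variance is n times smaller.

```python
import numpy as np

# Compare the crude estimator X_1 with its Rao-Blackwellized version X-bar.
rng = np.random.default_rng(7)
p, n, reps = 0.35, 25, 300_000

tosses = rng.binomial(1, p, size=(reps, n))
theta_crude = tosses[:, 0]            # first toss only
theta_rb = tosses.mean(axis=1)        # E[X_1 | sum] = proportion of heads

print("E, var of X_1   :", theta_crude.mean(), theta_crude.var())   # ~ p, p(1-p)
print("E, var of X-bar :", theta_rb.mean(), theta_rb.var())         # ~ p, p(1-p)/n
```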
1.3 Bayesian approach
Within the Bayesian approach, the estimated parameter θ is assumed to be a random variable, Θ, distributed according to a prior PDF, f(θ), representing the knowledge about θ one possesses before performing the estimation. Therefore, in contrast to the frequentist philosophy, where the estimated parameter was assumed to have a fixed, well-defined value, it is a particular realization of the parameter that is really estimated in a real-life experiment. As a consequence, an optimal estimator must not only be global and minimize the MSE, but must also take into account which values of Θ are more probable according to f(θ). Hence, such an estimator must minimize the average mean squared error (MSE):
$$\mathrm{MSE}(\hat\theta) = \int f(\theta)\,d\theta\int(\hat\theta-\theta)^2\,f(\mathbf{x}\,|\,\theta)\,d^n x,$$
where we recall that the estimator θ̂ is a function of the sample x = (x1, . . . , xn) and where f(x | θ) is the PDF previously labelled as f(x; θ) within the frequentist approach, which, due to the stochastic character of the parameter, now represents a conditional density. The last definition can also be written as
$$\mathrm{MSE}(\hat\theta) = \int(\hat\theta-\theta)^2\,f(\mathbf{x},\theta)\,d^n x\,d\theta,$$
where the joint PDF, f(x, θ), is defined via Bayes’ theorem (hence the name of the approach) in two equivalent ways:
$$f(\mathbf{x},\theta) = f(\mathbf{x}\,|\,\theta)\,f(\theta) = f(\theta\,|\,\mathbf{x})\,f(\mathbf{x})$$
[we abuse notation here by using the same letter f to denote all PDFs; in a more precise notation one should write $f_{\mathbf{X}\Theta}(\mathbf{x},\theta)=f_{\mathbf{X}|\Theta}(\mathbf{x}|\theta)f_\Theta(\theta)=f_{\Theta|\mathbf{X}}(\theta|\mathbf{x})f_{\mathbf{X}}(\mathbf{x})$, but we drop the subscripts to ease the notation]. In general, the conditional PDFs satisfy $\int f(\mathbf{x}\,|\,\theta)\,d^n x = \int f(\theta\,|\,\mathbf{x})\,d\theta = 1$, and the probability of a particular sample corresponds to the marginal $f(\mathbf{x}) = \int f(\mathbf{x},\theta)\,d\theta$. We, then, can also write
$$\mathrm{MSE}(\hat\theta) = \int f(\mathbf{x})\,d^n x\left[\int(\hat\theta-\theta)^2\,f(\theta\,|\,\mathbf{x})\,d\theta\right]. \qquad (1.4)$$
The minimum of this expression is attained by minimizing the square bracket for each outcome x:
$$\partial_{\hat\theta}\int(\hat\theta-\theta)^2\,f(\theta\,|\,\mathbf{x})\,d\theta = 0 \quad\Rightarrow\quad \hat\theta = \int\theta\,f(\theta\,|\,\mathbf{x})\,d\theta = E_{\Theta|\mathbf{X}}(\Theta). \qquad (1.5)$$
The optimal Minimum Mean Squared Error (MMSE) estimator simply corresponds to the average parameter value computed with respect to the posterior PDF, f(θ | x), that in principle may always be computed using Bayes’ theorem:
$$f(\theta\,|\,\mathbf{x}) = \frac{f(\mathbf{x}\,|\,\theta)\,f(\theta)}{\int f(\mathbf{x}\,|\,\theta)\,f(\theta)\,d\theta}. \qquad (1.6)$$
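A minimal numerical sketch of Eqs. (1.5)–(1.6), using a grid approximation of the posterior for a Bernoulli parameter θ with a flat prior (the true value and the sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(8)
theta_true, n = 0.7, 50
x = rng.binomial(1, theta_true, size=n)

theta = np.linspace(0.001, 0.999, 2000)             # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                         # flat prior f(theta)
log_lik = x.sum() * np.log(theta) + (n - x.sum()) * np.log(1.0 - theta)
post = prior * np.exp(log_lik - log_lik.max())      # f(x|theta) f(theta), unnormalized
post /= post.sum() * dtheta                         # normalize: Bayes' theorem, Eq. (1.6)

theta_mmse = np.sum(theta * post) * dtheta          # posterior mean, Eq. (1.5)
post_var = np.sum((theta - theta_mmse) ** 2 * post) * dtheta
print("MMSE estimate      :", theta_mmse)
print("posterior variance :", post_var)
```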
Within the Bayesian framework, one should view the process of data inference as a procedure in which the effective PDF of the estimated parameter θ becomes updated. Hence, the posterior PDF f(θ | x) represents the prior f(θ) that has been reshaped and narrowed down after learning the sample x:
$$f(\theta)\ \xrightarrow{\ \text{observe }\mathbf{x}\ }\ f(\theta\,|\,\mathbf{x}),$$
whereas the MMSE estimator (1.5) just outputs the mean of such an effective distribution. Moreover, the minimal MSE (1.4) then reads
$$\mathrm{MSE}(\hat\theta) = \int f(\mathbf{x})\,d^n x\int f(\theta\,|\,\mathbf{x})\,\big[\theta - E_{\Theta|\mathbf{X}}(\Theta)\big]^2 d\theta = \int f(\mathbf{x})\,\mathrm{var}_{\Theta|\mathbf{X}}(\Theta)\,d^n x,$$
so that it represents the variance of the parameter Θ computed also with respect to f(θ | x) and averaged over all the possible outcomes.
It is really important within the Bayesian approach to choose an appropriate f (θ) such
that, on one hand, it adequately represents the knowledge about the parameter before the
estimation, but, on the other, it does not significantly overshadow the information obtained
from the data collected.
Exercise 1.19. Consider the extremal case where the prior PDF is the Dirac delta distribution, fδ (θ) = δ (θ − θ0 ), which represents the case when we perfectly know the estimated
parameter before performing the estimation. Compute the MMSE θ̂ and discuss the role of
the observations. What is MSE(θ̂)?
Note that so far we did not require at any stage the sampled data to be independently distributed. Such a property, which previously was heavily used within the frequentist approach, is not necessary in the derivation of the optimal Bayesian estimator, which relies only on the form of the posterior PDF (1.6). In fact, as independently distributed data may be interpreted as if it were collected by carrying out consecutive repetitions of the estimation protocol, the Bayesian results in such a case may be understood as a progressive updating of the knowledge we possess about the parameter, where at each step the posterior is calculated based only on the outcome xi but with the prior already updated with the results x1, . . . , xi−1:
$$f(\theta)\ \xrightarrow{\ \text{observe }x_1\ }\ f(\theta\,|\,x_1)\ \xrightarrow{\ \text{observe }x_2\ }\ f(\theta\,|\,x_1,x_2)\ \xrightarrow{\ \text{observe }x_3\ }\ f(\theta\,|\,x_1,x_2,x_3)\ \cdots$$
Exercise 1.20. Show that the interpretation of (1.6) for independent samples as a progressive updating of the prior PDF is correct.
Consider the obvious relations [we use independence in the very first line: $f(x_1,\ldots,x_n\,|\,\theta) = \prod_{i=1}^{n}f(x_i\,|\,\theta) = f(x_n\,|\,\theta)\,f(x_1,\ldots,x_{n-1}\,|\,\theta)$]:
$$f(\theta\,|\,x_1,x_2,\ldots,x_n) = \frac{f(x_n\,|\,\theta)\,f(x_1,\ldots,x_{n-1}\,|\,\theta)\,f(\theta)}{\int f(x_n\,|\,\theta)\,f(x_1,\ldots,x_{n-1}\,|\,\theta)\,f(\theta)\,d\theta} = \frac{f(x_n\,|\,\theta)\,f(\theta\,|\,x_1,\ldots,x_{n-1})\,f(x_1,\ldots,x_{n-1})}{\int f(x_n\,|\,\theta)\,f(\theta\,|\,x_1,\ldots,x_{n-1})\,f(x_1,\ldots,x_{n-1})\,d\theta} = \frac{f(x_n\,|\,\theta)\,f(\theta\,|\,x_1,\ldots,x_{n-1})}{\int f(x_n\,|\,\theta)\,f(\theta\,|\,x_1,\ldots,x_{n-1})\,d\theta}.$$
So, the updating in the last step from the prior PDF f(θ | x1, . . . , xn−1) is based entirely on the observation xn. We can, obviously, repeat the procedure with f(θ | x1, . . . , xn−1), and so on.
The MMSE estimator plays a special role because of the following result. For any regular prior f(θ) one has
$$\mathrm{MSE}(\hat\theta)\ \underset{n\to\infty}{=}\ E_\Theta\!\left[\frac{1}{I_n(\Theta)}\right]\ \ge\ \frac{1}{E_\Theta[I_n(\Theta)]}, \qquad (1.7)$$
where the last expression follows from the Jensen inequality, which states that for any convex function f(X) one has E[f(X)] ≥ f[E(X)]. If the Fisher information In is independent of θ we can, of course, drop the expectations and the bound saturates. This relation enables us to establish a connection between the Bayesian and frequentist approaches.
Alternatively, one could prove the last inequality in (1.7) by invoking the Hölder inequality:
$$\int|g(x)h(x)|\,dx \le \left(\int|g(x)|^p\,dx\right)^{1/p}\left(\int|h(x)|^q\,dx\right)^{1/q}, \qquad\text{for all } p,q \text{ such that } \frac{1}{p}+\frac{1}{q}=1.$$
For $g(x)=\sqrt{x f(x)}$, $h(x)=\sqrt{f(x)/x}$, p = q = 2, we have
$$1 = \int f(x)\,dx \le \left(\int x f(x)\,dx\right)^{1/2}\left(\int\frac{f(x)}{x}\,dx\right)^{1/2},$$
assuming x > 0. Therefore
$$1 \le E(X)\,E(X^{-1}) \quad\Rightarrow\quad E(X^{-1}) \ge \frac{1}{E(X)}.$$
Within the Bayesian framework, nothing prevents us from considering other figures of merit, i.e., cost functions C(θ̂, θ), in order to generalize the MSE and define the average cost:
$$C(\hat\theta) = \int f(\theta)\,d\theta\int C(\hat\theta,\theta)\,f(\mathbf{x}\,|\,\theta)\,d\mathbf{x}.$$
The MSE is the special case C(θ̂, θ) = (θ̂ − θ)².
Example 1.5. To deal with a circularly symmetric parameter, we can consider the simplest cost function, introduced by Holevo:
$$C_H(\hat\theta,\theta) = C_H(\hat\theta-\theta) = 4\sin^2\!\left(\frac{\hat\theta-\theta}{2}\right). \qquad (1.8)$$
It is periodic (as it should be if one has circular symmetry) and $C_H(\hat\theta,\theta)\sim(\hat\theta-\theta)^2$ (i.e., it approaches the squared error) as θ̂ → θ.
2 Quantum Estimation

2.1 Frequentist (pointwise) approach

2.1.1 The quantum Cramér–Rao bound

As we mentioned in the introduction, in quantum mechanics we have (Born rule) $f(x;\theta) = \mathrm{tr}(E_x\rho_\theta)$, where $\{E_x\}$, $\int dx\,E_x = \mathbb{1}$, are the elements of a POVM and ρθ is the density operator parametrized by the quantity we want to estimate. Let us introduce the
Definition 2.1 [Symmetric Logarithmic Derivative (SLD)] The SLD, Lθ, is the self-adjoint operator satisfying the equation
$$\frac{L_\theta\rho_\theta + \rho_\theta L_\theta}{2} = \frac{\partial\rho_\theta}{\partial\theta} = \partial_\theta\rho_\theta.$$
Note that
$$\partial_\theta f(x\,|\,\theta) = \partial_\theta\,\mathrm{tr}[E_x\rho_\theta] = \mathrm{tr}[E_x\,\partial_\theta\rho_\theta] = \mathrm{tr}\!\left[E_x\,\frac{L_\theta\rho_\theta+\rho_\theta L_\theta}{2}\right] = \frac{1}{2}\,\mathrm{tr}[E_x L_\theta\rho_\theta] + \frac{1}{2}\,\mathrm{tr}[E_x\rho_\theta L_\theta] = \frac{1}{2}\,\mathrm{tr}[E_x L_\theta\rho_\theta] + \frac{1}{2}\big\{\mathrm{tr}\big[(E_x\rho_\theta L_\theta)^\dagger\big]\big\}^* = \frac{1}{2}\,\mathrm{tr}[E_x L_\theta\rho_\theta] + \frac{1}{2}\big[\mathrm{tr}(L_\theta\rho_\theta E_x)\big]^*,$$
where we have used the cyclic property of the trace. We can then write
$$\partial_\theta f(x\,|\,\theta) = \mathrm{Re}\big[\mathrm{tr}(\rho_\theta E_x L_\theta)\big].$$
We can use this result to express the Fisher information as
$$I_1(\theta) = \int dx\,\frac{\big\{\mathrm{Re}\big[\mathrm{tr}(\rho_\theta E_x L_\theta)\big]\big\}^2}{\mathrm{tr}(\rho_\theta E_x)}. \qquad (2.9)$$
The numerator of the integrand can be bounded as
$$\big\{\mathrm{Re}\big[\mathrm{tr}(\rho_\theta E_x L_\theta)\big]\big\}^2 \le \big|\mathrm{tr}(\rho_\theta E_x L_\theta)\big|^2 = \Big|\mathrm{tr}\big(\sqrt{\rho_\theta}\sqrt{E_x}\,\sqrt{E_x}\,L_\theta\sqrt{\rho_\theta}\big)\Big|^2 \qquad (2.10)$$
$$\le \mathrm{tr}\big(\sqrt{\rho_\theta}\sqrt{E_x}\sqrt{E_x}\sqrt{\rho_\theta}\big)\,\mathrm{tr}\big(\sqrt{E_x}\,L_\theta\sqrt{\rho_\theta}\,\sqrt{\rho_\theta}\,L_\theta\sqrt{E_x}\big) = \mathrm{tr}(\rho_\theta E_x)\,\mathrm{tr}(L_\theta E_x L_\theta\rho_\theta), \qquad (2.11)$$
where we have used the Schwarz inequality,
$$\big|\mathrm{tr}(A^\dagger B)\big|^2 \le \mathrm{tr}(A^\dagger A)\,\mathrm{tr}(B^\dagger B). \qquad (2.12)$$
We have also used that ρθ, Ex ≥ 0 and that Lθ is self-adjoint. Substituting this bound in Eq. (2.9) we have
$$I_1(\theta) \le \int dx\,\mathrm{tr}(L_\theta E_x L_\theta\rho_\theta) = \mathrm{tr}\Big(L_\theta\Big[\int dx\,E_x\Big]L_\theta\rho_\theta\Big) = \mathrm{tr}(L_\theta^2\rho_\theta), \qquad (2.13)$$
which states that the Fisher information I1(θ) of any quantum measurement is bounded by the so-called
Definition 2.2 (Quantum Fisher Information, QFI) The QFI is defined as
$$H(\theta) := \mathrm{tr}(L_\theta^2\rho_\theta) = \mathrm{tr}\big[L_\theta(\partial_\theta\rho_\theta)\big].$$
The second form of the definition follows from
$$\mathrm{tr}(L_\theta^2\rho_\theta) = \frac{\mathrm{tr}(L_\theta\rho_\theta L_\theta)+\mathrm{tr}(L_\theta^2\rho_\theta)}{2} = \mathrm{tr}\Big\{L_\theta\,\frac{\rho_\theta L_\theta + L_\theta\rho_\theta}{2}\Big\} = \mathrm{tr}\big\{L_\theta(\partial_\theta\rho_\theta)\big\}.$$
Eq. (2.13), through Theorem 1.16, leads to
Theorem 2.3 (Quantum Cramér–Rao bound) The variance of any estimator θ̂n of the parameter θ characterizing the family of states ρθ is bounded by
$$\mathrm{var}\big(\hat\theta_n\big) \ge \frac{1}{n\,H(\theta)}.$$
This is the quantum version of the Cramér–Rao theorem and provides an ultimate bound on the sensitivity that can be achieved in parameter estimation in the quantum mechanical framework.
The quantum Fisher information is an upper bound for the Fisher information, as it embodies the optimization of the Fisher information over all possible measurements. Optimal quantum measurements for the estimation of θ thus correspond to POVMs with Fisher information equal to the quantum Fisher information, i.e., those saturating both inequalities Eq. (2.10) and Eq. (2.11). The first one is saturated when tr[ρθ Ex Lθ] is a real number. The second one is based on the Schwarz inequality, Eq. (2.12), which is saturated when the matrices A and B are proportional. Hence, we must have
$$\sqrt{E_x}\,\sqrt{\rho_\theta} = c_x\,\sqrt{E_x}\,L_\theta\sqrt{\rho_\theta}$$
for all x. This condition can always be met by choosing the operators Ex to be assembled
from one-dimensional projectors onto a complete set of orthonormal eigenstates of Lθ .
Exercise 2.1. Consider the set of qubit pure states that lie in the equator of the Bloch sphere (θ = π/2). They are parametrized by the azimuthal angle φ. Compute the SLD, Lφ, for this one-parameter family. Compute the QFI, H(φ). Show that, by measuring n copies of these equatorial states in the orthogonal basis {|+⟩, |−⟩} and using the MLE estimator to process the classical data obtained from the measurement, the QCR bound is saturated asymptotically.
You should attempt to solve this exercise by yourself, but because of its relevance to a
discussion about the Heisenberg limit below, we provide a solution here.
Solution.
Let us compute the SLD by brute force (we will learn about other methods below). Defining the unit vectors r̂ = (cos φ, sin φ, 0) and φ̂ = (− sin φ, cos φ, 0) = ∂φ r̂, we have
$$\rho_\phi = \frac{\mathbb{1} + \hat r\cdot\boldsymbol\sigma}{2} \quad\Rightarrow\quad \partial_\phi\rho_\phi = \frac{\hat\phi\cdot\boldsymbol\sigma}{2}.$$
Write
$$L_\phi = a\mathbb{1} + \mathbf{b}\cdot\boldsymbol\sigma.$$
Then, using the definition of the SLD, we must have
$$2\,\hat\phi\cdot\boldsymbol\sigma = (a\mathbb{1}+\mathbf{b}\cdot\boldsymbol\sigma)(\mathbb{1}+\hat r\cdot\boldsymbol\sigma) + (\mathbb{1}+\hat r\cdot\boldsymbol\sigma)(a\mathbb{1}+\mathbf{b}\cdot\boldsymbol\sigma) = 2a\mathbb{1} + 2(\mathbf{b}+a\hat r)\cdot\boldsymbol\sigma + \sum_{ij}(b_i\hat r_j + b_j\hat r_i)\sigma_i\sigma_j = 2(a+\mathbf{b}\cdot\hat r)\mathbb{1} + 2(\mathbf{b}+a\hat r)\cdot\boldsymbol\sigma.$$
It follows that a + b · r̂ = 0 and b + ar̂ = φ̂. Since r̂ · φ̂ = 0, the second condition implies the first one, and b = φ̂ − ar̂ for any a. Hence, the SLD is not uniquely defined:
$$L_\phi = a\mathbb{1} + (\hat\phi - a\hat r)\cdot\boldsymbol\sigma \qquad\text{for any } a\in\mathbb{R}.$$
From the definition of QFI, we readily see that
$$H(\phi) = \mathrm{tr}\Big\{\big[a\mathbb{1}+(\hat\phi-a\hat r)\cdot\boldsymbol\sigma\big]\,\frac{\hat\phi\cdot\boldsymbol\sigma}{2}\Big\} = (\hat\phi-a\hat r)\cdot\hat\phi = 1.$$
The PMF of the measurement {|+⟩, |−⟩} is
$$p(+;\phi) = \mathrm{tr}\big(|+\rangle\langle+|\,\rho_\phi\big) = \big|\langle+|\psi_\phi\rangle\big|^2 = \left|\frac{1+e^{i\phi}}{2}\right|^2 = \cos^2\frac{\phi}{2}, \qquad p(-;\phi) = \sin^2\frac{\phi}{2}.$$
So, we see that the outcome of each individual measurement is a Bernoulli random variable. By assigning 1 to outcome + and 0 to outcome −, the PMF of such a variable is
$$X \sim p(x;\phi) = \left(\cos^2\frac{\phi}{2}\right)^{x}\left(\sin^2\frac{\phi}{2}\right)^{1-x}.$$
The (log-)likelihood function is
$$L(\phi\,|\,\pm) = \frac{1\pm\cos\phi}{2}; \qquad l(\phi\,|\,\pm) = \log\left(\frac{1\pm\cos\phi}{2}\right).$$
Hence
$$\partial_\phi\,l(\phi\,|\,\pm) = \frac{\mp\sin\phi}{1\pm\cos\phi},$$
and
$$I_1(\phi) = E\!\left[\left(\frac{\mp\sin\phi}{1\pm\cos\phi}\right)^{2}\right] = \frac{\sin^2\phi}{(1+\cos\phi)^2}\,\frac{1+\cos\phi}{2} + \frac{\sin^2\phi}{(1-\cos\phi)^2}\,\frac{1-\cos\phi}{2} = 1.$$
From this, In(φ) = n, and we know that the MLE for this measurement will saturate the bound asymptotically.
Measuring each individual copy of the given n states in the {|+⟩, |−⟩} basis we have
$$p(x_1,\ldots,x_n;\phi) = \left(\cos^2\frac{\phi}{2}\right)^{n\bar x}\left(\sin^2\frac{\phi}{2}\right)^{n(1-\bar x)}.$$
Checking the right hand side of this expression we clearly see that X̄ is a sufficient statistic for φ. We also see that X̄ is (up to the 1/n rescaling) binomially distributed:
$$n\bar X \sim \mathrm{Bin}\Big(n,\cos^2\frac{\phi}{2}\Big) \quad\Leftrightarrow\quad p_{\bar X}(\bar x;\phi) = \binom{n}{n\bar x}\left(\cos^2\frac{\phi}{2}\right)^{n\bar x}\left(\sin^2\frac{\phi}{2}\right)^{n(1-\bar x)}.$$
Hence, L(φ | x̄) = pX̄(x̄; φ). A straightforward derivation leads to the expression of the MLE,
$$\hat\phi = \arccos(2\bar x - 1).$$
This completes the solution of the exercise.
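The following sketch simulates the protocol of Exercise 2.1 numerically (the value of φ is an arbitrary illustrative choice): measure n copies in the {|+⟩, |−⟩} basis, build the MLE φ̂ = arccos(2x̄ − 1), and compare its variance with the QCR bound 1/(nH(φ)) = 1/n.

```python
import numpy as np

rng = np.random.default_rng(9)
phi_true, n, reps = 1.1, 200, 100_000

p_plus = np.cos(phi_true / 2.0) ** 2                 # Pr(outcome "+") = cos^2(phi/2)
x = rng.binomial(1, p_plus, size=(reps, n))          # 1 = outcome "+", 0 = "-"
x_bar = x.mean(axis=1)
phi_hat = np.arccos(np.clip(2.0 * x_bar - 1.0, -1.0, 1.0))   # MLE of Exercise 2.1

print("mean of phi-hat :", phi_hat.mean(), " (phi_true =", phi_true, ")")
print("var  of phi-hat :", phi_hat.var())
print("QCR bound 1/n   :", 1.0 / n)
```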
Going back to our general discussion, one can find a closed form for the SLD in terms of the spectral representation of ρθ,
$$\rho_\theta = \sum_a\lambda_a\,|\psi_a\rangle\langle\psi_a|.$$
It is given by
$$L_\theta = \sum_{a,b}\frac{2\,\langle\psi_a|\partial_\theta\rho_\theta|\psi_b\rangle}{\lambda_a+\lambda_b}\,|\psi_a\rangle\langle\psi_b|, \qquad (2.14)$$
where the sum extends over all a and b such that λa + λb ≠ 0.
Exercise 2.2. Prove Eq. (2.14).
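Eq. (2.14) is easy to evaluate numerically. The sketch below (the qubit family and the values r0 = 0.8, θ0 = 0.3 are illustrative assumptions, not taken from the notes) builds the SLD from the spectral decomposition of a mixed qubit state and checks that tr(ρL²) agrees with tr[L(∂θρ)], as in Definition 2.2.

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def rho(t, r0):
    # Mixed qubit family rho_t = (1 + r0 (cos t, sin t, 0).sigma)/2
    return 0.5 * (I2 + r0 * (np.cos(t) * sx + np.sin(t) * sy))

r0, theta0, eps = 0.8, 0.3, 1e-6
drho = (rho(theta0 + eps, r0) - rho(theta0 - eps, r0)) / (2 * eps)  # d rho / d theta

lam, psi = np.linalg.eigh(rho(theta0, r0))          # spectral decomposition of rho
L = np.zeros((2, 2), dtype=complex)
for a in range(2):
    for b in range(2):
        if lam[a] + lam[b] > 1e-12:                 # Eq. (2.14)
            elem = psi[:, a].conj() @ drho @ psi[:, b]
            L += 2 * elem / (lam[a] + lam[b]) * np.outer(psi[:, a], psi[:, b].conj())

H = np.trace(rho(theta0, r0) @ L @ L).real          # QFI = tr(rho L^2)
print("QFI from (2.14)        :", H)
print("tr[L d(rho)/d(theta)]  :", np.trace(L @ drho).real)   # should agree
```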
2.1.2 The pure state model

A much simpler expression can be written for pure states. A straightforward calculation gives
$$L_\theta = 2\,\partial_\theta\rho_\theta = 2\big(|\partial_\theta\psi_\theta\rangle\langle\psi_\theta| + |\psi_\theta\rangle\langle\partial_\theta\psi_\theta|\big). \qquad (2.15)$$
From this result one can easily obtain the QFI of this so-called pure state model:
$$H(\theta) = 4\big(\langle\partial_\theta\psi_\theta|\psi_\theta\rangle^2 + \langle\partial_\theta\psi_\theta|\partial_\theta\psi_\theta\rangle\big). \qquad (2.16)$$
Exercise 2.3. Prove that for pure states one can choose the SLD to be given by Eq. (2.15). By using this result, prove Eq. (2.16).
Pure states satisfy ρθ² = ρθ, hence
$$\partial_\theta\rho_\theta = \partial_\theta\rho_\theta^2 = (\partial_\theta\rho_\theta)\rho_\theta + \rho_\theta(\partial_\theta\rho_\theta).$$
By comparing with the definition of Lθ,
$$\partial_\theta\rho_\theta = \frac{L_\theta\rho_\theta + \rho_\theta L_\theta}{2},$$
we see that we can choose Lθ/2 = ∂θρθ, and
$$L_\theta = 2\,\partial_\theta\rho_\theta = 2\,\partial_\theta\big(|\psi_\theta\rangle\langle\psi_\theta|\big) = 2\big(|\partial_\theta\psi_\theta\rangle\langle\psi_\theta| + |\psi_\theta\rangle\langle\partial_\theta\psi_\theta|\big),$$
where, assuming {|α⟩} is a fixed (θ-independent) basis of the Hilbert space,
$$|\partial_\theta\psi_\theta\rangle = \partial_\theta\Big(\sum_\alpha\psi_\theta^\alpha\,|\alpha\rangle\Big) = \sum_\alpha(\partial_\theta\psi_\theta^\alpha)\,|\alpha\rangle$$
and
$$\langle\partial_\theta\psi_\theta| = \partial_\theta\Big(\sum_\alpha(\psi_\theta^\alpha)^*\,\langle\alpha|\Big) = \sum_\alpha(\partial_\theta\psi_\theta^\alpha)^*\,\langle\alpha|.$$
The QFI is easily derived by noticing that for pure states one can write
$$H(\theta) = \mathrm{tr}\big[L_\theta(\partial_\theta\rho_\theta)\big] = \tfrac{1}{2}\,\mathrm{tr}\big(L_\theta^2\big).$$
Then,
$$H(\theta) = 2\,\mathrm{tr}\big[\big(|\partial_\theta\psi_\theta\rangle\langle\psi_\theta| + |\psi_\theta\rangle\langle\partial_\theta\psi_\theta|\big)\big(|\partial_\theta\psi_\theta\rangle\langle\psi_\theta| + |\psi_\theta\rangle\langle\partial_\theta\psi_\theta|\big)\big] = 2\big(\langle\psi_\theta|\partial_\theta\psi_\theta\rangle^2 + \langle\partial_\theta\psi_\theta|\psi_\theta\rangle^2 + 2\,\langle\partial_\theta\psi_\theta|\partial_\theta\psi_\theta\rangle\big) = 4\big(\langle\partial_\theta\psi_\theta|\psi_\theta\rangle^2 + \langle\partial_\theta\psi_\theta|\partial_\theta\psi_\theta\rangle\big),$$
where we have used that
$$\langle\psi_\theta|\partial_\theta\psi_\theta\rangle = -\langle\partial_\theta\psi_\theta|\psi_\theta\rangle,$$
which follows from taking the derivative of ⟨ψθ|ψθ⟩ = 1.
Let us consider the case where the parameter of interest, θ, is the amplitude of a unitary perturbation imprinted on a given initial pure state |ψ0⟩. The family of quantum states we are dealing with may be expressed as
$$|\psi_\theta\rangle = U_\theta|\psi_0\rangle,$$
where Uθ = exp{−iθH} is a unitary operator and H is the corresponding Hermitian generator (we may think of it as the “Hamiltonian” of the system). This example is of particular interest in metrology.
From Eq. (2.16) the QFI can be easily computed to be
$$H(\theta) = 4(\Delta H)^2_{\psi_\theta} = 4(\Delta H)^2_{\psi_0}, \qquad (2.17)$$
where (ΔH)ψ is the standard deviation (uncertainty) of the Hermitian operator H in the state |ψ⟩, defined through
$$(\Delta H)^2_\psi = \langle\psi|H^2|\psi\rangle - \langle\psi|H|\psi\rangle^2$$
(it is just the variance in the quantum mechanical sense). The QCR bound is then
$$\mathrm{var}(\hat\theta) \ge \frac{1}{4n(\Delta H)^2_{\psi_0}},$$
and we note that it is independent of θ, hence providing a global bound.
Exercise 2.4. Derive Eq. (2.17).
$$|\partial_\theta\psi_\theta\rangle = \partial_\theta U_\theta|\psi_0\rangle = -iH\,U_\theta|\psi_0\rangle = -iH|\psi_\theta\rangle,$$
$$\langle\partial_\theta\psi_\theta|\psi_\theta\rangle = i\,\langle\psi_\theta|H|\psi_\theta\rangle = i\,\langle\psi_0|H|\psi_0\rangle, \qquad \langle\partial_\theta\psi_\theta|\partial_\theta\psi_\theta\rangle = \langle\psi_\theta|H^2|\psi_\theta\rangle = \langle\psi_0|H^2|\psi_0\rangle.$$
Substituting these expressions in Eq. (2.16) we obtain the desired result:
$$H(\theta) = 4\big[\big(i\langle\psi_\theta|H|\psi_\theta\rangle\big)^2 + \langle\psi_\theta|H^2|\psi_\theta\rangle\big] = 4(\Delta H)^2_{\psi_\theta} = 4(\Delta H)^2_{\psi_0}.$$
The lower bound we have obtained is a function of the reference state. We should now find the best state |ψ0⟩ to estimate θ. We will prove the following.
Claim 2.4 The maximum value that (ΔH)ψ can achieve is half of the so-called spread of H, namely,
$$(\Delta H)_\psi \le \frac{|\lambda_{\max}-\lambda_{\min}|}{2}.$$
This value is attainable with the choice
$$|\psi_0\rangle = \frac{|\lambda_{\min}\rangle + |\lambda_{\max}\rangle}{\sqrt{2}}, \qquad (2.18)$$
where |λmin⟩ (|λmax⟩) is the eigenstate of the minimum (maximum) eigenvalue, λmin (λmax), of H.
Then, the QCR bound for this family of states reads
$$\mathrm{var}(\hat\theta) \ge \frac{1}{n(\lambda_{\max}-\lambda_{\min})^2}. \qquad (2.19)$$
Proof of the claim. Let the spectral decomposition of H be given by
$$H = \sum_a\lambda_a\,|\lambda_a\rangle\langle\lambda_a|.$$
A generic state |ψ⟩ can be written in this eigenbasis as
$$|\psi\rangle = \sum_a\psi_a\,|\lambda_a\rangle.$$
Then,
$$(\Delta H)^2_\psi = \sum_a\lambda_a^2\,p_a - \Big(\sum_a\lambda_a\,p_a\Big)^2, \qquad (2.20)$$
where pa := |ψa|² ≥ 0 are (of course!) probabilities, $\sum_a p_a = 1$. Instead of H, let us consider the operator
$$\tilde H = \frac{H - \lambda_{\min}\mathbb{1}}{\lambda_{\max}-\lambda_{\min}},$$
whose minimum eigenvalue, λ̃min, is zero and maximum eigenvalue, λ̃max, is one. We obviously have
$$(\Delta H)^2_\psi = (\lambda_{\max}-\lambda_{\min})^2\,\big(\Delta\tilde H\big)^2_\psi. \qquad (2.21)$$
Let us now show that (ΔH̃)²ψ ≤ 1/4, with equality attainable. Eq. (2.20) is completely general, so it also holds for H̃, but λ̃a² ≤ λ̃a. Hence, if we define $u := \sum_a\tilde\lambda_a\,p_a \ge 0$, we have
$$\big(\Delta\tilde H\big)^2_\psi \le u - u^2, \qquad\text{for all } u\ge 0. \qquad (2.22)$$
The maximum value of the right hand side is 1/4 (for u = 1/2). Substituting in (2.21) we obtain the desired bound:
$$(\Delta H)^2_\psi \le \left(\frac{\lambda_{\max}-\lambda_{\min}}{2}\right)^2.$$
The bound (2.22) is attained iff pa = 0 for all λ̃a with the exception of λ̃max = 1 and λ̃min = 0 [since (2.22) follows from the inequality λ̃a² ≤ λ̃a], together with u = 1/2, i.e., p_min = p_max = 1/2. Thus, any state of the form
$$|\psi\rangle = \frac{|\tilde\lambda_{\min}\rangle + e^{iw}|\tilde\lambda_{\max}\rangle}{\sqrt{2}} = \frac{|\lambda_{\min}\rangle + e^{iw}|\lambda_{\max}\rangle}{\sqrt{2}}$$
attains the maximum. In particular we can choose w = 0.
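Claim 2.4 and Eq. (2.17) are easy to probe numerically. The sketch below (the random Hermitian generator and dimension d = 4 are illustrative assumptions) compares the QFI 4(ΔH)² of a random probe with that of the equal superposition of the extreme eigenstates, which should equal (λmax − λmin)².

```python
import numpy as np

rng = np.random.default_rng(10)
d = 4
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
H = (A + A.conj().T) / 2                       # random Hermitian generator

def qfi(psi):                                  # 4 (<H^2> - <H>^2) for a pure probe
    mean_H = (psi.conj() @ H @ psi).real
    mean_H2 = (psi.conj() @ H @ H @ psi).real
    return 4.0 * (mean_H2 - mean_H**2)

lam, vecs = np.linalg.eigh(H)
psi_rand = rng.normal(size=d) + 1j * rng.normal(size=d)
psi_rand /= np.linalg.norm(psi_rand)
psi_opt = (vecs[:, 0] + vecs[:, -1]) / np.sqrt(2)   # (|lambda_min> + |lambda_max>)/sqrt(2)

print("QFI, random probe  :", qfi(psi_rand))
print("QFI, optimal probe :", qfi(psi_opt))
print("(spread)^2         :", (lam[-1] - lam[0]) ** 2)
```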
2.1.3 The Heisenberg limit

Now we are going to show something remarkable. Let us go back to Exercise 2.1. The state ρφ is pure, so ρφ = |ψφ⟩⟨ψφ|, and
$$|\psi_\phi\rangle = \frac{|0\rangle + e^{-i\phi}|1\rangle}{\sqrt{2}} = U_\phi|\psi_0\rangle,$$
where
$$U_\phi = e^{-i\phi H} = |0\rangle\langle0| + e^{-i\phi}|1\rangle\langle1|, \qquad |\psi_0\rangle = |+\rangle, \qquad H = |1\rangle\langle1| \quad\Rightarrow\quad \lambda_{\min}=0,\ \lambda_{\max}=1.$$
Using the QCR bound derived in Eq. (2.19) we recover the result of the exercise, namely
$$\mathrm{var}(\hat\phi) \ge \frac{1}{n(1-0)^2} = \frac{1}{n},$$
and this bound is attainable. If we repeat the same experiment (measuring n copies to estimate φ) N times, the variance would, of course, be bounded as
$$\mathrm{var}(\hat\phi) \ge \frac{1}{Nn}. \qquad (2.23)$$
Suppose, however, that we proceed in a different way. Instead of preparing a product state, |ψ0⟩⊗n, and measuring each copy separately, we view the n copies as a whole system, S, with “Hamiltonian” $H_S = \sum_{k=1}^{n}H_k$, where $H_k = |1\rangle_k\langle1|\otimes\mathbb{1}_{S-\{k\}}$ is the “Hamiltonian” of the k-th qubit, and choose the fiducial state as in (2.18). In this case
$$|\psi_0\rangle = \frac{|\lambda_{\min}\rangle + |\lambda_{\max}\rangle}{\sqrt{2}} = \frac{|0\rangle^{\otimes n} + |1\rangle^{\otimes n}}{\sqrt{2}} \quad\Rightarrow\quad \lambda_{\min}=0,\ \lambda_{\max}=n$$
(note it is a highly entangled, hence very fragile, state!). Then, according to (2.19), if we repeat the experiment N times we will get a much enhanced sensitivity, with a variance scaling quadratically with the inverse of the size of the system:
$$\mathrm{var}(\hat\theta) \ge \frac{1}{N(n-0)^2} = \frac{1}{Nn^2}.$$
We refer to this behavior, var(θ̂) ∼ 1/n², as the Heisenberg limit, in contrast to the standard quantum (or shot-noise) limit, where var(θ̂) ∼ 1/n [as in Eq. (2.23)]. In quantum metrology based on interferometry, the Heisenberg limit can be achieved using squeezed states (instead of the “classical” coherent states, which have shot-noise-limited sensitivity). This falls beyond the scope of this course and will not be discussed here.
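The scaling argument can be checked directly on small systems. The sketch below (n = 6 is an arbitrary choice kept small so the 2ⁿ-dimensional arrays stay cheap) computes the QFI 4 Var(HS) for the product probe |+⟩⊗n and for the GHZ-like probe (|0···0⟩ + |1···1⟩)/√2, which should give n and n², respectively.

```python
import numpy as np

n = 6
dim = 2 ** n

# Diagonal of H_S = sum_k |1><1|_k in the computational basis:
# the number of qubits in state |1> for each basis label b.
weights = np.array([bin(b).count("1") for b in range(dim)], dtype=float)

def qfi_diag(psi):                      # 4 Var(H_S) for a pure probe, H_S diagonal
    p = np.abs(psi) ** 2
    return 4.0 * (np.sum(p * weights**2) - np.sum(p * weights) ** 2)

psi_prod = np.full(dim, 1.0 / np.sqrt(dim))              # |+>^{(x) n}
psi_ghz = np.zeros(dim); psi_ghz[0] = psi_ghz[-1] = 1 / np.sqrt(2)

print("QFI, product probe :", qfi_diag(psi_prod), " (expected n   =", n, ")")
print("QFI, GHZ probe     :", qfi_diag(psi_ghz), " (expected n^2 =", n**2, ")")
```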
2.2 Bayesian approach
To give a flavor of what the Bayesian approach is about, let us look at the pure state model of Section 2.1.2 from this point of view. We will use the cost function (1.8) introduced in Example 1.5. The averaged cost function is
$$C_H(\hat\phi) = \int_0^{2\pi}\frac{d\phi}{2\pi}\int 4\sin^2\!\left(\frac{\hat\phi_x-\phi}{2}\right)\mathrm{tr}\Big(E_x\,U_\phi|\psi_0\rangle\langle\psi_0|U_\phi^\dagger\Big)\,d^m x,$$
where |ψ0⟩ ∈ (C²)⊗n and we emphasize that the estimate φ̂ depends on the outcomes x by writing φ̂x. We assume, without loss of generality, that the outcomes, x, of the measurement are continuous random variables (vectors of some dimension m; the “volume element” d^m x is normalized such that ∫ d^m x = 1). This can be written as
$$C_H(\hat\phi) = 2\int_0^{2\pi}\frac{d\phi}{2\pi}\int\big[1-\cos(\hat\phi_x-\phi)\big]\,\langle\psi_0|U_\phi^\dagger E_x U_\phi|\psi_0\rangle\,d^m x = 2 - 2\,\mathrm{Re}\left[\int_0^{2\pi}\frac{d\phi}{2\pi}\int e^{i\hat\phi_x}e^{-i\phi}\,\langle\psi_0|U_\phi^\dagger E_x U_\phi|\psi_0\rangle\,d^m x\right].$$
The unitary matrix Uφ acts on (C²)⊗n, so it can be written as
$$U_\phi = \sum_{k=0}^{n}e^{-ik\phi}\,|k\rangle\langle k|, \qquad (2.24)$$
where each |k⟩ spans a (one-dimensional) irreducible representation of the unitary group U(1). Explicitly they are
$$|k\rangle = \binom{n}{k}^{-1/2} e^{i\varphi_k}\Big(\underbrace{|0\rangle\otimes\cdots\otimes|0\rangle}_{n-k}\otimes\underbrace{|1\rangle\otimes\cdots\otimes|1\rangle}_{k} + \text{permutations}\Big). \qquad (2.25)$$
The normalization coefficient comes about because there are $\binom{n}{k}$ different orthogonal terms on the right hand side, and the phases φk are arbitrary; we can choose them as we wish. We now note that
$$\int_0^{2\pi}\frac{d\phi}{2\pi}\,e^{-i\phi}\,U_\phi^\dagger\otimes U_\phi = \sum_{k=0}^{n-1}|k+1\rangle\langle k+1|\otimes|k\rangle\langle k|. \qquad (2.26)$$
This is so because $\int_0^{2\pi}d\phi/(2\pi)\,\exp(is\phi) = 0$ for any s ∈ Z, s ≠ 0. Using this property, we readily see that
$$C_H(\hat\phi) = 2 - 2\,\mathrm{Re}\left\{\sum_{k=0}^{n-1}\int e^{i\hat\phi_x}\,c_{k+1}c_k\,(E_x)_{k+1,k}\,d^m x\right\}. \qquad (2.27)$$
Here we have introduced the definitions $(E_x)_{k+1,k} := \langle k+1|E_x|k\rangle$ and $c_k := \langle k|\psi_0\rangle$, where the arbitrary phases φk in (2.25) have been chosen so that ck ≥ 0. We know that these phases have no physical relevance, so this choice cannot affect our result. Now, note the following chain of inequalities:
$$C_H(\hat\phi) \ge 2 - 2\left|\sum_{k=0}^{n-1}\int e^{i\hat\phi_x}\,c_{k+1}c_k\,(E_x)_{k+1,k}\,d^m x\right| \ge 2 - 2\sum_{k=0}^{n-1}\int c_{k+1}c_k\,\big|(E_x)_{k+1,k}\big|\,d^m x$$
$$\ge 2 - 2\sum_{k=0}^{n-1}\int c_{k+1}c_k\,\sqrt{(E_x)_{k,k}\,(E_x)_{k+1,k+1}}\,d^m x \ge 2 - 2\sum_{k=0}^{n-1}c_{k+1}c_k\sqrt{\int(E_x)_{k,k}\,d^m x}\,\sqrt{\int(E_x)_{k+1,k+1}\,d^m x} = 2 - 2\sum_{k=0}^{n-1}c_{k+1}c_k. \qquad (2.28)$$
In the first line we have used that |Re(z)| ≤ |z| for any z ∈ C. The triangle inequality led to the second line. The positivity condition Ex ≥ 0 implies |(Ex)k+1,k|² ≤ (Ex)k,k (Ex)k+1,k+1, with (Ex)k,k ≥ 0 and (Ex)k+1,k+1 ≥ 0, which enabled us to write the third inequality. The Schwarz inequality led to the fourth line and, finally, the POVM condition ∫Ex d^m x = 1 enabled us to get rid of the POVM operators in the last line and obtain an absolute bound. Attainability
is shown by providing an explicit measurement that saturates the bound, e.g.,
$$E_{\hat\phi} = U_{\hat\phi}\,|\Phi_0\rangle\langle\Phi_0|\,U_{\hat\phi}^\dagger, \qquad\text{where}\qquad |\Phi_0\rangle = \sum_{k=0}^{n}|k\rangle$$
(note that |Φ0⟩ is not normalized).
Exercise 2.5. Check that the set of operators {Eφ̂ | φ̂ ∈ [0, 2π)} defines a (continuous)
POVM. Show that it saturates the bound in Eq. (2.28).
The operators Eφ̂ are manifestly positive, so we only need to check that they add up to the identity operator:
$$\int\frac{d\hat\phi}{2\pi}\,E_{\hat\phi} = \int\frac{d\hat\phi}{2\pi}\,U_{\hat\phi}|\Phi_0\rangle\langle\Phi_0|U_{\hat\phi}^\dagger.$$
From Eq. (2.24) we have
$$\int_0^{2\pi}\frac{d\hat\phi}{2\pi}\,U_{\hat\phi}\otimes U_{\hat\phi}^\dagger = \sum_{k=0}^{n}\sum_{l=0}^{n}|k\rangle\langle k|\otimes|l\rangle\langle l|\int_0^{2\pi}\frac{d\hat\phi}{2\pi}\,e^{-i(k-l)\hat\phi} = \sum_{k=0}^{n}|k\rangle\langle k|\otimes|k\rangle\langle k|,$$
hence
$$\int\frac{d\hat\phi}{2\pi}\,E_{\hat\phi} = \int\frac{d\hat\phi}{2\pi}\,U_{\hat\phi}|\Phi_0\rangle\langle\Phi_0|U_{\hat\phi}^\dagger = \sum_{k=0}^{n}\big|\langle k|\Phi_0\rangle\big|^2\,|k\rangle\langle k|.$$
But
$$\langle k|\Phi_0\rangle = \langle k|\Big(\sum_{l=0}^{n}|l\rangle\Big) = \langle k|k\rangle = 1.$$
Substituting this in the previous equation we get
$$\int\frac{d\hat\phi}{2\pi}\,E_{\hat\phi} = \sum_{k=0}^{n}|k\rangle\langle k| = \mathbb{1}.$$
So the set {Eφ̂ | φ̂ ∈ [0, 2π)} defines a proper POVM. Let us now check that it saturates the bound. For our POVM, Eq. (2.27) can be written as
$$C_H(\hat\phi) = 2 - 2\,\mathrm{Re}\left\{\sum_{k=0}^{n-1}c_{k+1}c_k\int_0^{2\pi}\frac{d\hat\phi}{2\pi}\,e^{i\hat\phi}\,(E_{\hat\phi})_{k+1,k}\right\}.$$
Let us compute the integral:
$$\int_0^{2\pi}\frac{d\hat\phi}{2\pi}\,e^{i\hat\phi}\,(E_{\hat\phi})_{k+1,k} = \langle k+1|\left(\int_0^{2\pi}\frac{d\hat\phi}{2\pi}\,e^{i\hat\phi}\,U_{\hat\phi}|\Phi_0\rangle\langle\Phi_0|U_{\hat\phi}^\dagger\right)|k\rangle = \langle k+1|\left(\sum_{l=0}^{n-1}|l+1\rangle\langle l+1|\Phi_0\rangle\langle\Phi_0|l\rangle\langle l|\right)|k\rangle = \langle k+1|\Phi_0\rangle\langle\Phi_0|k\rangle = 1$$
[we have used the Hermitian conjugate of Eq. (2.26) in the second step]. Hence
$$C_H(\hat\phi) = 2 - 2\,\mathrm{Re}\left\{\sum_{k=0}^{n-1}c_{k+1}c_k\right\} = 2 - 2\sum_{k=0}^{n-1}c_{k+1}c_k.$$
Our last task is to minimize the cost function CH(φ̂). Since the ck are the components of the reference state |ψ0⟩ in the basis {|k⟩}ⁿk=0, and they are real (because of our choice of phases), we have $\sum_{k=0}^{n}c_k^2 = 1$. So,
$$C_H(\hat\phi) \ge \begin{pmatrix}c_0 & c_1 & \cdots & c_n\end{pmatrix}\begin{pmatrix} 2 & -1 & 0 & 0 & \cdots & 0 & 0 \\ -1 & 2 & -1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 2 & -1 & \ddots & 0 & 0 \\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & -1 & 2 & -1 \\ 0 & 0 & 0 & \cdots & 0 & -1 & 2 \end{pmatrix}\begin{pmatrix}c_0 \\ c_1 \\ \vdots \\ c_n\end{pmatrix}. \qquad (2.29)$$
The minimum of this quadratic form is given by the minimum eigenvalue of the symmetric matrix on the right hand side.
Exercise 2.6. Let M be an n × n symmetric matrix with entries Mij, and let {ck}ⁿk=1 be real coefficients such that $\sum_{k=1}^{n}c_k^2 = 1$. Show that the minimum of $C = \sum_{j,k=1}^{n}c_j M_{jk} c_k$ is the minimum eigenvalue of M.
Diagonalizing the matrix in (2.29) is an easy task that surely enough you have accomplished in your classical mechanics courses when dealing with coupled oscillators, as it shows up when a chain of n + 1 equal masses is connected with springs of equal strength (the loaded string). The eigenvalues are there computed to be¹
$$\lambda_j = 4\sin^2\!\left(\frac{j\pi}{2(n+2)}\right), \qquad j = 1, 2, \ldots, n+1.$$
By borrowing this result we obtain that the minimum cost is
$$C_H^{\min}(\hat\phi) = 4\sin^2\!\left(\frac{\pi}{2(n+2)}\right).$$
¹ See, e.g., Classical Dynamics of Particles and Systems (fifth edition), Stephen T. Thornton and Jerry B. Marion, Brooks/Cole (2004).
Exercise 2.7. Find the optimal reference state |ψ0 i. Namely, compute the coefficients ck
for which the minimum cost is attained.
Quite remarkably, this average cost is exact for all n! One can easily verify that
$$C_H^{\min}(\hat\phi) \sim \frac{\pi^2}{n^2} \qquad\text{as } n\to\infty.$$
Since on average φ̂ is very close to φ asymptotically, we have
$$\mathrm{MSE}(\hat\phi) \approx C_H^{\min}(\hat\phi) \sim \frac{\pi^2}{n^2} \qquad\text{as } n\to\infty.$$
This tells us that the protocol consisting of preparing a system of n qubits in the state |ψ0⟩ and performing the measurement defined by {Eφ̂ | 0 ≤ φ̂ < 2π}, whose outcome is the estimate φ̂, has Heisenberg-limited sensitivity. The factor π² is the price we pay for using a global estimator.
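A small numerical sketch of the loaded-string spectrum used above (n = 10 is an arbitrary choice): build the (n+1) × (n+1) tridiagonal matrix of Eq. (2.29), compare its smallest eigenvalue with 4 sin²(π/(2(n+2))), and inspect the corresponding eigenvector, whose entries are the optimal coefficients ck asked for in Exercise 2.7.

```python
import numpy as np

n = 10
dim = n + 1
M = 2 * np.eye(dim) - np.eye(dim, k=1) - np.eye(dim, k=-1)   # matrix of Eq. (2.29)

evals, evecs = np.linalg.eigh(M)
c_opt = np.abs(evecs[:, 0])                 # phases chosen so that c_k >= 0

print("min eigenvalue           :", evals[0])
print("4 sin^2(pi/(2(n+2)))     :", 4 * np.sin(np.pi / (2 * (n + 2))) ** 2)
print("optimal c_k (normalized) :", np.round(c_opt, 4))
```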