file - BioMed Central

Dankar et al: Estimating the Re-identification Risk of Clinical Data Sets: A Simulation Study
Appendix A: Uniqueness Estimation Models
We first introduce some notation, then briefly describe all the estimation models evaluated in this paper. Let $N$ and $n$ be the size of the population and the sample respectively, let $K$ and $u$ denote the number of non-zero equivalence classes in the population and the sample respectively, and let $F_i$ and $f_i$ denote the size of the $i$-th equivalence class in the population and the sample respectively, where $i \in \{1,\dots,K\}$ ($i \in \{1,\dots,u\}$ respectively). Also, let $S_i$ and $s_i$ be the number of equivalence classes of size $i$ in the population and the sample respectively, and let $\pi = n/N$ be the sampling fraction.
We define $P(F = 1 \mid f = 1)$ as the probability that an equivalence class of size 1 in the sample was chosen from an equivalence class of size 1 in the population. Note, however, that estimating the number of sample uniques that are population uniques does not by itself tell us how many uniques are in the population.
Below is a brief description of the models evaluated using the above notation:

Equivalence class model (Zayatz): This model uses Bayes' rule to calculate the probability that a sample unique is also a population unique:
$$P(F = 1 \mid f = 1) = \frac{p_1 \, P(f = 1 \mid F = 1)}{\sum_j p_j \, P(f = 1 \mid F = j)}$$
where $p_j$ is the proportion of the equivalence classes in the population that are of size $j$, $p_j$ is estimated using the sample, and $P(f = i \mid F = j)$ follows a hypergeometric distribution for all values of $j$. The number of population uniques becomes:
$$S_1 = s_1 \, P(F = 1 \mid f = 1) \, \frac{N}{n}.$$
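As an illustration only, the Bayes' rule calculation above can be sketched in Python. The function names, the dictionary layout for the sample class counts, and in particular the naive estimate $p_j \approx s_j / u$ are assumptions made for this sketch; the text states only that $p_j$ is estimated using the sample, not how.

```python
from math import comb


def hypergeom_pmf(i, j, N, n):
    """P(f = i | F = j): i of the j class members fall in a sample of n out of N."""
    return comb(j, i) * comb(N - j, n - i) / comb(N, n)


def zayatz_population_uniques(s, N, n):
    """s maps class size j to s_j, the number of sample classes of that size.

    Returns (P(F=1 | f=1), estimated number of population uniques S_1).
    """
    u = sum(s.values())
    # ASSUMPTION: p_j is approximated by the sample proportion s_j / u.
    p = {j: count / u for j, count in s.items()}
    numerator = p.get(1, 0.0) * hypergeom_pmf(1, 1, N, n)
    denominator = sum(p_j * hypergeom_pmf(1, j, N, n) for j, p_j in p.items())
    prob = numerator / denominator
    return prob, s.get(1, 0) * prob * N / n


if __name__ == "__main__":
    # Hypothetical sample: 5 classes of size 1 and 3 classes of size 2.
    print(zayatz_population_uniques({1: 5, 2: 3}, N=110, n=11))
```

Exact integer binomial coefficients are used here for clarity; for large $N$ a log-gamma formulation would likely be preferable.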
Pitman model: The Pitman model is defined by:
$$P(S_1, S_2, \dots, S_N) = N! \, \frac{\theta(\theta+\alpha)(\theta+2\alpha)\cdots(\theta+(K-1)\alpha)}{\theta(\theta+1)\cdots(\theta+N-1)} \prod_{j=1}^{N} \left( \frac{(1-\alpha)^{[j-1]}}{j!} \right)^{S_j} \frac{1}{S_j!}$$
where $(1-\alpha)^{[j-1]} = (1-\alpha)(2-\alpha)\cdots(j-1-\alpha)$, and $\theta$ and $\alpha$ are real parameters describing the sampling scheme in the Pitman model (refer to [66] for more details). The population uniques are then estimated as:
$$E(S_1) \approx \frac{\Gamma(\theta+1)}{\Gamma(\theta+\alpha)} \, N^{\alpha}$$
where $\Gamma$ is the gamma function. The Pitman model has the exchangeability property with respect to individuals in the population. Hoshino uses this property to construct the Maximum Likelihood Estimators of $\theta$ and $\alpha$ [66]:
n 1
L u 1 1
1


0
 i 1   i i 1   i
n
i 1
L u 1 i
1

  si 
0
 i 1   i i  2 j 1 j  
With:
$$L = \log\left( n! \, \frac{\theta(\theta+\alpha)(\theta+2\alpha)\cdots(\theta+(u-1)\alpha)}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_{j=1}^{n} \left( \frac{(1-\alpha)^{[j-1]}}{j!} \right)^{s_j} \frac{1}{s_j!} \right)$$
$$= \text{const} + \sum_{i=1}^{u-1} \log(\theta + i\alpha) - \sum_{j=1}^{n-1} \log(\theta + j) + \sum_{j=2}^{n} s_j \left( \sum_{i=1}^{j-1} \log(i - \alpha) \right)$$
where const is a value not depending on $\theta$ or $\alpha$.
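The log-likelihood and its two first derivatives translate directly into code. The following is a minimal Python sketch, not Hoshino's implementation; the function names and the dictionary representation of the $s_j$ counts are assumptions made here, and the parameters are assumed to satisfy $\theta + i\alpha > 0$ and $\alpha < 1$ so that every logarithm is defined.

```python
from math import log


def pitman_loglik(theta, alpha, s, n):
    """Log-likelihood L minus the constant term.

    s maps class size j to s_j (number of sample classes of size j);
    n is the sample size, so n == sum(j * s_j).
    """
    u = sum(s.values())
    new_class_term = sum(log(theta + i * alpha) for i in range(1, u))
    normaliser = sum(log(theta + j) for j in range(1, n))
    size_term = sum(
        s_j * sum(log(i - alpha) for i in range(1, j))
        for j, s_j in s.items() if j >= 2
    )
    return new_class_term - normaliser + size_term


def pitman_score(theta, alpha, s, n):
    """First derivatives (dL/dtheta, dL/dalpha); both are zero at the MLE."""
    u = sum(s.values())
    d_theta = (sum(1.0 / (theta + i * alpha) for i in range(1, u))
               - sum(1.0 / (theta + i) for i in range(1, n)))
    d_alpha = (sum(i / (theta + i * alpha) for i in range(1, u))
               - sum(s_i * sum(1.0 / (j - alpha) for j in range(1, i))
                     for i, s_i in s.items() if i >= 2))
    return d_theta, d_alpha


if __name__ == "__main__":
    # Hypothetical sample: n = 100 records in u = 60 equivalence classes.
    sample = {1: 40, 2: 10, 3: 5, 5: 5}
    print(pitman_loglik(34.4, 0.3, sample, 100))
    print(pitman_score(34.4, 0.3, sample, 100))
```

A root of `pitman_score` could then be sought with any two-dimensional root finder; the Newton-Raphson iteration additionally needs the second derivatives.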
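Hoshino then solves the above equations by the Newton-Raphson method using the second derivatives $\partial^2 L / \partial\theta\,\partial\alpha$, $\partial^2 L / (\partial\theta)^2$ and $\partial^2 L / (\partial\alpha)^2$, with starting values:
$$\alpha = \frac{\theta (s_1 - n) + (n-1) s_1}{nu} \quad \text{and} \quad \theta = \frac{nuc - s_1 (n-1)(2u + c)}{2 s_1 u + s_1 c - nc}, \quad \text{where } c = \frac{s_1 (s_1 - 1)}{s_2}.$$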
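The starting values are simple closed-form expressions of $n$, $u$, $s_1$ and $s_2$. The Python sketch below evaluates them and plugs a $(\theta, \alpha)$ pair into the $E(S_1)$ formula given earlier. The function names and the example counts are hypothetical, and in practice the final maximum likelihood estimates, not the starting values, would be the ones passed to the second function.

```python
from math import exp, lgamma, log


def pitman_start_values(n, u, s1, s2):
    """Starting values (theta, alpha) for the Newton-Raphson iteration."""
    c = s1 * (s1 - 1) / s2
    theta = (n * u * c - s1 * (n - 1) * (2 * u + c)) / (2 * s1 * u + s1 * c - n * c)
    alpha = (theta * (s1 - n) + (n - 1) * s1) / (n * u)
    return theta, alpha


def pitman_expected_uniques(theta, alpha, N):
    """E(S_1) ~ Gamma(theta + 1) / Gamma(theta + alpha) * N**alpha.

    Computed on the log scale; assumes theta + alpha > 0.
    """
    return exp(lgamma(theta + 1.0) - lgamma(theta + alpha) + alpha * log(N))


if __name__ == "__main__":
    # Hypothetical sample summary: n = 100, u = 60, s1 = 40, s2 = 10.
    theta0, alpha0 = pitman_start_values(100, 60, 40, 10)
    print(theta0, alpha0)
    print(pitman_expected_uniques(theta0, alpha0, N=10_000))
```

Working with `lgamma` rather than the gamma function itself avoids overflow when $\theta$ is large.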
Slide negative binomial model: This model assumes a slide negative binomial distribution for the population cell frequencies:
$$P(F_k = y) = \frac{\Gamma(\alpha + (y-1))}{\Gamma(\alpha)\,(y-1)!} \, \beta^{\alpha} (1-\beta)^{y-1},$$
where $\alpha$ and $\beta$ are the parameters of the gamma distribution that models the expected population cell frequency. The expected number of uniques in the population is then shown to be $E(S_1) = K \beta^{\alpha}$. To estimate the $\alpha$ and $\beta$ parameters, the following equations need to be solved numerically, where $\pi = n/N$ is the sampling fraction:
$$s_1 = K \pi \beta^{\alpha} \, \frac{1 - (1-\alpha)(1-\beta)(1-\pi)}{\left[ 1 - (1-\beta)(1-\pi) \right]^{\alpha+1}}$$
$$s_2 = K \alpha \beta^{\alpha} \pi^{2} (1-\beta) \, \frac{2 - (1-\alpha)(1-\beta)(1-\pi)}{2 \left[ 1 - (1-\beta)(1-\pi) \right]^{\alpha+2}}$$
In our implementation we used the modified Shlosser estimate for K described in [95]. This
particular estimate was the most likely to result in convergence of the SNB model based on our
simulations.
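To make the slide negative binomial distribution concrete, the sketch below evaluates its probability mass function and the resulting expected number of population uniques, $K \beta^{\alpha}$. It is an illustration under assumptions, not the implementation used in the simulations: the function names and parameter values are hypothetical, and it does not cover the numerical solution for $\alpha$ and $\beta$ or the Shlosser estimate of $K$.

```python
from math import exp, lgamma, log


def snb_pmf(y, alpha, beta):
    """P(F_k = y) for y = 1, 2, ... (assumes alpha > 0 and 0 < beta < 1)."""
    if y < 1:
        return 0.0
    # (y - 1)! is written as Gamma(y) so the whole term stays on the log scale.
    log_p = (lgamma(alpha + y - 1) - lgamma(alpha) - lgamma(y)
             + alpha * log(beta) + (y - 1) * log(1.0 - beta))
    return exp(log_p)


def snb_expected_uniques(K, alpha, beta):
    """E(S_1) = K * P(F_k = 1) = K * beta**alpha."""
    return K * beta ** alpha


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    alpha, beta, K = 0.8, 0.3, 5000
    print(snb_pmf(1, alpha, beta), snb_expected_uniques(K, alpha, beta))
```

Setting $y = 1$ in the probability mass function gives $\beta^{\alpha}$, which is why the expected number of uniques is $K \beta^{\alpha}$.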

Mu-Argus: This model has not been used in the context of population uniques estimation. However, it can be used to calculate $\sum_k P(F_k = 1 \mid f_k = 1)$, i.e., the expected number of sample uniques that are population uniques. The total number of population uniques can then be estimated from this quantity, as is the case with the equivalence class model above, i.e.
$$S_1 = \sum_k P(F_k = 1 \mid f_k = 1) \, \frac{N}{n}.$$
It proposes a model with the assumption $F_k \mid f_k \sim NB(f_k, p_k)$, $F_k \ge f_k$; this is the number of trials until $f_k$ successes occur with the probability of success being $p_k$. To calculate $P(F_k = 1 \mid f_k = 1)$, one needs to estimate $p_k$. Benedetti proposed [65]:
$$\hat{p}_k = \frac{f_k}{\hat{F}_k^{D}}, \quad \text{where } \hat{F}_k^{D} = \sum_i w_i$$
is the initial estimate for the population and $w_i$ are the sampling weights. However, since we use simple random sampling we have $\hat{p}_k = n/N$, and the number of sample uniques that are population uniques becomes
$$\sum_{k:\, f_k = 1} P(F_k = 1 \mid f_k = 1) = \hat{p}_k \, s_1.$$