file - BioMed Central

Dankar et al: Estimating the Re-identification Risk of Clinical Data Sets: A Simulation Study

Appendix A: Uniqueness Estimation Models

We first introduce some notation, then briefly describe the estimation models evaluated in this paper. Let $N$ and $n$ be the size of the population and the sample respectively, let $K$ and $u$ denote the number of non-zero equivalence classes in the population and the sample respectively, and let $F_i$ and $f_i$ denote the size of the $i$-th equivalence class in the population and the sample respectively, where $i \in \{1,\ldots,K\}$ ($\{1,\ldots,u\}$ respectively). Also, let $S_i$ and $s_i$ be the number of equivalence classes of size $i$ in the population and the sample respectively, and let $\pi$ be the sampling fraction.

We define $P(F=1 \mid f=1)$ as the probability that an equivalence class of size 1 in the sample was chosen from an equivalence class of size 1 in the population. Note, however, that estimating the number of sample uniques that are population uniques does not by itself tell us how many uniques are in the population.

Below is a brief description of the models evaluated, using the above notation.

Equivalence class model (Zayatz): This model uses Bayes' rule to calculate the probability that a sample unique is also a population unique:

$$P(F=1 \mid f=1) = \frac{p_1\,P(f=1 \mid F=1)}{\sum_j p_j\,P(f=1 \mid F=j)}$$

where $p_j$ is the proportion of the equivalence classes in the population that are of size $j$. $p_j$ is estimated using the sample, and $P(f=i \mid F=j)$ follows a hypergeometric distribution for all $j$. The number of population uniques becomes:

$$\hat{S}_1 = s_1\,P(F=1 \mid f=1)\,\frac{N}{n}$$

Pitman model: The Pitman model is defined by:

$$P(S_1, S_2, \ldots, S_N) = N!\,\frac{(\theta+\alpha)(\theta+2\alpha)\cdots(\theta+(K-1)\alpha)}{(\theta+1)\cdots(\theta+N-1)} \prod_{j=1}^{N} \left(\frac{(1-\alpha)^{[j-1]}}{j!}\right)^{S_j} \frac{1}{S_j!}$$

where $(1-\alpha)^{[j-1]} = (1-\alpha)(2-\alpha)\cdots(j-1-\alpha)$, and $\theta$ and $\alpha$ are real parameters describing the sampling scheme in the Pitman model (refer to [66] for more details). The population uniques are then estimated as:

$$E(S_1) = \frac{\Gamma(\theta+1)}{\Gamma(\theta+\alpha)}\,N^{\alpha}$$

where $\Gamma$ is the gamma function. The Pitman model has the exchangeability property with respect to individuals in the population. Hoshino uses this property to construct the maximum likelihood estimators of $\theta$ and $\alpha$ [66]:

$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^{u-1} \frac{1}{\theta+i\alpha} - \sum_{i=1}^{n-1} \frac{1}{\theta+i} = 0$$

$$\frac{\partial L}{\partial \alpha} = \sum_{i=1}^{u-1} \frac{i}{\theta+i\alpha} - \sum_{i=2}^{n} s_i \sum_{j=1}^{i-1} \frac{1}{j-\alpha} = 0$$

with:

$$L = \log\left[ n!\,\frac{(\theta+\alpha)(\theta+2\alpha)\cdots(\theta+(u-1)\alpha)}{(\theta+1)\cdots(\theta+n-1)} \prod_{j=1}^{n} \left(\frac{(1-\alpha)^{[j-1]}}{j!}\right)^{s_j} \frac{1}{s_j!} \right]$$

$$\phantom{L} = \text{const} + \sum_{i=1}^{u-1} \log(\theta+i\alpha) - \sum_{j=1}^{n-1} \log(\theta+j) + \sum_{j=2}^{n} s_j \sum_{i=1}^{j-1} \log(i-\alpha)$$

where const is a value not depending on $\theta$ or $\alpha$. Hoshino then solves the above equations by the Newton-Raphson method using the second derivatives $\frac{\partial^2 L}{(\partial\theta)^2}$, $\frac{\partial^2 L}{(\partial\alpha)^2}$ and $\frac{\partial^2 L}{\partial\theta\,\partial\alpha}$, with starting values:

$$\theta = \frac{nuc - s_1(n-1)(2u+c)}{2s_1u + s_1c - nc} \quad\text{and}\quad \alpha = \frac{\theta(s_1-n) + (n-1)s_1}{nu}, \quad\text{where } c = \frac{s_1(s_1-1)}{s_2}.$$

Slide negative binomial (SNB) model: This model assumes a slide negative binomial distribution for the population cell frequencies:

$$P(F_k = y) = \frac{\Gamma(\alpha+(y-1))}{\Gamma(\alpha)\,(y-1)!}\,\beta^{\alpha}(1-\beta)^{y-1}$$

where $\alpha$ and $\beta$ are the parameters of the gamma distribution that models the expected population cell frequency. The expected number of uniques in the population is then shown to be:

$$E(S_1) = K\beta^{\alpha}.$$

To estimate the $\alpha$ and $\beta$ parameters, the following equations need to be solved numerically:

$$s_1 = K\pi \left(\frac{\beta}{1-(1-\beta)(1-\pi)}\right)^{\alpha} \left(1 + \frac{\alpha(1-\beta)(1-\pi)}{1-(1-\beta)(1-\pi)}\right)$$

$$s_2 = K\,\frac{\alpha\pi^2(1-\beta)}{2} \left(\frac{\beta}{1-(1-\beta)(1-\pi)}\right)^{\alpha} \frac{2\left[1-(1-\beta)(1-\pi)\right] + (1+\alpha)(1-\beta)(1-\pi)}{\left[1-(1-\beta)(1-\pi)\right]^2}$$

In our implementation we used the modified Shlosser estimate for $K$ described in [95]. Based on our simulations, this particular estimate was the most likely to result in convergence of the SNB model.

Mu-Argus: This model has not been used in the context of population uniques estimation. However, it can be used to calculate $\sum_k P(F_k=1 \mid f_k=1)$, i.e., the expected number of sample uniques that are population uniques. The total number of population uniques can then be estimated from this quantity, as is the case with the equivalence class model above, i.e.

$$\hat{S}_1 = \sum_k P(F_k=1 \mid f_k=1)\,\frac{N}{n}$$

It is a model with the assumption $F_k \mid f_k \sim NB(f_k, p_k)$; this is the number of trials until $f_k$ successes occur, with the probability of success being $p_k$. To calculate $P(F_k=1 \mid f_k=1)$, one needs to estimate $p_k$. Benedetti proposed [65]:

$$\hat{p}_k = \frac{f_k}{\hat{F}_k^{D}}, \quad\text{where } \hat{F}_k^{D} = \sum_i w_i$$

is the initial estimate for the population and $w_i$ are the sampling weights. However, since we use simple random sampling we have $\hat{p}_k = n/N$, and the number of sample uniques that are population uniques becomes:

$$\sum_{k:\,f_k=1} P(F_k=1 \mid f_k=1) = \hat{p}_k\,s_1.$$
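As a rough illustration of the equivalence class (Zayatz) calculation described earlier in this appendix, the following Python sketch evaluates the Bayes' rule probability and the resulting estimate of population uniques. It is not the implementation used in the paper. The function names, the choice of s_j / u as the sample-based estimate of p_j, and the log-gamma evaluation of the hypergeometric term are all assumptions made for this example.

```python
from math import exp, lgamma, log


def log_comb(a, b):
    """Natural log of the binomial coefficient C(a, b), via log-gamma."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)


def prob_sample_unique(j, N, n):
    """Hypergeometric P(f = 1 | F = j): exactly one of the j members of a
    population equivalence class falls in a sample of size n drawn without
    replacement from a population of size N."""
    if n < 1 or j > N or n - 1 > N - j:
        return 0.0
    return exp(log(j) + log_comb(N - j, n - 1) - log_comb(N, n))


def zayatz_estimate(size_counts, N, n):
    """size_counts maps an equivalence class size i to s_i, the number of
    sample classes of that size.

    Assumption (not from the paper): p_j is estimated as s_j / u, the share
    of sample equivalence classes that have size j.

    Returns (P(F=1 | f=1), estimated number of population uniques).
    """
    u = sum(size_counts.values())   # number of non-zero sample classes
    s1 = size_counts.get(1, 0)      # number of sample uniques
    if u == 0 or s1 == 0:
        return 0.0, 0.0
    # Denominator of Bayes' rule: sum over j of p_j * P(f=1 | F=j)
    denom = sum((c / u) * prob_sample_unique(j, N, n)
                for j, c in size_counts.items())
    p_unique = (s1 / u) * prob_sample_unique(1, N, n) / denom
    # S_1 = s_1 * P(F=1 | f=1) * N / n
    return p_unique, s1 * p_unique * N / n
```

A call such as `zayatz_estimate({1: 10, 2: 5}, N=1000, n=100)` returns the estimated probability that a sample unique is a population unique, together with the scaled-up count of population uniques. Working in log space avoids forming very large binomial coefficients when N is large.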
