Dankar et al: Estimating the Re-identification Risk of Clinical Data Sets: A Simulation Study

Appendix A: Uniqueness Estimation Models

We first introduce some notation and then briefly describe the estimation models evaluated in this paper. Let $N$ and $n$ be the sizes of the population and the sample respectively, let $K$ and $u$ denote the number of non-zero equivalence classes in the population and the sample respectively, and let $F_i$ and $f_i$ denote the size of the $i$th equivalence class in the population and the sample respectively, where $i \in \{1,\dots,K\}$ ($i \in \{1,\dots,u\}$ respectively). Also, let $S_i$ and $s_i$ be the number of equivalence classes of size $i$ in the population and the sample respectively, and let $\pi = n/N$ be the sampling fraction. We define $P(F=1 \mid f=1)$ as the probability that an equivalence class of size 1 in the sample was chosen from an equivalence class of size 1 in the population. Note, however, that estimating the number of sample uniques that are population uniques does not by itself tell us how many uniques are in the population.

Below is a brief description of the models evaluated, using the above notation.

Equivalence class model (Zayatz): This model uses Bayes' rule to calculate the probability that a sample unique is also a population unique:

$$P(F=1 \mid f=1) = \frac{p_1\,P(f=1 \mid F=1)}{\sum_j p_j\,P(f=1 \mid F=j)},$$

where $p_j$ is the proportion of the equivalence classes in the population that are of size $j$. Here $p_j$ is estimated using the sample, and $P(f=i \mid F=j)$ follows a hypergeometric distribution for all $j$. The estimated number of population uniques becomes:

$$\hat{S}_1 = s_1\,P(F=1 \mid f=1)\,\frac{N}{n}.$$

Pitman model: The Pitman model is defined by:

$$P(S_1, S_2, \dots, S_N) = N!\,\frac{(\theta+\alpha)(\theta+2\alpha)\cdots(\theta+(K-1)\alpha)}{(\theta+1)\cdots(\theta+N-1)}\,\prod_{j=1}^{N}\left(\frac{(1-\alpha)^{[j-1]}}{j!}\right)^{S_j}\frac{1}{S_j!},$$

where $(1-\alpha)^{[j-1]} = (1-\alpha)(2-\alpha)\cdots(j-1-\alpha)$, and $\theta$ and $\alpha$ are real parameters describing the sampling scheme in the Pitman model (refer to [66] for more details).
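As an illustration of the equivalence class (Zayatz) model described above, the following Python sketch computes $P(F=1 \mid f=1)$ and the resulting estimate of population uniques from a list of sample equivalence class sizes. It is a minimal sketch and not the implementation used in the study: the function names are hypothetical, $p_j$ is approximated by the share of sample equivalence classes of size $j$ (the text only states that $p_j$ is estimated from the sample), and the sum over $j$ is truncated at the largest class size observed in the sample.

```python
from collections import Counter
from math import comb


def hypergeom_pmf(i, j, N, n):
    """P(f = i | F = j): probability that i of the j members of a population
    equivalence class fall into a simple random sample of n records out of N."""
    if i < 0 or i > j or n - i < 0 or n - i > N - j:
        return 0.0
    return comb(j, i) * comb(N - j, n - i) / comb(N, n)


def zayatz_estimate(sample_class_sizes, N):
    """Return (P(F=1 | f=1), estimated number of population uniques).

    sample_class_sizes: sizes f_1, ..., f_u of the sample equivalence classes.
    N: population size.
    """
    n = sum(sample_class_sizes)      # sample size
    u = len(sample_class_sizes)      # number of sample equivalence classes
    s = Counter(sample_class_sizes)  # s[j] = number of sample classes of size j

    # Assumption (not from the source): p_j is approximated by s_j / u.
    p = {j: s_j / u for j, s_j in s.items()}

    # Bayes' rule with hypergeometric likelihoods P(f = 1 | F = j).
    numerator = p.get(1, 0.0) * hypergeom_pmf(1, 1, N, n)
    denominator = sum(p_j * hypergeom_pmf(1, j, N, n) for j, p_j in p.items())
    prob = numerator / denominator if denominator > 0 else 0.0

    # Scale the expected number of sample uniques that are population
    # uniques by N / n, as in the estimate given in the text.
    return prob, s[1] * prob * N / n


if __name__ == "__main__":
    prob, uniques = zayatz_estimate([1, 1, 2], N=8)
    print(prob, uniques)
```

The exact integer arithmetic of `math.comb` avoids overflow for large $N$, at the cost of speed; a library hypergeometric distribution could be swapped in for realistic data set sizes.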
Under the Pitman model, the number of population uniques is then estimated as:

$$E(S_1) \approx \frac{\Gamma(\theta+1)}{\Gamma(\theta+\alpha)}\,N^{\alpha},$$

where $\Gamma$ is the gamma function. The Pitman model has the exchangeability property with respect to individuals in the population. Hoshino uses this property to construct the maximum likelihood estimators of $\theta$ and $\alpha$ [66]:

$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^{u-1}\frac{1}{\theta+i\alpha} - \sum_{i=1}^{n-1}\frac{1}{\theta+i} = 0,$$

$$\frac{\partial L}{\partial \alpha} = \sum_{i=1}^{u-1}\frac{i}{\theta+i\alpha} - \sum_{i=2}^{n} s_i \sum_{j=1}^{i-1}\frac{1}{j-\alpha} = 0,$$

with:

$$L = \log\left[n!\,\frac{(\theta+\alpha)(\theta+2\alpha)\cdots(\theta+(u-1)\alpha)}{(\theta+1)\cdots(\theta+n-1)}\,\prod_{j=1}^{n}\left(\frac{(1-\alpha)^{[j-1]}}{j!}\right)^{s_j}\frac{1}{s_j!}\right]$$

$$\phantom{L} = \text{const} + \sum_{i=1}^{u-1}\log(\theta+i\alpha) - \sum_{j=1}^{n-1}\log(\theta+j) + \sum_{j=2}^{n} s_j \sum_{i=1}^{j-1}\log(i-\alpha),$$

where const is a value not depending on $\theta$ or $\alpha$. Hoshino then solves the above equations by the Newton-Raphson method using the second derivatives $\partial^2 L/\partial\theta\,\partial\alpha$, $\partial^2 L/(\partial\theta)^2$ and $\partial^2 L/(\partial\alpha)^2$, and with starting values:

$$\hat{\theta} = \frac{nuc - s_1(n-1)(2u+c)}{2s_1u + s_1c - nc} \quad\text{and}\quad \hat{\alpha} = \frac{\hat{\theta}(s_1-n) + (n-1)s_1}{nu}, \quad\text{where } c = \frac{s_1(s_1-1)}{s_2}.$$

Slide negative binomial model: This model assumes a slide negative binomial (SNB) distribution for the population cell frequencies:

$$P(F_k = y) = \frac{\Gamma(\alpha+(y-1))}{\Gamma(\alpha)\,(y-1)!}\,\beta^{\alpha}(1-\beta)^{y-1},$$

where $\alpha$ and $\beta$ are the parameters of the gamma distribution that models the expected population cell frequency. The expected number of uniques in the population is then shown to be:

$$E(S_1) = K\beta^{\alpha}.$$

To estimate the $\alpha$ and $\beta$ parameters, the following equations need to be solved numerically, where $\pi = n/N$ is the sampling fraction:

$$s_1 = K\pi\left(\frac{\beta}{1-(1-\beta)(1-\pi)}\right)^{\alpha}\left[1 + \frac{\alpha(1-\beta)(1-\pi)}{1-(1-\beta)(1-\pi)}\right],$$

$$s_2 = K\,\frac{\pi^{2}\,\alpha(1-\beta)\,\beta^{\alpha}\left[2-(1-\alpha)(1-\beta)(1-\pi)\right]}{2\left[1-(1-\beta)(1-\pi)\right]^{\alpha+2}}.$$

In our implementation we used the modified Shlosser estimate for $K$ described in [95]. This particular estimate was the most likely to result in convergence of the SNB model based on our simulations.

Mu-Argus: This model has not been used in the context of population uniques estimation.
However, it can be used to calculate $\sum_{k:\,f_k=1} P(F_k=1 \mid f_k=1)$, i.e., the expected number of sample uniques that are population uniques. The total number of population uniques can then be estimated from this quantity, as is the case with the equivalence class model above, i.e.

$$\hat{S}_1 = \sum_{k:\,f_k=1} P(F_k=1 \mid f_k=1)\,\frac{N}{n}.$$

The model rests on the assumption $F_k \mid f_k \sim NB(f_k, p_k)$, with $F_k \ge f_k$: this is the number of trials until $f_k$ successes occur, with the probability of success being $p_k$. To calculate $P(F_k=1 \mid f_k=1)$, one needs to estimate $p_k$. Benedetti proposed [65]:

$$\hat{p}_k = \frac{f_k}{\hat{F}_k^{D}}, \quad\text{where}\quad \hat{F}_k^{D} = \sum_{i} w_i$$

is the initial estimate for the population, the sum runs over the sample records $i$ in equivalence class $k$, and $w_i$ are the sampling weights. However, since we use simple random sampling, we have $p_k = n/N$, and the number of sample uniques that are population uniques becomes

$$\sum_{k:\,f_k=1} P(F_k=1 \mid f_k=1) = p_k\,s_1.$$
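Returning to the Pitman model earlier in this appendix, the quantities needed for the estimation can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the function names are hypothetical, the sample is summarised as a dictionary mapping each class size $j$ to its count $s_j$, the symbols $\theta$ and $\alpha$ denote the two Pitman parameters, and the Newton-Raphson iteration itself is left out (the score function below is what such a solver would drive to zero, starting from the given starting values).

```python
from math import exp, lgamma, log


def _n_and_u(s):
    """Sample size n and number of sample classes u from {size j: count s_j}."""
    n = sum(j * s_j for j, s_j in s.items())
    u = sum(s.values())
    return n, u


def pitman_loglik(theta, alpha, s):
    """Log-likelihood L up to the additive constant that is free of theta, alpha."""
    n, u = _n_and_u(s)
    value = sum(log(theta + i * alpha) for i in range(1, u))
    value -= sum(log(theta + j) for j in range(1, n))
    value += sum(s_j * sum(log(i - alpha) for i in range(1, j))
                 for j, s_j in s.items())
    return value


def pitman_score(theta, alpha, s):
    """First derivatives (dL/dtheta, dL/dalpha); both are zero at the MLE."""
    n, u = _n_and_u(s)
    d_theta = (sum(1.0 / (theta + i * alpha) for i in range(1, u))
               - sum(1.0 / (theta + i) for i in range(1, n)))
    d_alpha = (sum(i / (theta + i * alpha) for i in range(1, u))
               - sum(s_j * sum(1.0 / (i - alpha) for i in range(1, j))
                     for j, s_j in s.items()))
    return d_theta, d_alpha


def pitman_start(n, u, s1, s2):
    """Starting values (theta, alpha) for the iteration; requires s2 > 0."""
    c = s1 * (s1 - 1) / s2
    theta = (n * u * c - s1 * (n - 1) * (2 * u + c)) / (2 * s1 * u + s1 * c - n * c)
    alpha = (theta * (s1 - n) + (n - 1) * s1) / (n * u)
    return theta, alpha


def pitman_expected_uniques(theta, alpha, N):
    """Approximate E(S_1) = Gamma(theta + 1) / Gamma(theta + alpha) * N**alpha."""
    return exp(lgamma(theta + 1.0) - lgamma(theta + alpha) + alpha * log(N))


if __name__ == "__main__":
    theta0, alpha0 = pitman_start(n=100, u=50, s1=30, s2=10)
    print(theta0, alpha0, pitman_expected_uniques(theta0, alpha0, N=10000))
```

Working with `lgamma` rather than the gamma function directly keeps the estimate of $E(S_1)$ finite for large $\theta$. The starting values are not guaranteed to fall in the valid parameter range for every sample, so a solver built on this sketch would need to check them before iterating.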