SUPPORTING INFORMATION

Selecting thresholds for the prediction of species occurrence with presence-only data

Canran Liu, Matt White and Graeme Newell

Journal of Biogeography

Appendix S1 Theoretical explanation for some threshold selection methods

The evaluation data may be collected in two ways: (1) one random sample, or (2) two separate random samples.

(1) We assume that the data set is composed of points completely randomly sampled from the study area. Within this data set, the proportion of true presences is $p$ ($0 < p < 1$), where $p$ is the species’ overall prevalence; the proportion of true absences is $1 - p$. Let $Se$ and $Sp$ represent the true sensitivity and true specificity of the model. For this data set with true presences and true absences, we can construct a contingency table as follows:

\[
\begin{array}{l|cc|c}
\text{Predicted} & \text{Observed presence} & \text{Observed absence} & \text{Total} \\ \hline
\text{Presence} & p_{11} = pSe & p_{01} = (1-p)(1-Sp) & p_{+1} = pSe + (1-p)(1-Sp) \\
\text{Absence} & p_{10} = p(1-Se) & p_{00} = (1-p)Sp & p_{+0} = p(1-Se) + (1-p)Sp \\ \hline
\text{Total} & p_{1+} = p & p_{0+} = 1-p & 1
\end{array}
\]

However, if only part of the true presences are known with a proportion $p - r$ ($0 < r < p < 1$), the other part of the true presences with a proportion $r$ are unknown, and this part of the true presences and all the true absences together are considered as pseudo-absences, which account for a proportion $1 - p + r$ altogether. In the following, we will mark those metrics calculated with presence-only data with the prime symbol. Then a contingency table for the presence-only data with population parameters can be constructed as follows:

\[
\begin{array}{l|cc|c}
\text{Predicted} & \text{Observed presence} & \text{Observed absence} & \text{Total} \\ \hline
\text{Presence} & p'_{11} = (p-r)Se & p'_{01} = rSe + (1-p)(1-Sp) & p'_{+1} = pSe + (1-p)(1-Sp) \\
\text{Absence} & p'_{10} = (p-r)(1-Se) & p'_{00} = r(1-Se) + (1-p)Sp & p'_{+0} = p(1-Se) + (1-p)Sp \\ \hline
\text{Total} & p'_{1+} = p-r & p'_{0+} = 1-p+r & 1
\end{array}
\]

From the above two tables, we can see that $p_{+1} = p'_{+1}$ and $p_{+0} = p'_{+0}$, which will be used in the following derivation. We can also derive

\[
Sp' = \frac{r(1-Se) + (1-p)Sp}{1-p+r}.
\]

(2) In reality, the presence-only data set may contain two components, which are usually sampled separately.
One is the presence component containing true presences, and the other is the absence component containing both true presences and true absences with proportions $s$ and $1 - s$ ($0 \le s < 1$), respectively. The points in the absence component can be obtained in different ways. The simple way is to randomly sample some points from the study area. We further assume that if we put both the presence and absence components together, the former takes a proportion $q$ and the latter takes a proportion $1 - q$. Then, the contingency table can be constructed as follows:

\[
\begin{array}{l|cc|c}
\text{Predicted} & \text{Observed presence} & \text{Observed absence} & \text{Total} \\ \hline
\text{Presence} & qSe & (1-q)[sSe + (1-s)(1-Sp)] & Se - (1-q)(1-s)(Se + Sp - 1) \\
\text{Absence} & q(1-Se) & (1-q)[s(1-Se) + (1-s)Sp] & 1 - Se + (1-q)(1-s)(Se + Sp - 1) \\ \hline
\text{Total} & q & 1-q & 1
\end{array}
\]

If we change the parameters in this table such that $q = p - r$ and $s = r/(1 - p + r)$, this table will be changed exactly to the previous table. In the following, our explanation will be based only on the first table.

1. Maximizing OA

\[
OA' = (p-r)Se + r(1-Se) + (1-p)Sp = pSe + (1-p)Sp + r(1-2Se) = OA + r(1-2Se).
\]

Thus, maximizing $OA'$ may be inconsistent with maximizing $OA$.

2. Minimizing the difference between sensitivity and specificity

Because

\[
\begin{aligned}
DSS' = Se' - Sp' &= Se - \frac{r(1-Se) + (1-p)Sp}{1-p+r} \\
&= Se - Sp + (Se + Sp - 1)\,\frac{r}{1-p+r} \\
&= \frac{1-p}{1-p+r}(Se - Sp) + \frac{r}{1-p+r}(2Se - 1) \\
&= \frac{1-p}{1-p+r}\,DSS + \frac{r}{1-p+r}(2Se - 1),
\end{aligned}
\]

if $r \neq 0$, minimizing $DSS' = Se' - Sp'$ may be inconsistent with minimizing $DSS = Se - Sp$ because $Se - Sp$ and $Se$ might not be minimized at the same time. This means that with the sensitivity–specificity difference minimization method, by using presence-only data we might not get the same threshold as that by using presence/absence data.

3. Maximizing kappa

\[
\mathrm{kappa} = \frac{OA - EA}{1 - EA},
\]

where

\[
\begin{aligned}
OA &= pSe + (1-p)Sp = p_{+1} - (1-p) + 2(1-p)Sp, \\
p_{+1} &= pSe + (1-p)(1-Sp), \\
EA &= p\,p_{+1} + (1-p)(1-p_{+1}) = 1 - p + (2p-1)p_{+1}.
\end{aligned}
\]

Thus,

\[
\begin{aligned}
OA - EA &= -2(1-p) + 2(1-p)p_{+1} + 2(1-p)Sp \\
&= 2(1-p)(p_{+1} + Sp - 1) \\
&= 2(1-p)[pSe + (1-p)(1-Sp) + Sp - 1] \\
&= 2p(1-p)(Se + Sp - 1) \\
&= 2p(1-p)\,TSS.
\end{aligned}
\]
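The identities derived up to this point can be checked numerically. The following is a minimal sketch in Python, assuming only the population tables above; the function names `check_identities` and `max_identity_error` are illustrative and not part of the original appendix. It draws random values of p, r, Se and Sp (with 0 ≤ r < p) and measures how far each side of four identities deviates from the other: OA′ = OA + r(1 − 2Se), the DSS′ decomposition, OA − EA = 2p(1 − p)TSS, and the equivalence of the two-sample table with the presence-only table when q = p − r and s = r/(1 − p + r).

```python
import random


def check_identities(p, r, se, sp):
    """Return the largest absolute deviation among four identities in the text."""
    tss = se + sp - 1

    # Presence/absence quantities (table with true presences and absences).
    oa = p * se + (1 - p) * sp
    p_plus1 = p * se + (1 - p) * (1 - sp)      # proportion predicted present
    ea = p * p_plus1 + (1 - p) * (1 - p_plus1)
    dss = se - sp

    # Presence-only quantities. Se' equals Se because
    # p'_11 / p'_1+ = (p - r) * Se / (p - r) = Se.
    oa_po = (p - r) * se + r * (1 - se) + (1 - p) * sp
    sp_po = (r * (1 - se) + (1 - p) * sp) / (1 - p + r)
    dss_po = se - sp_po

    # Two-sample table with q = p - r and s = r / (1 - p + r).
    q = p - r
    s = r / (1 - p + r)
    pseudo_pred_present = (1 - q) * (s * se + (1 - s) * (1 - sp))

    deviations = [
        oa_po - (oa + r * (1 - 2 * se)),
        dss_po - ((1 - p) * dss + r * (2 * se - 1)) / (1 - p + r),
        (oa - ea) - 2 * p * (1 - p) * tss,
        pseudo_pred_present - (r * se + (1 - p) * (1 - sp)),
    ]
    return max(abs(d) for d in deviations)


def max_identity_error(n_trials=1000, seed=0):
    """Largest deviation over random parameter draws with 0 <= r < p < 1."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_trials):
        p = rng.uniform(0.05, 0.95)
        r = rng.uniform(0.0, 0.99 * p)
        se = rng.uniform(0.0, 1.0)
        sp = rng.uniform(0.0, 1.0)
        worst = max(worst, check_identities(p, r, se, sp))
    return worst


if __name__ == "__main__":
    print("largest deviation:", max_identity_error())
```

A deviation at the level of floating-point rounding indicates that the algebra holds for the sampled parameter values; it does not replace the derivation itself.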
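With presence-only data,

\[
\mathrm{kappa}' = \frac{OA' - EA'}{1 - EA'},
\]

where

\[
\begin{aligned}
OA' &= (p-r)Se + (1-p)Sp + r(1-Se) = OA + r(1-2Se), \\
EA' &= (p-r)p_{+1} + (1-p+r)(1-p_{+1}) = EA + r(1-2p_{+1}).
\end{aligned}
\]

Thus,

\[
\begin{aligned}
OA' - EA' &= OA - EA + r[(1-2Se) - (1-2p_{+1})] \\
&= 2p(1-p)\,TSS + 2r(p_{+1} - Se) \\
&= 2p(1-p)\,TSS + 2r[pSe + (1-p)(1-Sp) - Se] \\
&= 2p(1-p)\,TSS - 2r(1-p)\,TSS \\
&= 2(1-p)(p-r)\,TSS,
\end{aligned}
\]

and

\[
\mathrm{kappa}' - \mathrm{kappa} = \frac{OA' - EA'}{1 - EA'} - \frac{OA - EA}{1 - EA} = \frac{(1-EA)(OA' - EA') - (1-EA')(OA - EA)}{(1-EA')(1-EA)} = \frac{U}{V},
\]

where

\[
\begin{aligned}
U &= (1-EA)\,2(1-p)(p-r)\,TSS - (1-EA')\,2p(1-p)\,TSS \\
&= 2(1-p)\,TSS\,[(p-r)(1-EA) - p(1-EA')] \\
&= 2(1-p)\,TSS\,\{(p-r)(1-EA) - p[1 - EA - r(1-2p_{+1})]\} \\
&= 2(1-p)\,TSS\,[-r + r\,EA + pr(1-2p_{+1})] \\
&= 2(1-p)\,TSS\,\{-r + r[1 - p + (2p-1)p_{+1}] + pr(1-2p_{+1})\} \\
&= -2(1-p)\,r\,p_{+1}\,TSS,
\end{aligned}
\]

so that

\[
\mathrm{kappa}' = \mathrm{kappa} + \frac{U}{V} = \mathrm{kappa} - \frac{2(1-p)\,r\,p_{+1}\,TSS}{(1-EA)^2 + (1-EA)(2p_{+1} - 1)\,r}.
\]

Therefore, maximizing $\mathrm{kappa}'$ may be inconsistent with maximizing $\mathrm{kappa}$, because $EA$, $TSS$ and $p_{+1}$ are not constant.

4. Maximizing F

\[
PPV = \frac{pSe}{p_{+1}}, \qquad
PPV' = \frac{(p-r)Se}{p'_{+1}} = \frac{pSe - rSe}{p_{+1}} = \frac{pSe}{p_{+1}} - \frac{rSe}{p_{+1}} = PPV - \frac{rSe}{p_{+1}}.
\]

\[
F = \frac{1}{\frac{1}{2}\left(\frac{1}{PPV} + \frac{1}{Se}\right)} = \frac{2\,Se\,PPV}{Se + PPV} = \frac{2Se\,(pSe/p_{+1})}{Se + pSe/p_{+1}} = \frac{2pSe}{p + p_{+1}},
\]

and

\[
F' = \frac{1}{\frac{1}{2}\left(\frac{1}{PPV'} + \frac{1}{Se'}\right)} = \frac{2\,Se\,PPV'}{Se + PPV'} = \frac{2Se\,(p-r)Se/p_{+1}}{Se + (p-r)Se/p_{+1}} = \frac{2(p-r)Se}{(p-r) + p_{+1}} = \frac{2pSe - 2rSe}{p + p_{+1} - r}.
\]

Thus,

\[
F' = F - \frac{2\,r\,p_{+1}\,Se}{(p + p_{+1})(p + p_{+1} - r)}.
\]

Therefore, maximizing $F'$ may be inconsistent with maximizing $F$.

5. Minimizing the distance between ROC curve and the upper-left corner of the plot square

With presence/absence data, the distance from a point on the ROC curve to the upper-left corner of the plot square is

\[
D_{01} = \sqrt{(1-Se)^2 + (1-Sp)^2}.
\]

The larger the $Se$ and $Sp$, the smaller the $D_{01}$. Generally, minimizing $D_{01}$ will result in higher $Se$ and $Sp$. With presence-only data, however (see Appendix 1 for the explanation of the data composition), the distance becomes

\[
\begin{aligned}
D'_{01} = \sqrt{(1-Se')^2 + (1-Sp')^2}
&= \sqrt{(1-Se)^2 + [1 - r(1-Se) - (1-p)Sp]^2} \\
&= \sqrt{(1-Se)^2 + \{rSe + [(1-r) - (1-p)Sp]\}^2}.
\end{aligned}
\]

Because $r < p$, $1 - r > 1 - p > (1-p)Sp$, and therefore, $(1-r) - (1-p)Sp > 0$.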
When $r$ is small, the situation will be similar to the above, and minimizing $D'_{01}$ will generally produce results similar to those from minimizing $D_{01}$; the smaller $r$ is, the closer the results. When $r$ is large, however, the minimization of $D'_{01}$ depends strongly on $r$ and generally cannot produce results similar to those from minimizing $D_{01}$; in this case, a higher $Se$ cannot be guaranteed and an intermediate-sized $Se$ is usually obtained. The reason is that minimizing $D'_{01}$ requires both $1 - Se$ and $rSe + [(1-r) - (1-p)Sp]$ to be minimized. Minimizing the latter pushes $Sp$ towards its maximum and $Se$ towards its minimum, whereas minimizing the former pushes $Se$ towards its maximum. So there must be a trade-off for $Se$, and an intermediate-sized rather than large $Se$ is finally obtained.

Appendix S2 Examples of simulated species distributions for three levels of prevalence (0.05, 0.25 and 0.75) and two types of species–environment relationships [linear for Mahalanobis distance (MD) and ecological niche factor analysis (ENFA) models and non-linear for generalized additive models (GAM) and random forest (RF) models].

Appendix S3 Difference between the thresholds calculated with manipulated and unmanipulated independent data for three levels of species prevalence (0.05, 0.25 and 0.75) for MD, GAM_POf, RF_PA and RF_POf models from Group 1 simulations (i.e. 1000 different species). See Fig. 1 for more information.