SUPPORTING INFORMATION
Selecting thresholds for the prediction of species occurrence with presence-only
data
Canran Liu, Matt White and Graeme Newell
Journal of Biogeography
Appendix S1 Theoretical explanation for some threshold selection methods
The evaluation data may be collected in two ways: (1) one random sample, or (2) two
separate random samples.
(1) We assume that the data set is composed of points completely randomly sampled
from the study area. Within this data set, the proportion of true presences is p (0 < p <
1), where p is the species’ overall prevalence; the proportion of true absences is 1 − p.
Let Se and Sp represent the true sensitivity and true specificity of the model. For this
data set with true presences and true absences, we can construct a contingency table
as follows:
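                      Observed
Predicted   Presence                             Absence                                   Total
Presence    $p_{11} = p\,\mathit{Se}$            $p_{01} = (1 - p)(1 - \mathit{Sp})$       $p_{+1} = p\,\mathit{Se} + (1 - p)(1 - \mathit{Sp})$
Absence     $p_{10} = p(1 - \mathit{Se})$        $p_{00} = (1 - p)\,\mathit{Sp}$           $p_{+0} = p(1 - \mathit{Se}) + (1 - p)\,\mathit{Sp}$
Total       $p_{1+} = p$                         $p_{0+} = 1 - p$                          $1$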
However, if only some of the true presences are known, making up a proportion p − r
(0 < r < p < 1) of the data set, the remaining true presences, with a proportion r,
are unknown. These unknown presences and all the true absences together are treated
as pseudo-absences, which account for a proportion 1 − p + r altogether. In the
following, we mark metrics calculated with presence-only data with the prime
symbol. A contingency table for the presence-only data with population parameters
can then be constructed as follows:
                      Observed
Predicted   Presence                             Absence                                        Total
Presence    $(p - r)\,\mathit{Se}$               $r\,\mathit{Se} + (1 - p)(1 - \mathit{Sp})$    $p'_{+1} = p\,\mathit{Se} + (1 - p)(1 - \mathit{Sp})$
Absence     $(p - r)(1 - \mathit{Se})$           $r(1 - \mathit{Se}) + (1 - p)\,\mathit{Sp}$    $p'_{+0} = p(1 - \mathit{Se}) + (1 - p)\,\mathit{Sp}$
Total       $p'_{1+} = p - r$                    $p'_{0+} = 1 - p + r$                          $1$
From the above two tables, we can see that p+1 = p′+1 and p+0 = p′+0, which will be
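used in the following derivation. From the presence-only table we can also derive
$\mathit{Se}' = (p - r)\,\mathit{Se} / (p - r) = \mathit{Se}$ and
\[
\mathit{Sp}' = \frac{r(1 - \mathit{Se}) + (1 - p)\,\mathit{Sp}}{1 - p + r} .
\]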
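The two tables can be checked numerically. The following is a minimal Python sketch, not part of the original appendix: the function names are hypothetical and the values of p, r, Se and Sp are illustrative assumptions. It builds both tables with exact fractions and confirms that the predicted-presence total is unchanged, that Se′ = Se, and that Sp′ follows the expression above.

```python
from fractions import Fraction

def pa_table(p, se, sp):
    """Cells keyed (observed, predicted) for presence/absence data; 1 = presence."""
    return {(1, 1): p * se, (0, 1): (1 - p) * (1 - sp),
            (1, 0): p * (1 - se), (0, 0): (1 - p) * sp}

def po_table(p, r, se, sp):
    """Same layout when a proportion r of true presences counts as pseudo-absence."""
    return {(1, 1): (p - r) * se, (0, 1): r * se + (1 - p) * (1 - sp),
            (1, 0): (p - r) * (1 - se), (0, 0): r * (1 - se) + (1 - p) * sp}

def predicted_presence(table):
    """Row total p+1: proportion of points predicted present."""
    return table[(1, 1)] + table[(0, 1)]

def apparent_se_sp(table):
    """Sensitivity and specificity computed against the observed labels."""
    observed_presence = table[(1, 1)] + table[(1, 0)]
    observed_absence = table[(0, 1)] + table[(0, 0)]
    return table[(1, 1)] / observed_presence, table[(0, 0)] / observed_absence

# Illustrative values only (assumptions, not taken from the paper).
p, r, se, sp = Fraction(3, 10), Fraction(1, 10), Fraction(4, 5), Fraction(9, 10)
pa, po = pa_table(p, se, sp), po_table(p, r, se, sp)

assert predicted_presence(pa) == predicted_presence(po)  # p+1 = p'+1
se_prime, sp_prime = apparent_se_sp(po)
assert se_prime == se                                     # Se' = Se
assert sp_prime == (r * (1 - se) + (1 - p) * sp) / (1 - p + r)
print(se_prime, sp_prime)  # 4/5 13/16
```

With these assumed values the apparent specificity drops from 9/10 to 13/16, while the sensitivity is unaffected.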
(2) In reality, the presence-only data set may contain two components, which are
usually sampled separately. One is the presence component containing true presences,
and the other is the absence component containing both true presences and true
absences with proportions s and 1 − s (0 ≤ s < 1), respectively. The points in the
absence component can be obtained in different ways. The simple way is to randomly
sample some points from the study area. We further assume that if we put both the
presence and absence components together, the former takes a proportion q and the
latter takes a proportion 1 − q. Then, the contingency table can be constructed as
follows:
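                      Observed
Predicted   Presence                   Absence                                                           Total
Presence    $q\,\mathit{Se}$           $(1 - q)[s\,\mathit{Se} + (1 - s)(1 - \mathit{Sp})]$              $\mathit{Se} - (1 - q)(1 - s)(\mathit{Se} + \mathit{Sp} - 1)$
Absence     $q(1 - \mathit{Se})$       $(1 - q)[s(1 - \mathit{Se}) + (1 - s)\,\mathit{Sp}]$              $1 - \mathit{Se} + (1 - q)(1 - s)(\mathit{Se} + \mathit{Sp} - 1)$
Total       $q$                        $1 - q$                                                           $1$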
If we set the parameters in this table to q = p − r and s = r / (1 − p + r), it
becomes exactly the previous table. In the following, our explanation is therefore
based only on the tables for case (1).
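The equivalence of the two sampling schemes can be confirmed cell by cell. This is a minimal sketch under assumed, illustrative parameter values; the function names are hypothetical. It substitutes q = p − r and s = r / (1 − p + r) into the case (2) table and compares it with the presence-only table of case (1).

```python
from fractions import Fraction

def one_sample_table(p, r, se, sp):
    """Presence-only table for case (1); rows = predicted, columns = observed."""
    return [[(p - r) * se, r * se + (1 - p) * (1 - sp)],
            [(p - r) * (1 - se), r * (1 - se) + (1 - p) * sp]]

def two_sample_table(q, s, se, sp):
    """Table for case (2): presence share q, true-presence share s in the absence component."""
    return [[q * se, (1 - q) * (s * se + (1 - s) * (1 - sp))],
            [q * (1 - se), (1 - q) * (s * (1 - se) + (1 - s) * sp)]]

# Illustrative values only (assumptions, not taken from the paper).
p, r, se, sp = Fraction(3, 10), Fraction(1, 10), Fraction(4, 5), Fraction(9, 10)
q, s = p - r, r / (1 - p + r)

case1 = one_sample_table(p, r, se, sp)
case2 = two_sample_table(q, s, se, sp)
assert case1 == case2

# Row total for predicted presence, as written in the case (2) table.
assert sum(case2[0]) == se - (1 - q) * (1 - s) * (se + sp - 1)
print(q, s)  # 1/5 1/8
```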
1. Maximizing OA
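\[
\begin{aligned}
\mathrm{OA}' &= (p - r)\,\mathit{Se} + r(1 - \mathit{Se}) + (1 - p)\,\mathit{Sp} \\
&= p\,\mathit{Se} + (1 - p)\,\mathit{Sp} + r(1 - 2\mathit{Se}) = \mathrm{OA} + r(1 - 2\mathit{Se}) .
\end{aligned}
\]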
Thus, maximizing OA′ may be inconsistent with maximizing OA.
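A small numerical illustration may help. The sketch below assumes a hypothetical model described by four (threshold, Se, Sp) triples and assumed values p = 0.4 and r = 0.3; none of these numbers come from the paper. It checks the identity OA′ = OA + r(1 − 2Se) at each threshold and shows that the two criteria can pick different thresholds.

```python
from fractions import Fraction

# Hypothetical (threshold, Se, Sp) triples for one model; illustrative only.
CURVE = [("0.2", "0.95", "0.50"), ("0.4", "0.85", "0.80"),
         ("0.6", "0.60", "0.92"), ("0.8", "0.30", "0.99")]
CURVE = [tuple(Fraction(x) for x in row) for row in CURVE]

def oa(p, se, sp):
    """Overall accuracy with presence/absence data."""
    return p * se + (1 - p) * sp

def oa_prime(p, r, se, sp):
    """Overall accuracy when a proportion r of presences counts as pseudo-absence."""
    return (p - r) * se + r * (1 - se) + (1 - p) * sp

def best_threshold(score):
    """Threshold on CURVE with the largest score(se, sp)."""
    return max(CURVE, key=lambda row: score(row[1], row[2]))[0]

p, r = Fraction("0.4"), Fraction("0.3")  # assumed prevalence and unknown-presence share

for _, se, sp in CURVE:
    assert oa_prime(p, r, se, sp) == oa(p, se, sp) + r * (1 - 2 * se)

t_pa = best_threshold(lambda se, sp: oa(p, se, sp))
t_po = best_threshold(lambda se, sp: oa_prime(p, r, se, sp))
print(float(t_pa), float(t_po))  # 0.4 0.8
```

Because the extra term r(1 − 2Se) rewards low sensitivity, the presence-only criterion drifts towards the higher threshold in this example.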
2. Minimizing the difference between sensitivity and specificity
Because
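\[
\begin{aligned}
\mathrm{DSS}' &= \mathit{Se}' - \mathit{Sp}' = \mathit{Se} - \frac{r(1 - \mathit{Se}) + (1 - p)\,\mathit{Sp}}{1 - p + r} \\
&= \mathit{Se} - \mathit{Sp} + (\mathit{Se} + \mathit{Sp} - 1)\,\frac{r}{1 - p + r} \\
&= \frac{1 - p}{1 - p + r}\,(\mathit{Se} - \mathit{Sp}) + \frac{r}{1 - p + r}\,(2\mathit{Se} - 1) \\
&= \frac{1 - p}{1 - p + r}\,\mathrm{DSS} + \frac{r}{1 - p + r}\,(2\mathit{Se} - 1),
\end{aligned}
\]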
if r ≠ 0, minimizing DSS′ = Se′ − Sp′ may be inconsistent with minimizing DSS = Se
− Sp, because Se − Sp and Se might not be minimized at the same time. This means
that, with the sensitivity–specificity difference minimization method,
presence-only data might not yield the same threshold as presence/absence data.
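The decomposition of DSS′ can be checked in the same way. This is a minimal sketch on an assumed, hypothetical set of (threshold, Se, Sp) triples with assumed p = 0.4 and r = 0.3; the absolute difference is used as the quantity to minimize, which is an interpretation rather than something stated in the derivation.

```python
from fractions import Fraction

# Hypothetical (threshold, Se, Sp) triples for one model; illustrative only.
CURVE = [("0.2", "0.95", "0.50"), ("0.4", "0.85", "0.80"),
         ("0.6", "0.60", "0.92"), ("0.8", "0.30", "0.99")]
CURVE = [tuple(Fraction(x) for x in row) for row in CURVE]

def sp_prime(p, r, se, sp):
    """Apparent specificity with presence-only data."""
    return (r * (1 - se) + (1 - p) * sp) / (1 - p + r)

def dss(se, sp):
    return se - sp

def dss_prime(p, r, se, sp):
    return se - sp_prime(p, r, se, sp)  # Se' equals Se

def closest_to_zero(diff):
    """Threshold on CURVE where |diff(se, sp)| is smallest."""
    return min(CURVE, key=lambda row: abs(diff(row[1], row[2])))[0]

p, r = Fraction("0.4"), Fraction("0.3")  # assumed values
w = 1 - p + r

for _, se, sp in CURVE:
    expected = (1 - p) / w * dss(se, sp) + r / w * (2 * se - 1)
    assert dss_prime(p, r, se, sp) == expected

t_pa = closest_to_zero(dss)
t_po = closest_to_zero(lambda se, sp: dss_prime(p, r, se, sp))
print(float(t_pa), float(t_po))  # 0.4 0.6
```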
3. Maximizing kappa
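\[
\mathrm{kappa} = \frac{\mathrm{OA} - \mathrm{EA}}{1 - \mathrm{EA}},
\]
where
\[
\mathrm{OA} = p\,\mathit{Se} + (1 - p)\,\mathit{Sp} = p_{+1} - (1 - p) + 2(1 - p)\,\mathit{Sp},
\]
\[
p_{+1} = p\,\mathit{Se} + (1 - p)(1 - \mathit{Sp}),
\]
and
\[
\mathrm{EA} = p\,p_{+1} + (1 - p)(1 - p_{+1}) = 1 - p + (2p - 1)\,p_{+1} .
\]
Thus,
\[
\begin{aligned}
\mathrm{OA} - \mathrm{EA} &= -2(1 - p) + 2(1 - p)\,p_{+1} + 2(1 - p)\,\mathit{Sp} \\
&= 2(1 - p)(p_{+1} + \mathit{Sp} - 1) \\
&= 2(1 - p)[p\,\mathit{Se} + (1 - p)(1 - \mathit{Sp}) + \mathit{Sp} - 1] \\
&= 2p(1 - p)(\mathit{Se} + \mathit{Sp} - 1) \\
&= 2p(1 - p)\,\mathrm{TSS} .
\end{aligned}
\]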
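\[
\mathrm{kappa}' = \frac{\mathrm{OA}' - \mathrm{EA}'}{1 - \mathrm{EA}'},
\]
where
\[
\mathrm{OA}' = (p - r)\,\mathit{Se} + (1 - p)\,\mathit{Sp} + r(1 - \mathit{Se}) = \mathrm{OA} + r(1 - 2\mathit{Se}),
\]
\[
\mathrm{EA}' = (p - r)\,p_{+1} + (1 - p + r)(1 - p_{+1}) = \mathrm{EA} + r(1 - 2p_{+1}) .
\]
Thus,
\[
\begin{aligned}
\mathrm{OA}' - \mathrm{EA}' &= \mathrm{OA} - \mathrm{EA} + r[(1 - 2\mathit{Se}) - (1 - 2p_{+1})] \\
&= 2p(1 - p)\,\mathrm{TSS} + 2r(p_{+1} - \mathit{Se}) \\
&= 2p(1 - p)\,\mathrm{TSS} + 2r[p\,\mathit{Se} + (1 - p)(1 - \mathit{Sp}) - \mathit{Se}] \\
&= 2p(1 - p)\,\mathrm{TSS} - 2r(1 - p)\,\mathrm{TSS} \\
&= 2(1 - p)(p - r)\,\mathrm{TSS},
\end{aligned}
\]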
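and
\[
\mathrm{kappa}' - \mathrm{kappa}
= \frac{\mathrm{OA}' - \mathrm{EA}'}{1 - \mathrm{EA}'} - \frac{\mathrm{OA} - \mathrm{EA}}{1 - \mathrm{EA}}
= \frac{(1 - \mathrm{EA})(\mathrm{OA}' - \mathrm{EA}') - (1 - \mathrm{EA}')(\mathrm{OA} - \mathrm{EA})}{(1 - \mathrm{EA})(1 - \mathrm{EA}')}
= \frac{U}{V},
\]
where
\[
\begin{aligned}
U &= (1 - \mathrm{EA})\,2(1 - p)(p - r)\,\mathrm{TSS} - (1 - \mathrm{EA}')\,2p(1 - p)\,\mathrm{TSS} \\
&= 2(1 - p)\,\mathrm{TSS}\,[(p - r)(1 - \mathrm{EA}) - p(1 - \mathrm{EA}')] \\
&= 2(1 - p)\,\mathrm{TSS}\,\{(p - r)(1 - \mathrm{EA}) - p[1 - \mathrm{EA} - r(1 - 2p_{+1})]\} \\
&= 2(1 - p)\,\mathrm{TSS}\,[-r + r\,\mathrm{EA} + pr(1 - 2p_{+1})] \\
&= 2(1 - p)\,\mathrm{TSS}\,\{-r + r[1 - p + (2p - 1)\,p_{+1}] + pr(1 - 2p_{+1})\} \\
&= -2(1 - p)\,r\,p_{+1}\,\mathrm{TSS}
\end{aligned}
\]
and $V = (1 - \mathrm{EA})(1 - \mathrm{EA}') = (1 - \mathrm{EA})^2 + (1 - \mathrm{EA})(2p_{+1} - 1)\,r$, so that
\[
\mathrm{kappa}' = \mathrm{kappa} + \frac{U}{V}
= \mathrm{kappa} - \frac{2(1 - p)\,r\,p_{+1}\,\mathrm{TSS}}{(1 - \mathrm{EA})^2 + (1 - \mathrm{EA})(2p_{+1} - 1)\,r} .
\]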
Therefore, maximizing kappa′ may be inconsistent with maximizing kappa, because
EA, TSS and p+1 are not constant.
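The kappa identities can be verified with exact arithmetic. The following is a minimal sketch, assuming illustrative values of p, r, Se and Sp that are not taken from the paper; the helper name is hypothetical. It checks OA − EA = 2p(1 − p)TSS, OA′ − EA′ = 2(1 − p)(p − r)TSS, and the closed form of kappa′ − kappa.

```python
from fractions import Fraction

def kappa_terms(obs_presence, pred_presence, oa):
    """Return (EA, kappa) from the two presence marginals and overall accuracy."""
    ea = obs_presence * pred_presence + (1 - obs_presence) * (1 - pred_presence)
    return ea, (oa - ea) / (1 - ea)

# Illustrative values only (assumptions, not taken from the paper).
p, r, se, sp = Fraction(3, 10), Fraction(1, 10), Fraction(4, 5), Fraction(9, 10)
tss = se + sp - 1
p_plus1 = p * se + (1 - p) * (1 - sp)   # identical for both data types

oa = p * se + (1 - p) * sp
oa_po = oa + r * (1 - 2 * se)           # OA' from section 1
ea, kappa = kappa_terms(p, p_plus1, oa)
ea_po, kappa_po = kappa_terms(p - r, p_plus1, oa_po)

assert oa - ea == 2 * p * (1 - p) * tss
assert oa_po - ea_po == 2 * (1 - p) * (p - r) * tss

u = -2 * (1 - p) * r * p_plus1 * tss
v = (1 - ea) ** 2 + (1 - ea) * (2 * p_plus1 - 1) * r
assert kappa_po - kappa == u / v
print(round(float(kappa), 4), round(float(kappa_po), 4))  # 0.6934 0.5078
```

The size of the gap depends on EA, TSS and p+1, all of which vary with the threshold, which is why the two maxima need not coincide.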
4. Maximizing F
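\[
\mathrm{PPV} = \frac{p\,\mathit{Se}}{p_{+1}},
\]
\[
\mathrm{PPV}' = \frac{(p - r)\,\mathit{Se}}{p'_{+1}} = \frac{p\,\mathit{Se} - r\,\mathit{Se}}{p_{+1}}
= \frac{p\,\mathit{Se}}{p_{+1}} - \frac{r\,\mathit{Se}}{p_{+1}} = \mathrm{PPV} - \frac{r\,\mathit{Se}}{p_{+1}},
\]
\[
F = \frac{1}{\frac{1}{2}\left(\frac{1}{\mathrm{PPV}} + \frac{1}{\mathit{Se}}\right)}
= \frac{2\,\mathit{Se}\,\mathrm{PPV}}{\mathit{Se} + \mathrm{PPV}}
= \frac{2\,\mathit{Se}\,(p\,\mathit{Se} / p_{+1})}{\mathit{Se} + p\,\mathit{Se} / p_{+1}}
= \frac{2p\,\mathit{Se}}{p + p_{+1}},
\]
and
\[
F' = \frac{1}{\frac{1}{2}\left(\frac{1}{\mathrm{PPV}'} + \frac{1}{\mathit{Se}'}\right)}
= \frac{2\,\mathit{Se}'\,\mathrm{PPV}'}{\mathit{Se}' + \mathrm{PPV}'}
= \frac{2\,\mathit{Se}\,(p - r)\,\mathit{Se} / p_{+1}}{\mathit{Se} + (p - r)\,\mathit{Se} / p_{+1}}
= \frac{2(p - r)\,\mathit{Se}}{(p - r) + p_{+1}}
= \frac{2p\,\mathit{Se} - 2r\,\mathit{Se}}{p + p_{+1} - r} .
\]
Thus,
\[
F' - F = -\frac{2r\,p_{+1}\,\mathit{Se}}{(p + p_{+1})(p + p_{+1} - r)} .
\]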
Therefore, maximizing F′ may be inconsistent with maximizing F.
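As a check on the algebra, the sketch below evaluates F and F′ with exact fractions under assumed, illustrative values of p, r, Se and Sp (not from the paper) and compares their difference with the closed form, written here as F′ − F so that the sign is explicit.

```python
from fractions import Fraction

def f_measure(ppv, se):
    """Harmonic mean of positive predictive value and sensitivity."""
    return 2 * se * ppv / (se + ppv)

# Illustrative values only (assumptions, not taken from the paper).
p, r, se, sp = Fraction(3, 10), Fraction(1, 10), Fraction(4, 5), Fraction(9, 10)
p_plus1 = p * se + (1 - p) * (1 - sp)   # predicted-presence total, same for both data types

ppv = p * se / p_plus1
ppv_po = (p - r) * se / p_plus1          # PPV' with presence-only data
f, f_po = f_measure(ppv, se), f_measure(ppv_po, se)

assert ppv_po == ppv - r * se / p_plus1
assert f == 2 * p * se / (p + p_plus1)
assert f_po == 2 * (p - r) * se / (p - r + p_plus1)
assert f_po - f == -2 * r * p_plus1 * se / ((p + p_plus1) * (p + p_plus1 - r))
print(f, f_po)  # 48/61 32/51
```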
5. Minimizing the distance between ROC curve and the upper-left corner of the
plot square
With presence/absence data, the distance from a point on the ROC curve to the
upper-left corner of the plot square is
$D_{01} = \sqrt{(1 - \mathit{Se})^2 + (1 - \mathit{Sp})^2}$. The larger the Se and Sp,
the smaller the D01. Generally, minimizing D01 will result in higher Se and Sp.
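With presence-only data, however (see Appendix 1 for the explanation of the data
composition), the distance becomes
\[
\begin{aligned}
D'_{01} &= \sqrt{(1 - \mathit{Se}')^2 + (1 - \mathit{Sp}')^2} \\
&= \sqrt{(1 - \mathit{Se})^2 + [1 - r(1 - \mathit{Se}) - (1 - p)\,\mathit{Sp}]^2} \\
&= \sqrt{(1 - \mathit{Se})^2 + \{r\,\mathit{Se} + [(1 - r) - (1 - p)\,\mathit{Sp}]\}^2} .
\end{aligned}
\]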
Because r < p, 1 − r > 1 − p > (1 − p) Sp, and therefore, (1 − r) − (1 − p) Sp > 0.
When r is small, the situation is similar to the above, and minimizing D′01
generally gives results similar to those from minimizing D01; the smaller the r, the
closer the results. When r is large, however, the minimization of D′01 depends
strongly on r and generally cannot produce results similar to those from minimizing
D01; in this case, a higher Se cannot be guaranteed and an intermediate-sized Se is
usually obtained. The reason is that minimizing D′01 requires both 1 − Se and
r Se + [(1 − r) − (1 − p) Sp] to be minimized. The latter is minimized by
maximizing Sp and minimizing Se, whereas the former is minimized by maximizing Se.
So there must be a trade-off for Se, and an intermediate-sized rather than large Se
is finally obtained.
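The dependence on r can be illustrated numerically. This is a minimal sketch on a hypothetical set of (threshold, Se, Sp) triples with an assumed prevalence p = 0.75; it computes D′01 directly from Se′ = Se and from the apparent specificity Sp′ = [r(1 − Se) + (1 − p)Sp] / (1 − p + r) given for case (1), and compares a small with a large r.

```python
import math

# Hypothetical (threshold, Se, Sp) triples for one model; illustrative only.
CURVE = [(0.2, 0.95, 0.50), (0.4, 0.85, 0.80), (0.6, 0.60, 0.92), (0.8, 0.30, 0.99)]

def d01(se, sp):
    """Distance from the ROC point to the upper-left corner, presence/absence data."""
    return math.hypot(1 - se, 1 - sp)

def d01_prime(p, r, se, sp):
    """Same distance using the apparent specificity from presence-only data."""
    sp_prime = (r * (1 - se) + (1 - p) * sp) / (1 - p + r)
    return math.hypot(1 - se, 1 - sp_prime)

def best_threshold(dist):
    """Threshold on CURVE with the smallest dist(se, sp)."""
    return min(CURVE, key=lambda row: dist(row[1], row[2]))[0]

p = 0.75  # assumed prevalence
t_pa = best_threshold(d01)
t_small_r = best_threshold(lambda se, sp: d01_prime(p, 0.05, se, sp))
t_large_r = best_threshold(lambda se, sp: d01_prime(p, 0.60, se, sp))
print(t_pa, t_small_r, t_large_r)  # 0.4 0.4 0.6
```

In this example a small r leaves the selected threshold unchanged, whereas a large r moves it to a point with lower Se (0.60 instead of 0.85), in line with the trade-off described above.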
Appendix S2 Examples of simulated species distributions for three levels of
prevalence (0.05, 0.25 and 0.75) and two types of species–environment relationships
[linear for Mahalanobis distance (MD) and ecological niche factor analysis (ENFA)
models and non-linear for generalized additive models (GAM) and random forest (RF)
models].
Appendix S3 Difference between the thresholds calculated with manipulated and
unmanipulated independent data for three levels of species prevalence (0.05, 0.25 and
0.75) for MD, GAM_POf, RF_PA and RF_POf models from Group 1 simulations (i.e.
1000 different species). See Fig. 1 for more information.