PERFORMANCE EVALUATION OF AUTOMATIC SPEAKER RECOGNITION SCHEMES

D. Venugopal and V.V.S. Sarma
Indian Institute of Science, Bangalore 560012

ABSTRACT

A mathematical formulation of an automatic speaker verification scheme as a two-class pattern recognition problem is presented. Expressions for the expected values and the variance of the design-set and the test-set error rates are derived. The bound on the performance of an automatic speaker identification system as a cascade of independent verification systems is derived. The implications of these results in the design of an automatic speaker recognition system are discussed.

I INTRODUCTION

The problem of performance evaluation of any automatic speaker verification system (ASVS) is yet to be satisfactorily solved. In the general pattern recognition literature, performance estimation has received considerable notice recently, and the importance of these results in the design of an ASVS is discussed in a recent note of the authors [1]. In this paper, an ASVS is analysed as a 2-class pattern recognition problem to bring out explicitly the effect of the various parameters on the performance estimation, following the mathematical model of a pattern verification system provided by Dixon [2]. The expected error rates for both the design-set and the test-set are derived as a function of the number of samples per speaker (N), the number of features (L), the number of customers (M), the number of design impostors (K) and the Mahalanobis distance ($\Delta$) between the classes under Gaussian assumptions. The variance of the design-set error rate is derived, bringing out the importance of choosing a sufficient number of impostors at the system design stage. The expected error rates for an automatic speaker identification system (ASIS) as a cascade of M independent ASVS are derived. The importance of these results in the design of an automatic speaker recognition system is pointed out.
II MATHEMATICAL MODEL FOR ASVS

In an ASVS, a given speaker, not necessarily belonging to the system, utters a predetermined phrase. He also presents a label claiming that he is a particular "customer" belonging to the system. In the system, a predetermined set of features, possibly depending on the label entered, is extracted from the utterance and the speaker is either accepted or rejected.

Let the ASVS be designed for a set of M known customers $S = \{S_i,\ i = 1, \ldots, M\}$ and a single alien group $S_0$ of all unknown speakers (rest of the world). This notion of an alien group of speakers distinguishes an ASVS from an ASIS and is essential, as there is always a chance of a person not belonging to the set S trying to impersonate one of S. Corresponding to each member of S there is a label $L_j$ ($j = 1, \ldots, M$). If any speaker wants to be verified under the label $L_j$, then we define two classes: the class $C_j$ consisting of the speaker $S_j$, and the impostor class $\bar{C}_j$ consisting of the remaining (M-1) speakers and the alien group $S_0$:

$$C_j = \{S_j\}, \qquad \bar{C}_j = \{S_0,\ S_i,\ i = 1, \ldots, M,\ i \neq j\}.$$

The system, on the basis of the feature vector X of dimension L, will assign the speaker to $C_j$ or $\bar{C}_j$. The accept/reject rule using the optimal Bayesian classifier is to accept the claim of a particular speaker as valid if

$$P(C_j \mid L_j, X) > P(\bar{C}_j \mid L_j, X) \tag{1}$$

and reject otherwise. The a posteriori probabilities are given by

$$P(C_j \mid L_j, X) = \frac{p(X \mid L_j, C_j)\, p(L_j \mid C_j)\, P(C_j)}{p(L_j, X)} \tag{2a}$$

$$P(\bar{C}_j \mid L_j, X) = \frac{p(X \mid L_j, \bar{C}_j)\, p(L_j \mid \bar{C}_j)\, P(\bar{C}_j)}{p(L_j, X)} \tag{2b}$$

where the probabilities have the usual meanings: $p(L_j \mid C_j)$ is the probability of speaker $S_j$ wanting to be verified under his label $L_j$, and $p(L_j \mid \bar{C}_j)$ is the probability of an "impostor" trying to present label $L_j$. While the actual values of the various probabilities of (2a) and (2b) depend upon the conditions in a particular environment, we may assume in most cases

$$p(L_j \mid C_j) > p(L_j \mid \bar{C}_j) \quad \text{and} \quad P(\bar{C}_j) > P(C_j),$$

assuming equal a priori probabilities for all speakers. A reasonable assumption considering the above inequalities is

$$p(L_j \mid C_j)\, P(C_j) = p(L_j \mid \bar{C}_j)\, P(\bar{C}_j).$$

Again, it may be postulated that in the presence of the knowledge of class $C_j$ the feature vector is independent of the label $L_j$, i.e. $p(X \mid L_j, C_j) = p(X \mid C_j)$. The decision rule can now be rewritten as: accept if

$$p(X \mid C_j) > p(X \mid \bar{C}_j) \tag{3}$$

and reject otherwise. The class-conditional densities in (3) are given by

$$p(X \mid C_j) = p(X \mid S_j), \quad j = 1, \ldots, M; \qquad p(X \mid \bar{C}_j) = \sum_{i=0,\ i \neq j}^{M} p(X \mid S_i)\, P(S_i) \tag{4}$$

where $p(X \mid S_i)$, $i = 0, 1, \ldots, M$ are the speaker-conditional densities of the feature vector X. If the distributions $p(X \mid S_i)$, $i = 0, \ldots, M$ are completely known, then the error rate can be calculated exactly. Otherwise $p(X \mid S_i)$, $i = 1, \ldots, M$ are to be estimated from N labelled samples of each speaker of set S, whereas $p(X \mid S_0)$ is to be estimated from a set of "impostor references" (of speakers of $S_0$). Just as the number of training samples per class is finite, the number of impostors (K) that can be considered to represent the group $S_0$ is also finite.

Nature of the Class-Conditional Densities:

Assumption: $p(X \mid C_j) \sim N(\mu_j, \Sigma_j)$ and $p(X \mid \bar{C}_j) \sim N(\bar{\mu}_j, \bar{\Sigma}_j)$, where $\Sigma_j = \Sigma$, $\bar{\Sigma}_j = \beta\Sigma$ and $p = M + K - 1$. $\Sigma_j$ and $\bar{\Sigma}_j$ are the known covariance matrices of the two classes $C_j$ and $\bar{C}_j$, and the means $\mu_j$ and $\bar{\mu}_j$ are to be estimated from N design samples of class $C_j$ and pN design samples of class $\bar{C}_j$.

Remark: It may not be unreasonable to assume $p(X \mid S_i) = p(X \mid C_i)$ to be Gaussian. Then $p(X \mid \bar{C}_j)$ is a finite mixture of Gaussians. This will be a multimodal distribution for small M and K. Again, it may not be unreasonable, for large M and K, to fit a Gaussian distribution to samples of class $\bar{C}_j$.

Classifier: For notational convenience, we denote $C_j$ by $C_1$ and $\bar{C}_j$ by $C_2$, with $p(X \mid C_1) \sim N(\mu_1, \Sigma)$ and $p(X \mid C_2) \sim N(\mu_2, \beta\Sigma)$. For this case of unequal means and covariance matrices the minimum probability of error classifier is the one using a quadratic discriminant function. In this paper, the minimax linear discriminant with equal error rate is used for further analysis. We define a linear discriminant

$$d = (\Sigma^*)^{-1}(\hat{\mu}_1 - \hat{\mu}_2), \qquad \Sigma^* = t\,\Sigma_1 + (1 - t)\,\Sigma_2 \tag{5}$$

where $\Sigma_1 = \Sigma$ and $\Sigma_2 = \beta\Sigma$ are the covariance matrices of $C_1$ and $C_2$, t ($0 \le t \le 1$) is a parameter of our choice, and

$$\hat{\mu}_1 = \frac{1}{N} \sum_{j=1}^{N} X_{1j}, \qquad \hat{\mu}_2 = \frac{1}{pN} \sum_{j=1}^{pN} X_{2j},$$

where $X_{ij}$ is the jth labelled sample of class i. For equal error rate the threshold $\theta$ is given by

$$\theta = \frac{d'\,(\beta^{1/2}\hat{\mu}_1 + \hat{\mu}_2)}{\beta^{1/2} + 1} \tag{6}$$

Therefore, a sample X is assigned to class $C_1$ if $d'X \ge \theta$, i.e.

$$T = d'X - \theta \ge 0 \tag{7}$$

and to class $C_2$ otherwise.

III TEST AND DESIGN-SET ERROR RATES FOR AN ASVS

The ASVS may be tested either by new utterances of the speakers belonging to S and $S_0$ (test set) or by the sample utterances used for designing the system (design set).

Test-set error rate: The expected test-set error rate ($\varepsilon_T$) may be written from (7) as

$$\varepsilon_T = \mathrm{pr}\,[\,d'X - \theta < 0\,] \tag{8}$$

where X is the feature vector corresponding to an arbitrary new utterance from class $C_1$.

Proposition 1: $\varepsilon_T$ in (8) may be expressed as the probability of the ratio of two noncentral $\chi^2$ variates $\omega_1$ and $\omega_2$ being greater than the quantity $(1 - \rho_1)/(1 + \rho_1)$:

$$\varepsilon_T = \mathrm{pr}\left[\frac{\omega_1}{\omega_2} > \frac{1 - \rho_1}{1 + \rho_1}\right] \tag{9}$$

where $\omega_1$ and $\omega_2$ are distributed as $\chi^2(L, \lambda_1)$ and $\chi^2(L, \lambda_2)$. The noncentrality parameters $\lambda_1$, $\lambda_2$ and the coefficient $\rho_1$ depend on N, p and $\Delta^2$, the Mahalanobis squared distance between the two populations [explicit expressions illegible in the source].

Proof: (The proof follows that of Moran [4].) We define two random vectors u and v [explicit definitions illegible in the source] such that $\varepsilon_T$ can be written as

$$\varepsilon_T = \mathrm{pr}(u'v > 0) = \mathrm{pr}\,[\,(u+v)'(u+v) - (u-v)'(u-v) > 0\,].$$

$(u+v)$ and $(u-v)$ are distributed independently with dispersion matrices $2(1+\rho_1)I$ and $2(1-\rho_1)I$, where $\rho_1$ is the correlation coefficient between the corresponding pair of elements of u and v, and I is the $L \times L$ unit matrix. Thus, if $\omega_1 = (u+v)'(u+v)/[2(1+\rho_1)]$ and $\omega_2 = (u-v)'(u-v)/[2(1-\rho_1)]$, eqn. (9) follows. Detailed proof is given in reference 5.

Proposition 2: It is also possible to express the expected error rate $\varepsilon_T$ in a closed form expression

$$\varepsilon_T = Q(L, \lambda_1, \lambda_2, \rho_1) \tag{10}$$

where $Q(\cdot)$ is defined by eqn. (11) [expression illegible in the source] in terms of $c(\theta_1, \theta_2)$, the circular coverage function, $I_m(z)$, the modified Bessel function of the first kind, and $B_x(p, q)$, the incomplete $\beta$-function. The upper sign is for m > 0 and the lower sign for m < 0.
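The classifier of (5)-(7) and the expected test-set error rate of (8) can be checked numerically. The following is a minimal Monte Carlo sketch and not the authors' procedure: it assumes $\Sigma = I$ (so that $\Sigma^*$ is a multiple of the identity), places the two class means a distance `delta` apart along the first feature axis, and reads the threshold (6) as $\theta = d'(\beta^{1/2}\hat{\mu}_1 + \hat{\mu}_2)/(\beta^{1/2} + 1)$. The function names and the parameters `trials`, `n_test` and `seed` are illustrative and do not appear in the paper.

```python
import math
import random


def design_classifier(x1, x2, beta=1.0, t=0.5):
    """Linear discriminant (5) and equal-error threshold (6), assuming Sigma = I."""
    L = len(x1[0])
    mu1 = [sum(x[k] for x in x1) / len(x1) for k in range(L)]
    mu2 = [sum(x[k] for x in x2) / len(x2) for k in range(L)]
    scale = t + (1.0 - t) * beta  # Sigma* = scale * I when Sigma = I
    d = [(a - b) / scale for a, b in zip(mu1, mu2)]
    rb = math.sqrt(beta)
    theta = sum(dk * (rb * a + b) for dk, a, b in zip(d, mu1, mu2)) / (rb + 1.0)
    return d, theta


def sample(rng, mean, std, n):
    """Draw n feature vectors from N(mean, std**2 * I)."""
    return [[m + std * rng.gauss(0.0, 1.0) for m in mean] for _ in range(n)]


def estimate_test_set_error(L, N, p, delta, beta=1.0,
                            trials=100, n_test=200, seed=0):
    """Average, over independent design sets, of the error on new class-C1 samples."""
    rng = random.Random(seed)
    mu1 = [delta] + [0.0] * (L - 1)   # class C1 mean (assumed layout)
    mu2 = [0.0] * L                   # class C2 mean (assumed layout)
    total = 0.0
    for _ in range(trials):
        design1 = sample(rng, mu1, 1.0, N)                    # N samples of C1
        design2 = sample(rng, mu2, math.sqrt(beta), p * N)    # pN samples of C2
        d, theta = design_classifier(design1, design2, beta)
        test = sample(rng, mu1, 1.0, n_test)                  # new utterances of C1
        # Rule (7): X goes to C1 if d'X >= theta, so d'X < theta is an error.
        wrong = sum(1 for x in test
                    if sum(a * b for a, b in zip(d, x)) < theta)
        total += wrong / n_test
    return total / trials


if __name__ == "__main__":
    print(estimate_test_set_error(L=10, N=3, p=1, delta=2.0))
    print(estimate_test_set_error(L=10, N=100, p=1, delta=2.0))
```

With $\beta = 1$ the error rate for large N approaches $\Phi(-\Delta/2)$, where $\Phi$ is the standard normal distribution function. For small N/L the simulated value comes out larger, which is the pessimistic bias of the test-set error rate.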
Design-set error rate: The expected design-set error rate may be written from eqn. (7) as

$$\varepsilon_D = \mathrm{pr}\,[\,(\hat{\mu}_1 - \hat{\mu}_2)'(\Sigma^*)^{-1} X_1 - \theta < 0\,] \tag{12}$$

where $X_1$ is the feature vector corresponding to an arbitrary utterance from the design set of class $C_1$.

Proposition 3: $\varepsilon_D$ in (12) may be expressed as the probability of the ratio of two noncentral $\chi^2$ variates $\omega_3$ and $\omega_4$ being greater than the quantity $(1 - \rho_2)/(1 + \rho_2)$:

$$\varepsilon_D = \mathrm{pr}\left[\frac{\omega_3}{\omega_4} > \frac{1 - \rho_2}{1 + \rho_2}\right] \tag{13}$$

where $\omega_3$ and $\omega_4$ are distributed as $\chi^2(L, \lambda_3)$ and $\chi^2(L, \lambda_4)$ [explicit expressions for $\lambda_3$, $\lambda_4$ and $\rho_2$ illegible in the source].

Proof: The proof is similar to that of Proposition 1.

Proposition 4: It is also possible to express the expected error rate $\varepsilon_D$ in a closed form expression

$$\varepsilon_D = Q(L, \lambda_3, \lambda_4, \rho_2) \tag{14}$$

where $Q(\cdot)$ is the same function as defined in (11), with $\lambda_1$, $\lambda_2$ and $\rho_1$ replaced by $\lambda_3$, $\lambda_4$ and $\rho_2$ respectively.

Proposition 5: The variance of the random variable $e_D$, the design-set error rate, is given by

$$\mathrm{var}(e_D) = \frac{\varepsilon_D\,(1 - \varepsilon_D)}{(p + 1)\,N} \tag{15}$$

The proof follows that of Foley [6] and is given in reference 5.

Fig. 1 - Expected Error Rates as Function of N/L (curves: test set, design set)

Fig. 2 - Expected Error Rates as Functions of Population Size

In Fig. 1 and 2 the values of $\varepsilon_T$ and $\varepsilon_D$ are plotted as functions of N/L and p. Fig. 1 gives the nature of the biases that creep into the estimates for small N/L. The test-set error rate is pessimistically biased and the design-set error rate is optimistically biased, and neither is reliable. It may be seen from (15) that the variance of the design-set error rate is inversely proportional to the number of recorded sample utterances per speaker and to the total number of speakers including the design impostors, so that for a large M a large K is not essential. Fig. 2 and (15) show that for a small number of customers (M), if a sufficiently large number of design impostors (K) is not used, both $\varepsilon_T$ and $\varepsilon_D$ are optimistically biased. Fig. 2 shows that for an ASVS the expected error rates become independent of population size for large populations.

IV PERFORMANCE EVALUATION OF ASIS

An ASIS can be realized as a cascade of (M-1) or M ASVS's as shown in Fig. 3. All the ASVS's are assumed to have identical performance. If there is a reject option at the Mth stage also, the possibility of (M+1) classes corresponding to M customers and an alien class (as not belonging to the system) can be introduced in an ASIS as well. On the other hand, the decision can be terminated at the (M-1)th stage and speaker $S_M$ can be accepted. Let p be the probability of error and q the probability of correct decision of an ASVS. Assuming that the jth speaker has tested the system, we can draw the decision tree as shown in Fig. 4. If $D_i$, $i = 1, \ldots, M$ is the decision taken by the system at the ith stage that the speaker is $S_i$, then we can write the probability of correct decision as

$$P_c = \sum_{j=1}^{M} P(S_j, D_j) = \sum_{j=1}^{M} P(D_j \mid S_j)\, P(S_j) \tag{16}$$

Assuming equal a priori probabilities for all speakers belonging to S, and from Fig. 4, we can write

$$P_c = \frac{1}{M} \sum_{j=1}^{M} q^j = \frac{q}{M} \cdot \frac{1 - q^M}{1 - q} \tag{17}$$

Equation (17) shows the effect of population size on the performance of an ASIS, thus corroborating Doddington's results.

V DISCUSSION OF RESULTS

The design of an ASVS proceeds in three steps: (i) data base preparation, (ii) feature selection and extraction, and (iii) statistical classification and performance evaluation. All the stages are, of course, interrelated. The results of section III provide information on:

(i) Preparation of the data set (number of design sample utterances per speaker (N)).

(ii) The dimension of the feature vector (L). If the N/L ratio is small there will be wide disparities in the performance estimates that will be obtained if the system is tested on the design set or on an independent test set.

(iii) The discriminating ability of a feature depends on the appropriate distance between the classes concerned. If the underlying distributions are Gaussian the distance between classes itself provides an estimate of error. It should be kept in mind, however, that the distance estimate from a finite number of samples per class is a biased estimate of the true distance between the populations.

(iv) The error rates $\varepsilon_T$ and $\varepsilon_D$ are functions of p (Fig. 2). Even if an ASVS is to be designed for a small number of customers M, a sufficiently large number K of impostors should be considered in the design set so as to make the performance estimates reliable. For large M this may not be so important. The design of a verification system is thus equally complex for small or large M because of the presence of the alien class.

REFERENCES

1. V.V.S. Sarma and D. Venugopal, "Performance evaluation of automatic speaker verification systems", IEEE Trans. Acoustics, Speech, Signal Processing (to be published).
2. R.C. Dixon and P.E. Boudreau, "Mathematical model for pattern verification", IBM J. of R and D, Vol. 13, pp 717-721, Nov. 1969.
3. T.W. Anderson and R.R. Bahadur, "Classification into two multivariate normal distributions with different covariance matrices", Ann. of Math. Statist., Vol. 33, pp 420-431, June 1962.
4. M.A. Moran, "On the expectation of errors of allocation associated with a linear discriminant function", Biometrika, Vol. 62, pp 141-148, Apr. 1975.
5. V.V.S. Sarma and D. Venugopal, "Statistical problems in performance assessment of Automatic Speaker Recognition Systems", CI? Report No. 61, Dept. of ECE, Indian Inst. of Science, India, Jan. 197?.
6. D.H. Foley, "Considerations of sample and feature size", IEEE Trans. Information Theory, Vol. IT-18, pp 618-626, Sept. 1972.
7. A.E. Rosenberg, "Automatic speaker verification: a review", Proceedings IEEE, Vol. 64, pp 475-487, Apr. 1976.

Fig. 3 - ASIS as a Cascade of ASVS

Fig. 4 - Decision Tree for Speaker j
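The decision tree of Fig. 4 and the cascade result (17) of section IV lend themselves to a short numerical illustration. The sketch below assumes that (17) reads $P_c = (1/M)\sum_{j=1}^{M} q^j$, i.e. that speaker j is identified correctly only if stages 1 to j all decide correctly, each with probability q, and that all speakers are equally likely. The function names are illustrative and not from the paper.

```python
def asis_correct_sum(q, M):
    """P_c as the explicit sum over speakers: (1/M) * sum of q**j for j = 1..M."""
    return sum(q ** j for j in range(1, M + 1)) / M


def asis_correct_closed(q, M):
    """Closed form of the same geometric sum: (q/M) * (1 - q**M) / (1 - q)."""
    if q == 1.0:
        return 1.0  # limit of the closed form as q tends to 1
    return (q / M) * (1.0 - q ** M) / (1.0 - q)


if __name__ == "__main__":
    # Probability of correct identification for a per-stage accuracy q = 0.99.
    for M in (1, 10, 50, 100):
        print(M, round(asis_correct_sum(0.99, M), 4))
```

Under this reading the probability of correct identification falls steadily as the number of customers M grows, even when each verification stage is highly accurate, which is the population-size effect referred to in section IV.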