SUBJECTIVE AGE ESTIMATION OF TELEPHONIC VOICES Loredana Cerrato, Mauro Falcone, Andrea Paoloni Fondazione Ugo Bordoni via Baldassarre Castiglione 59, 00149 Roma Tel: +39.06.54801 Fax : +39.06.5480.4405 Email: {loredana, falcone, pao}@fub.it Abstract The aim of this study was to investigate the extent to which listeners can judge some unknown speakers' characteristics. In particular we are interested in investigating the speaker accuracy of assessing age and gender by only hearing the voice recorded over the telephone line. Moreover we tried to evaluate the reliability of subjective age estimation in the forensic context. The results of the statistical analysis we carried out show that listeners are capable of assigning a general chronological age category to a voice without seeing or knowing the talker and they are able to distinguish between male and female voices transmitted over the telephone line. Introduction Everyday experience allows us to state that it is possible to make an estimate of some of the characteristics of an unknown speaker by only listening to his/her voice. Some of these characteristics are related to speakers’ physiological characteristics, such as gender, age, weight, some other depend on their cultural and social background: geographical provenience, level of education, and also emotional state. Many studies have examined the extent to which listeners can judge a speaker's characteristics from voice cues only. The results show that gender can be accurately judged (Mullenix, 1995), for height and weight assessment there are different results: Lass (Lass et alii, 1980) reported that these two speakers' characteristics could be easily assessed even in different conditions of signal distortion; while more recently Kunzel (Kunzel, 1989) reported negative results for the reliability of height and weight estimation. Listeners appear to be able to judge also if a speaker is sober or intoxicated (Pisoni, 1989). Several researchers reported that listeners are able to make fairly accurate judgements about speakers' age from voice cues only. Both the results of tests requiring the subjects to indicate the age group of the target voice (Neiman 1990, this study) and to provide a direct age estimate (Ryan and Burk, 1974; Hory and Ryan, 1981; Braun, 1996), show consistent results indicating that estimates of age are generally accurate within a decade. Most of the previous studies on subjective age estimation used speech samples produced under controlled phonation, recorded in normal conditions (Shipp and Hollien, 1969; Linville and Fisher 1985; Ryan and Burk, 1974; Hory and Ryan, 1981; Neiman and Applegate, 1990; Braun, 1996). 1 We thought it could be also interesting to use voice samples produced in a natural way and recorded from telephone calls, as the application we envisage is forensic voice profiling in forensic, in particular for the critical task of identification of voices recorded by means of phone tapping . Experimental set up We used a set of 42 voice samples, which were equally balanced between the genders with 21 male and 21 female voices. The samples were randomly selected from the SIVA database (Speaker Identification and Verification Archive) gathered over the Italian public telephone line and consisting of 2000 phone calls made by 1000 speakers, equally distributed by sex, age and geographic area (Cerrato, 1997). The 42 voices were almost equally distributed in the three main Italian geographic areas: Northern, Southern and Central Italy, which coincide with the three main regional varieties of the Italian language. Italian is characterized by many dialects and regional varieties; the broadest way to point out the large geographical variability of the Italian language is to identify three main regional varieties, which correspond to the three main geographical subdivisions of the Italian territory (Rohlfs, 1996). Each speaker read the same short passage of the length of about 40 seconds, consisting of a fictitious curriculum vitae, from which it was not possible to deduce any personal information related to the speaker. Normally the response to the subjective age estimation task can be done in two different ways: indicating the age group of the target voice or providing a direct age estimate. For our investigation we followed the former procedure, establishing 7 age groups of 7 year-range as follow: 18-24, 25-31, 32-38, 39-45, 46-52, 53-59, 60-66. 3 male and 3 female speakers were selected for each age group for a total of 42 speakers. 17 listeners (male and female mostly university students) with no hearing pathologies took part to our experiment; they had no training in perceptual phonetics. They were paid the equivalent of 30Euro to carry out the test of the total length of 40 minutes, which was submitted in two parts with an interval of 10 minutes. The subjects listened, over headphones at a comfortable listening level (about 73 dBA at the ear reference point), to the 42 voices reading the same passage. They were required to indicate the perceived age group and the gender of each speaker, ticking the appropriate answer among the ones given in the answer sheet. Analysis of the data The data were analysed by means of statistical analysis in order to investigate the magnitude and the distribution of the errors in relation to the speaker’s age. This has been done by means of standard analysis and with the support of a series of graph and tables. A comparison of the obtained results with the achievement of other researchers has also been performed, in order to check the reliability of our experiment and, more importantly, to allow a comparison between data gathered from telephonic speech and normal speech. Several researchers (Neiman 1990, Braun, 1996) believe that a possible representation of the accuracy of perceived age estimation can be given by correlation or a similar estimator, which are ad hoc relationships between chronological age (CA) and perceived age (PA). When listeners are constrained to give their judgement according to age groups, chronological age is represented by the central value of the age class, while perceived age is represented by the mean value of all the 2 answers given by all the listeners. The percent of correspondence between CA and PA is always calculated dividing the lowest of these two values by the highest ones, thus obtaining a result necessarily minor or equal to 1. For example if the CA of a sample is 18-24 (mean value 21) and the PA is 32, the percent of correspondence is 21/32, that is 0.65 or as commonly reported a 65% of accuracy. As a consequence of this formulation a percent correspondence close to the unit means that listeners correctly judge the speaker’s age. On the contrary, as the value approaches 0, the estimation of the speaker's age gets progressively worse. In table 1 we show the percent correspondence between CA and PA in our experiment. In table 2, we summarize the values obtained by other researchers (for non telephonic voices). In the test we run,we find quite high values ranging from 68% and 78% that are the worst cases in correspondence of youngest and oldest age group, and reaching the highest value of 96% in the central group. A more widely spread and common measure, that reflects the extent of a linear relationship between two data sets, is the dimensionless index that ranges from -1.0 to 1.0 named Pearson product moment correlation. Also in this case the best result is represented by the value 1. This measure has been used to calculate the correlation between perceived age and chronological age by different researchers, on high quality speech material, as reported in table 3. In our case the Pearson correlation gives a coefficient equal to 0.77. Notwithstanding the same aim of these two measures, the ranking of subject judgment ability may not, as in the case, be the same as reported in table 2 and 3. For instance, the Pearson correlation in (Neiman, 1996) is higher than Pearson correlation in (Braun 1996), while for the percent correct correspondence we have an opposite situation. These measures only evaluate the subject’s judgment trend in relation to the distribution of speaker’s age and do not take into account the extent of the errors, which we may be interested in. In fact we analysed more in detail the errors, also in relation to the speaker’s age, and we investigated the effect of over and under estimation by listeners. In order to do this, we had to perform further analyses as the previous results were not able to describe that specific aspect. It is possible to consider 2 errors: the absolute error and the relative error. The former is more suitable to estimate the extent to which a subject is able to correctly assess a speaker voice, as it integrates errors as positive values without considering the sign of the distance between estimated age and chronological age. The latter is more suitable when we are interested in the extent to which a class of listeners is able to properly assign an age to an unknown voice. In this case it is possible to balance positive and negative errors, as we are looking at the estimate given by the pool of subjects rather than the estimate of single listeners. This last measure is more appropriate in the forensic environment. A simple but effective way to perform this analysis, is the mapping of perceived age and its error for each age class, as reported in figure1. It is clear from this graph that listeners tend to overestimate young speakers and to underestimate old speakers. This tendency was pointed out also by Ship and Hollien (Ship and Hollien, 1969) in their experiment in which subject performed a direct age estimation task. As we performed an age estimation test constrained by age groups, we also calculated the percentage of correct answer for each age class, reported in figure 2. The phenomenon of overestimation of young speakers and of underestimation of older speakers is more evident in the detailed set of figures 2a-2g, where we report the answer distribution for each of the seven age classes. The results show that listeners assigned the correct age class mainly to the age group 46-52 (figure 2e). Listeners very rarely guessed the age of young speakers (figure 2a) which were overestimated. Older speakers (figure 2f and2g) were underestimated, in particular for 3 the age group 60-66 (figure 2g) most of the answers fall in the age group 46-52, that is an underestimation of 14 years. This kind of representation (figure 2) seems to outline that the listeners’ performance is not very good, as on average the overall percentage of correct answers we obtain is only 27%. It shouldn't be a surprise that these two performed analyses give apparently inconsistent results, in fact correlation-like analyses give quite good results, while the value of overall percentage of correct answers is very low. Again we wish to underline that these two analyses aim at different investigation so their results do not necessarily have to be similar The results of the statistical analysis we carried out show that listeners were remarkably correct at estimating the age of some speakers, but not that of others. In general they are able to assign an unknown voice to an age group, in fact if the age classes had a bigger range (21 years) the percentage of correct answers would result higher. This is suggested by the representation reported in figure 3, where both the percentages of correct answers for each age class and the percentage of answers given to the adjacent age classes are plotted. Finally if we consider the speakers as forming only three groups: younger speakers, adult speakers and older speakers, we can see that the overall percent of correct answers rise up to 71%, as shown in figure 4. Judgements relative to gender. The other goal of our experiment was to determine whether listeners are able to distinguish between male and female voice of unknown speaker while hearing his/her voice over the phone. Our results confirm the well established ability of listeners to distinguish between male and female voice. In fact the percentage of correct answer relatively to the gender of the speakers is 100%, which means that none of the listeners ever doubted of the gender of the speakers heard in the voice samples. Discussion The main aim of this study was the evaluation of the ability of listeners to estimate the age of unknown speakers in a subjective estimation task and the reliability of this subjective measure for forensic purposes. The estimation of speakers' age is one of the most frequent tasks in speaker profiling for forensic purposes: whenever an anonymous recording is submitted for analysis, an estimation of the speaker's age and sometimes also of some other of his/her characteristics, are requested by investigators. This usually results in a major lead in the identification and search of the person (Nolan 1996), for this reason it is important to understand the reliability of this measure. Our results show that estimates of age are generally accurate only for broad age classes and confirm the intuitive expectation that people are able to assign a general chronological age category to a voice without seeing or knowing the talker. Therefore in order to accomplish the task of age estimation in forensics, it is advisable to use general descriptions such as "young", "adult", "old" or similar, rather than the indication of age using age groups. Moreover our result show that listeners are able to correctly recognize the gender of an unknown speakers. 4 REFERENCE Braun, A., 1996. Age estimation by different listener groups, The Journal of Speech Language and the Law, vol. 3 n. 1 , pp.65-73 Cerrato, L. 1997: Sulle informazioni contenute nel SIVA, un database per la verifica e l’identificazione del parlatore, Relazione FUB 5b05197, raccolta pubblicazioni FUB 1997 p.61-70 Horii, Y., Ryan, W.J., 1981. Fundamental Frequency Characteristics and Perceived Age of Adult Male Speakers, Folia Phoniatrica, vol. 33, pp. 227-233 Kunzel, H., 1989. How well does average fundamental frequency correlate with speaker height and weight?, Phonetica, vol. 46, pp. 117-125 Lass, N.J., Phillips, J.K., Bruchey, C.A., 1980. The effect of filtered speech on speaker height and weight identification, Journal of Phonetics, vol. 8, pp. 91-100 Linville, S.E., Fisher, H.B., 1985. Acoustic characteristics of perceived versus actual vocal age in controlled phonation by adult females, JASA, vol. 78, pp. 40-48 Mullenix, J.W., Johnson, K.A., Topcu-Durgun, M., Farnsworth, L.M., 1995. The perceptual representation of voice gender, JASA, vol. 98, pp. 3080-3095 Neiman, G.S., Applegate, J.A., 1990. Accuracy of listener judgements of perceived age relative to chronological age in adults, Folia Phoniatrica, vol. 42, pp. 327-330 Nolan, F., 1996. Speaker recognition and forensic phonetics, in "The handbook of Phonetic sciences" by Hardcastle W., Laver J., pp. 744-766 Pisoni, D.B., Martin, C.S., 1989. Effects of alcohol on the acoustic phonetic properties of speech: perceptual and acoustic analyses, Alcoholism clinical and experimental research, vol. 13, pp. 577-587 Rohlfs, G., 1996. Studi e Ricerche su lingua e dialetti d' Italia, Sansoni Editore Ryan, W.J., Burk, K.W., 1974. Perceptual and Acoustic Correlates of Aging in the Speech of Males, Journal of Communication Disorder, vol. 7, pp.181-192 Shipp, T., Hollien, H., 1969. Perception of the Aging Male Voice, JSHR, vol. 12, pp. 703-710 5 Biographical notes of the authors: Loredana Cerrato received an honor degree in Linguistics by the University of Naples "Federico II" in 1992. In 1995 she was visiting scientist at the National Acoustical Laboratories (AHS) in Chatswood, Sydney, Australia. From August 1996 to April 1999 she has worked at Fondazione Ugo Bordoni in the Multimedia Communication Division of the Voice Communication Group. In March 1999 she started a Doctorate course in “Language theories and Speech technology” at the University of Bari. Her interests cover the fields of speech analysis, speech perception, sociolinguistics, experimental phonetics, forensic phonetics, speech recognition and automatic translation. The results of her research lead to the publication of 25 papers in conference proceedings and scientific journals. Mauro Falcone received the degree in Physics in 1988. He is a researcher in the Multimedia Communication Division, in the Speech Communication group at Fondazione Ugo Bordoni. His interests cover: speech quality evaluation; speech output system assessment; speaker verification/identification assessment; interactive and speech driven system assessment. He has published more than 50 works among project reports, articles in books, scientific journal, conference proceedings. Andrea Paoloni received the degree in Electronic Engineering from the University of Rome, in 1973. Since 1974 he has been working as researcher at the Fondazione Ugo Bordoni, where he is now Head of the Speech Communication Group. His research activity covers the fields of speech processing, with particular emphasis in speaker recognition and systems evaluation. He has authored 120 papers. He has been working also for the criminal justice system as an expert in speaker identification and authentication of recordings. He is now also chairman of the COST 250 project (Speaker recognition in telephony). Table 1 Percent of correspondence between chronological and perceived age Age classes CA PA CA vs. PA 18-24 21 31 68% 25-31 28 35 80% 32-38 35 38 92% 39-45 42 46 91% 46-52 49 47 96% 53-59 56 48 86% 60-66 63 49 78% 84% (mean) 6 Table 2 Summary of percent correspondence between chronological and perceived age in previous studies (*Braun used only male voices) Authors year result Neiman, Applegate 1990 80.2% Braun*-(expert) 1996 85.5% Braun* (non expert) 1996 84.1% this study 1998 84.4% Table 3 Summary of Pearson correlation coefficient established in previous studies Authors year Pearson coefficient Ship, Hollien 1969 0.88 Ryan, Burk 1974 0.77 Horii, Ryan 1981 0.76 Neiman, Applegate 1990 0.88 Braun (expert) 1996 0.70 Braun (non expert) 1996 0.68 this study 1998 0.77 80 70 60 50 40 30 20 10 0 18-24 25-31 32-38 39-45 46-52 53-59 60-66 Fig. 1 Perceived age and its error. Triangles stand for mean of perceived age, circles represent the central value of the age class. 7 50 40 30 20 10 0 18-24 25-31 32-38 39-45 46-52 53-59 60-66 Fig. 2 Percentage of correct estimates for age classes 50 50 40 40 30 30 20 20 10 10 0 0 21 28 35 42 49 56 63 Figure 2a – Age class 18-24 21 28 35 42 49 56 63 49 56 63 Figure 2b – Age class 25-31 50 50 40 40 30 30 20 20 10 10 0 0 21 28 35 42 Figure 2c – Age class 32-38 49 56 63 21 28 35 42 Figure 2d – Age class 39-45 8 50 50 40 40 30 30 20 20 10 10 0 0 21 28 35 42 49 56 63 21 Figure 2e – Age class 46-52 28 35 42 49 56 63 Figure 2f – Age class 53-59 50 Fig.2a-g In these figures the distribution of answers for each of the seven classes is reported. On the abscissa there is the central value of the age class, on the ordinate the percentage of answers given by listeners. The dot on the top of each figure indicates the correct chronological age. 40 30 20 10 0 21 28 35 42 49 56 63 Figure 2g – Age class 60-66 90 80 70 60 50 40 30 20 10 0 18-24 25-31 32-38 39-45 46-52 53-59 60-66 Fig. 3 Percentage of correct estimation (black) and percentage of answers given to adjacent classes (gray). 9 100 80 60 40 20 0 18-38 "young" 32-52 "adult" 45-66 "older" Fig. 4 Percentage of correct answers for broad age classes. 10