Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report

Statistical analysis of speech signals
Blomberg, M. and Elenius, K. O. E.

journal: STL-QPSR
volume: 11
number: 4
year: 1970
pages: 001-008

http://www.speech.kth.se/qpsr

STL-QPSR 4/1970

I. SPEECH ANALYSIS

A. STATISTICAL ANALYSIS OF SPEECH SIGNALS
M. Blomberg and K. Elenius

This is a condensed report of a thesis study carried out at the Department of Speech Communication in 1970. The purpose was to determine, for continuous speech, peakfactor, formfactor, long-time average spectrum of voiced and voiceless sections separately, spectral density at different voice intensity levels, distribution of the speech-wave amplitude, statistics on pause lengths and long-time average RMS of the speech wave. All tasks have been solved using the CDC computer of the Department.

Speech material

Sixteen 1-min long sections of conversational speech were recorded with the participants on separate channels of a twin-track tape-recorder. The recording was made by the Swedish Board of Telecommunications (Televerket), Farsta. A limitation of this material was the relatively low degree of cross-channel attenuation, 25-30 dB. This distortion was found to impair the statistics on pause lengths only. For this particular study we recorded another material with higher inter-channel attenuation.

Distribution of speech-wave amplitude

The amplitude distribution of speech was measured with the equipment shown in Fig. I-A-1.
The computer samples in real time the rectified and short-time (13 msec) integrated speech wave and assembles the distribution by adding 1 to the word in storage that corresponds to the quantized level of the sample. The distribution is displayed as a histogram with a resolution of 2 dB.

Fig. I-A-2 shows a typical histogram of one of the samples of conversational speech, earlier described. The amplitude of the curve is a measure of the probability that the signal level will have the value given by the x-axis. The curve shows two distinct maxima. The one for higher signal levels depends on the speaker that is to be measured. The one for lower levels is due to noise and the voice of the speaker on the other channel. The distance between the maxima gives a measure of the signal-to-noise ratio or, as in this case, the attenuation between the channels, which evidently is 25-30 dB. An exact measure is not possible, as the long-time RMS of the speakers cannot be derived from the histogram.

Fig. I-A-1. The equipment used to measure the distribution of the speech-wave amplitude. (Legible block labels: RECORDER, RECT., INTEGR. with 13 ms time constant, LOG, CONV., COMPUTER CDC 1700.)

Fig. I-A-2. Histogram of the distribution of the speech-wave amplitude. Conversational speech. Speaker K2.

For appropriate measurements on pause lengths, it is of importance where the level of the pause threshold is situated. It should be set as low as possible without allowing noise to exceed the threshold. Fig. I-A-3 shows an amplitude histogram where only one speaker is talking. The noise density maximum is about 30 dB under the speech density maximum.
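The histogram assembly described above (rectify, integrate over roughly 13 msec, quantize the level in 2 dB steps, add 1 to the matching storage word) can be sketched in Python. This is only an illustrative digital stand-in: the function name `level_histogram`, the one-pole smoothing filter used in place of the analog integrator, the full-scale reference of 1.0 and the -80 dB floor are all assumptions, not details taken from the report.

```python
import math
from collections import Counter


def level_histogram(samples, sample_rate, tau=0.013, bin_db=2.0, floor_db=-80.0):
    """Count how often the short-time level falls into each bin_db-wide bin.

    Levels are in dB relative to an amplitude of 1.0 (assumed reference).
    A one-pole low-pass with time constant tau stands in for the analog
    integrator; this is an assumption, not the original circuit.
    Returns a dict mapping the lower edge of each bin (dB) to a count.
    """
    alpha = 1.0 - math.exp(-1.0 / (tau * sample_rate))
    envelope = 0.0
    counts = Counter()
    for x in samples:
        # Rectify the sample, then smooth it.
        envelope += alpha * (abs(x) - envelope)
        level = 20.0 * math.log10(envelope) if envelope > 0.0 else floor_db
        level = max(level, floor_db)
        # Quantize to bin_db steps and add 1 to the matching "storage word".
        bin_index = math.floor(level / bin_db)
        counts[bin_index * bin_db] += 1
    return dict(counts)
```

With real two-channel material one would expect two clusters of bins in the returned dictionary, one for the measured speaker and one for noise and cross-talk, and the spacing between them gives the rough channel attenuation discussed above.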
The pause threshold level should be chosen ca 15-20 dB under the speech density maximum, where the histogram has a minimum. The total pause length is then least dependent on the threshold level. This value is about 35 dB less than the highest speech level during the measured interval.

Pause-length statistics

There are two essential parameters that define the acoustic boundaries of a pause, namely, the maximum amplitude within the pause and its minimum duration. The former puts an upper limit to the range of speech-wave amplitudes which might comprise a pause. The latter fixes the smallest amount of time during which these acoustic amplitudes must persist in order to constitute a pause. Otherwise inter-wave "gaps", that is, moments of relatively low amplitude that usually are present in a complex speech wave, might be detected as pauses. The duration of an inter-wave "gap" is very short and depends on the fundamental frequency, the particular shape of the wave and the level of the pause-maximum amplitude. It is reasonable to state that for complex waves with a fundamental frequency of 100 Hz the duration of the inter-wave "gaps" is necessarily shorter than the period, 10 msec. Pauses, or most pauses, last longer than these durations (4). The sampling rate of the rectified speech wave was 1600 Hz. This should be enough to exclude inter-wave "gaps".

Automatic pause-length analysis is of course affected by clicks, gurgitations, air-stream noise, and other assorted sounds. These sounds will sometimes exceed the amplitude threshold and will, at worst, ripple just about the threshold level, causing several short pauses (usually 10 to 30 msec) in a speech segment which otherwise would be considered as being one pause, since it adds nothing to the speech intelligibility. Raising the pause-maximum amplitude to eliminate these effects will cause the weakest consonants to be considered as pauses. However, for pause lengths longer than approximately 30 msec the pause statistics most probably differ little from those measured with the effects above eliminated.

Fig. I-A-3. Histogram of the distribution of the speech-wave amplitude. Only one speaker. Speaker I.K.

Fig. I-A-4. Pause-length statistics. 12.5 - 250 msec.

Fig. I-A-5. Pause-length statistics. 12.5 - 1000 msec.

Long-time mean value

The long-time mean value of the rectified speech wave (LMV) was measured by averaging the rectified and integrated speech signal. Together with the formfactor (f) it will give the long-time average RMS (LRMS), since f = LRMS / LMV.

In order to measure the LMV for speech only and not for all pauses, the following criterion has been used. If the level of the rectified and integrated speech wave is lower than a given threshold for a shorter time than 100 msec, it will be used in the calculation of LMV, while pauses longer than 100 msec will be neglected. The boundary was chosen at 100 msec after studying occlusion lengths in continuous speech. Later measurements showed that this parameter was not critical. The same results were obtained using the boundaries 50 msec and 500 msec.
The other parameter, the threshold level, was not critical either. Changing it by 20 dB did not affect the result which, however, was quantized in steps of 2 dB.

The value of LRMS was later used as a reference level when a speech-controlled, loudspeaking telephone system was simulated by the computer. The results of this investigation will be discussed in a forthcoming report.

Formfactor

The formfactor was derived in the following way. The speech signal was sampled and digitalized. In real time, the computer accumulates the samples $x_n$ and calculates the sums $\sum x_n^2$ and $\sum |x_n|$. The formfactor, the ratio of the RMS value to the mean rectified value, is then computed from the equation

$$ f = \frac{\sqrt{\frac{1}{N}\sum_{n} x_n^2}}{\frac{1}{N}\sum_{n} |x_n|} $$

where $N$ is the number of accumulated samples. With this definition the formfactor of a sine wave will be 1 dB ($\pi / (2\sqrt{2}) \approx 1.11$, i.e. about 0.9 dB). Pauses in speech were treated the same way as when calculating LMV (see above).

The sampling frequency used was 10 kHz. This is somewhat too low for securing a complete waveform reconstruction. However, it can be shown (6) that the formfactor can be derived from the statistical distribution of the signal level. The problem is then reduced to choosing a sampling density that allows a sufficiently good approximation to the distribution. This condition is easier to meet. Measurements on conversational speech of 1 min length, varying the sampling frequency between 50 and 10,000 Hz, have verified this. See Table I-A-1.

Table I-A-1. Formfactor at different sampling frequencies.

Sampling frequency (Hz)    50     500    5000   10000
Formfactor                 1.4    1.6    1.6    1.6

The formfactor of each speaker was determined. The results are shown in Table I-A-2.

Table I-A-2. Distribution of formfactors. (Rows: number of male subjects, number of female subjects, total number. Columns: formfactor 3 dB, 4 dB, 5 dB. The cell values are not legible in the source.)

The formfactor varies between 3 and 5 dB with a mean value of about 4 dB. This corresponds well to measurements by Fant, who found a value of 3.5 dB.

Peakfactor

The peakfactor, defined as

peakfactor = (largest value during the observed time) / RMS

has been measured for 1 min long parts of speech and 10 sec long sections of the same speech. The results give a peakfactor of 16.5 ± 1.0 dB for the 1 min long parts, while the corresponding value for the sections of 10 sec was 17.0 ± 1.5 dB. The small difference indicates that the peakfactor does not vary much with durations longer than 10 seconds. Fant and Sjöholm (5) have measured a peakfactor of 17 dB, which conforms quite well with these results.

Long-time average speech spectrum

The 51-channel filterbank was used for measurements of the long-time average speech spectrum, including separate statistics for voiced and voiceless segments. The computer program involved a calculation of the mean value of sampled spectrum sections.

The total range of the filterbank was 0-9.6 kHz. Constant filter bandwidths of B = 250 Hz and a mel-scale frequency display were chosen.

The detection of voiced and voiceless segments of speech employed one channel with normal high-frequency pre-emphasis (6 dB/oct, 200-5000 Hz) and one with low-pass filtering (330 Hz) of the signal, see Fig. I-A-6. Sound is detected when the output level of the pre-emphasized signal is higher than a threshold level, located ca 30 dB below the maximum intensity during the measured interval.
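The sound-detection rule just stated (pre-emphasized level compared with a threshold about 30 dB below the maximum over the measured interval) can be sketched digitally. Everything specific here is an assumption for illustration: the first-order difference filter with coefficient 0.95 only loosely stands in for the analog 6 dB/oct pre-emphasis, the frame-wise mean absolute value stands in for the level detector, and the function names are hypothetical.

```python
import math


def pre_emphasize(samples, coeff=0.95):
    """First-order difference y[n] = x[n] - coeff * x[n-1] (assumed stand-in for pre-emphasis)."""
    out, prev = [], 0.0
    for x in samples:
        out.append(x - coeff * prev)
        prev = x
    return out


def frame_levels_db(samples, frame_len, floor_db=-120.0):
    """Level of each non-overlapping frame in dB, from the mean absolute value."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_abs = sum(abs(x) for x in frame) / frame_len
        levels.append(20.0 * math.log10(mean_abs) if mean_abs > 0.0 else floor_db)
    return levels


def detect_sound(samples, frame_len, drop_db=30.0):
    """Flag frames whose pre-emphasized level exceeds (maximum level - drop_db)."""
    levels = frame_levels_db(pre_emphasize(samples), frame_len)
    threshold = max(levels) - drop_db
    return [level > threshold for level in levels]
```

Because the threshold is tied to the maximum over the whole interval, the decision can only be made after the complete interval has been scanned once; a real-time variant would need a fixed or previously calibrated threshold instead.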
This threshold is not exactly comparable, though, to the pause-maximum amplitude level mentioned above, as the weaker voiceless fricatives will be stronger because of the pre-emphasis. If the output of the LP-filtered signal is higher than an adjusted level, the sound is judged as voiced; if it is lower, the sound is judged as voiceless. When the conditions for the sound class that is to be measured are met, the filters of the filterbank are synchronously sampled, and the spectrum sections are added to an accumulative area in the computer storage to form the average spectrum. The maximum sampling rate is about 160 Hz.

The effects of the low cross-channel attenuation were eliminated by means of sound detection on the channel other than the one measured (channel 2). When sound is detected there, no measurement is made. Otherwise, voiced segments of the speaker on this channel might be detected as voiceless on channel 1 and give a wrong average spectrum of voiceless segments.

Average spectra with no voiced-voiceless separation for one male and one female speaker, computed by this method, are shown in Fig. I-A-7 and I-A-8. They are compared with spectra after Dunn & White (7) and French & Steinberg (8). There is no great difference to be found.

Separate spectra of voiced and voiceless segments of the same speakers are shown in Figs. I-A-9 and I-A-10. Voiced sounds, being more frequent and stronger than the unvoiced, dominate the spectrum of the total speech wave. This is especially the case for frequencies below 3 kHz. Above 3 kHz the spectrum falls more rapidly.

Fig. I-A-7. Long-time average spectrum. Male subject, M1. (Curves: measured spectrum, M1; average spectrum after Dunn & White (male); idealized average spectrum after French & Steinberg. Frequency axis in kHz.)

Fig. I-A-8. Long-time average spectrum. Female subject, K4. (Curve: measured spectrum, K4. Frequency axis in kHz.)

Fig. I-A-10. Long-time average spectrum. Separation voiced-voiceless. Voiced spectrum attenuated 10 dB. Female subject, K4.

The voiceless sounds, which are characterized by lack of glottal pulses, include not only voiceless fricatives but also aspiratives, bursts after stops, and, more or less desirable, breathing sounds before sentences. These sounds and dark /sj/ will contribute to a spectrum shape deviating from the expected high-frequency emphasis. Fig. I-A-11 shows an average spectrum for a set of aspirative, whispered vowels (the vowel symbols are not legible in the source). The energy is mainly concentrated below 2 kHz. This finding is in agreement with the average spectrum of voiceless sounds in this region, because the aspiratives constitute a great part of these.

Spectra of voiceless sounds have an energy minimum between 3 and 4 kHz. Most probably, this depends on the fact that neither the aspiratives nor the fricatives contain any larger amount of energy at this frequency. The fricatives, where /s/ is the most frequent, have the main part of their energy above 4 kHz and the aspiratives below 2 kHz.

Average spectrum at different voice intensities

Two subjects made a 20-sec long utterance, the same for both, 4 times, varying the voice intensity in 10 dB steps between -10 and +20 dB rel. normal speech as measured by a VU-meter. Average spectra without voiced-voiceless separation were measured. The recording and playback levels were adjusted so that low-voice effort would give about the same output signal as high. This was done in order to get the same precision for all intensities. The spectra are shown in Fig. I-A-12 and I-A-13.

High-voice effort differs from low mainly by the spectrum slope below 1 kHz. For speaker J.S., the intensity falls from 300 to 1000 Hz by ca 18 dB for -10 dB voice intensity and only ca 5 dB for +20 dB. The corresponding values for speaker K.E. are 35 resp. 10 dB. This effect is due to the voice source. At low-voice effort, the source spectrum falls more rapidly in this frequency region. The level in the fundamental frequency region is higher for low-voice intensity, which is an apparent feature of the figures. A contributing effect is also the increase of F0 at higher voice intensities.

For medium and high-voice effort, there is a distinct peak between 3 and 4 kHz. This is the so-called "singing formant". It is related to the sinus piriformis, a cavity around the upper part of the larynx. It is enlarged when the voice intensity increases, which probably causes new formants to be formed compared with normal speech.

Fig. I-A-11. Average spectrum. Whispered vowels. Speaker K.E.

Fig. I-A-12. Average spectrum at different voice effort. Male speaker J.S. (Curves: -10, 0, +10 and +20 dB rel. normal speech level. Frequency axis in kHz.)

Fig. I-A-13. Average spectrum at different voice effort. Male speaker K.E. (Curves: -10, 0, +10 and +20 dB rel. normal speech level. Frequency axis in kHz.)
A more thorough explanation is made by J. Sundberg in this issue of STL-QPSR.

References:

(1) G. Fant: "Acoustic Analysis and Synthesis of Speech with Applications to Swedish", Ericsson Technics, No. 1 (1959).
(2) G. Fant: "Speech Analysis and Synthesis", Royal Institute of Technology, Speech Transmission Laboratory Report No. 26 (1962), pp. 24-25.
(3) G. Fant: "Undersökning av 10 sek standardfras", L M Ericsson protokoll nr H/P-1051.
(4) O. Tosi: "Detection and Measurement of Pauses in Speech", American Speech and Hearing Association (1966).
(5) C. Sjöholm: "Amplitudfördelning för tal", Televerkets Provningsanstalt protokoll nr 243/67.
(6) M. Blomberg and K. Elenius: "Statistisk analys av talsignaler", Thesis work at the Department of Speech Communication, KTH, Stockholm (1970).
(7) H. K. Dunn and S. D. White: "Statistical Measurements on Conversational Speech", J. Acoust. Soc. Am. 11 (1940), pp. 278-288.
(8) N. R. French and J. C. Steinberg: "Factors Governing the Intelligibility of Speech Sounds", J. Acoust. Soc. Am. 19 (1947), pp. 90-119.