Statistical analysis of speech signals

Dept. for Speech, Music and Hearing, Quarterly Progress and Status Report
Statistical analysis of speech signals
Blomberg, M. and Elenius, K. O. E.
STL-QPSR, volume 11, number 4, 1970, pages 001-008
http://www.speech.kth.se/qpsr

I. SPEECH ANALYSIS

A. STATISTICAL ANALYSIS OF SPEECH SIGNALS
M. Blomberg and K. Elenius

This is a condensed report of a thesis study carried out at the Department of Speech Communication in 1970. The purpose was to determine, for continuous speech, peakfactor, formfactor, long-time average spectrum of voiced and voiceless sections separately, spectral density at different voice intensity levels, distribution of the speech-wave amplitude, statistics on pause lengths and long-time average RMS of the speech wave. All tasks have been solved using the CDC computer of the Department.

Speech material

Sixteen 1-min long sections of conversational speech were recorded with the participants on separate channels of a twin-track tape-recorder. The recording was made by the Swedish Board of Telecommunications (Televerket), Farsta. A limitation of this material was the relatively low degree of cross-channel attenuation, 25-30 dB. This distortion was found to impair the statistics on pause lengths only. For this particular study we recorded another material with higher inter-channel attenuation.

Distribution of speech-wave amplitude

The amplitude distribution of speech was measured with the equipment shown in Fig. I-A-1.
The computer samples in real time the rectified and short-time (13 msec) integrated speech wave and assembles the distribution by adding 1 to the word in storage that corresponds to the quantized level of the sample. The distribution is displayed as a histogram with a resolution of 2 dB.

Fig. I-A-2 shows a typical histogram of one of the samples of conversational speech described earlier. The amplitude of the curve is a measure of the probability that the signal level will have the value given by the x-axis. The curve shows two distinct maxima. The one at higher signal levels is due to the speaker being measured. The one at lower levels is due to noise and the voice of the speaker on the other channel. The distance between the maxima gives a measure of the signal-to-noise ratio or, as in this case, the attenuation between the channels, which evidently is 25-30 dB. An exact measure is not possible, as the long-time RMS of the speakers cannot be derived from the histogram.

Fig. I-A-1. The equipment used to measure the distribution of the speech-wave amplitude. Blocks shown: recorder, rectifier, integrator (13 msec), log. converter, computer CDC 1700.

Fig. I-A-2. Histogram of the distribution of the speech-wave amplitude. Conversational speech. Speaker K2.

For appropriate measurements on pause lengths, it is important where the level of the pause threshold is situated. It should be set as low as possible without allowing noise to exceed the threshold. Fig. I-A-3 shows an amplitude histogram where only one speaker is talking. The noise density maximum is about 30 dB under the speech density maximum.
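The histogram measurement described above can be sketched in a few lines. This is only an illustration under stated assumptions: the report names a 13 msec integration and a 2 dB resolution, but the one-pole (leaky) integrator, the full-scale reference of 1.0, the -80 dB floor and the name `level_histogram` are all hypothetical choices made here, not details taken from the original equipment or program.

```python
import math
from collections import Counter

def level_histogram(samples, fs, tau=0.013, step_db=2.0, floor_db=-80.0):
    """Count how often the short-time level falls in each step_db-wide bin.

    Levels are in dB relative to an assumed full-scale amplitude of 1.0.
    The integrator is modelled as a one-pole low-pass (an assumption).
    """
    alpha = 1.0 - math.exp(-1.0 / (fs * tau))   # smoothing coefficient for time constant tau
    env = 0.0
    counts = Counter()
    for x in samples:
        env += alpha * (abs(x) - env)           # rectify, then integrate over about tau seconds
        level_db = 20.0 * math.log10(env) if env > 0.0 else floor_db
        level_db = max(level_db, floor_db)
        bin_db = step_db * math.floor(level_db / step_db)   # lower edge of the bin
        counts[bin_db] += 1                     # "adding 1 to the word in storage"
    return counts

# Usage: a steady tone-like level of 0.5 settles in the bin starting at -8 dB.
hist = level_histogram([0.5] * 8000, fs=8000)
print(sorted(hist.items())[-1])
```

With two talkers on imperfectly separated channels, such a histogram would be expected to show the two maxima discussed above, one for the measured speaker and one for noise plus cross-talk.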
The pause threshold level should be chosen ca 15-20 dB under the speech density maximum, where the histogram has a minimum. The total pause length is then least dependent on the threshold level. This value is about 35 dB less than the highest speech level during the measured interval.

Pause-length statistics

There are two essential parameters that define the acoustic boundaries of a pause, namely, the maximum amplitude within the pause and its minimum duration. The former puts an upper limit to the range of speech-wave amplitudes which might comprise a pause. The latter fixes the smallest amount of time during which these acoustic amplitudes must flow in order to constitute a pause. Otherwise inter-wave "gaps", that is, moments of relatively low amplitude that usually are present in a complex speech wave, might be detected as pauses. The duration of the inter-wave "gap" is very short and depends on the fundamental frequency, the particular shape of the wave and the level of the pause-maximum amplitude. It is reasonable to state that for complex waves with a fundamental frequency of 100 Hz the duration of the inter-wave "gaps" is necessarily shorter than the period, 10 msec. Pauses, or most pauses, last more than these durations (4). The sampling rate of the rectified speech wave was 1600 Hz. This should be enough to exclude inter-wave "gaps".

Automatic pause-length analysis is of course affected by clicks, gurgitations, air-stream noise, and other assorted sounds.
These sounds will sometimes exceed the amplitude threshold and will, at worst, ripple just around the threshold level, causing several short pauses (usually 10 to 30 msec) in a speech segment which otherwise would be considered as being one pause, since it adds nothing to the speech intelligibility. Raising the pause-maximum amplitude to eliminate these effects will cause the weakest consonants to be considered as pauses. However, for pause lengths longer than approximately 30 msec the pause statistics most probably differ little from those measured with the effects above eliminated.

Fig. I-A-3. Histogram of the distribution of the speech-wave amplitude. Only one speaker. Speaker I.K.

Fig. I-A-4. Pause-length statistics. 12.5 - 250 msec.

Fig. I-A-5. Pause-length statistics. 12.5 - 1000 msec.

Long-time mean value

The long-time mean value of the rectified speech wave (LMV) was measured by averaging the rectified and integrated speech signal. Together with the formfactor (f) it will give the long-time average RMS (LRMS), since

$f = \mathrm{LRMS} / \mathrm{LMV}$

In order to measure the LMV for speech only and not for all pauses, the following criterion has been used. If the level of the rectified and integrated speech wave is lower than a given threshold for a shorter time than 100 msec it will be used in the calculation of LMV, while pauses longer than 100 msec will be neglected. The boundary was chosen at 100 msec after studying occlusion lengths in continuous speech. Later measurements showed that this parameter was not critical. The same results were obtained using the boundaries 50 msec and 500 msec.
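The pause criterion used for the LMV can be illustrated with a short sketch. It assumes the rectified and integrated level is already available as a sequence of values at a known frame rate; the function name `long_time_mean`, its parameters, and the choice to treat a run of exactly 100 msec as a pause are assumptions made for illustration, not details given in the report.

```python
def long_time_mean(env, frame_rate, threshold, max_gap=0.100):
    """Mean of the level track `env`, ignoring pauses.

    A run of sub-threshold values shorter than `max_gap` seconds is kept
    (it is treated as part of the speech, e.g. an occlusion); a run of
    `max_gap` seconds or longer is treated as a pause and left out.
    """
    limit = int(round(max_gap * frame_rate))   # run length, in frames, that counts as a pause
    total, count = 0.0, 0
    run = []                                   # current run of sub-threshold values
    for v in env:
        if v < threshold:
            run.append(v)
            continue
        if len(run) < limit:                   # short dip: include it in the average
            total += sum(run)
            count += len(run)
        run = []
        total += v
        count += 1
    if len(run) < limit:                       # handle a short dip at the very end
        total += sum(run)
        count += len(run)
    return total / count if count else 0.0

# Usage: 10 msec frames, a 30 msec dip (kept) and a 200 msec pause (skipped).
track = [1.0] * 5 + [0.0] * 3 + [1.0] * 5 + [0.0] * 20 + [1.0] * 5
print(long_time_mean(track, frame_rate=100, threshold=0.1))
```

Multiplying the result by the formfactor then gives the long-time average RMS, following the relation f = LRMS / LMV stated above.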
Likewise the other parameter, the threshold level, was not critical either. Changing it by 20 dB did not affect the result which, however, was quantized in steps of 2 dB.

The value of LRMS was later used as a reference level when a speech controlled, loudspeaking telephone system was simulated by the computer. The results of this investigation will be discussed in a forthcoming report.

Formfactor

The formfactor was derived in the following way. The speech signal was sampled and digitized. In real time, the computer accumulates samples and calculates the sums $\sum_n x_n^2$ and $\sum_n |x_n|$. The formfactor is then computed from the equation:

$f = \dfrac{\sqrt{\frac{1}{N}\sum_{n=1}^{N} x_n^2}}{\frac{1}{N}\sum_{n=1}^{N} |x_n|}$

With this definition the formfactor of a sine wave will be 1 dB. Pauses in speech were treated the same way as when calculating LMV (see above).

The sampling frequency used was 10 kHz. This is somewhat too low for securing a complete waveform reconstruction. However, it can be shown (6) that the formfactor can be derived from the statistical distribution of the signal level. The problem is then reduced to choosing a sampling density that allows a sufficiently good approximation to the distribution. This condition is easier to meet. Measurements on conversational speech of 1 min length, varying the sampling frequency between 50 and 10,000 Hz, have verified this. See Table I-A-1.

Table I-A-1. Formfactor at different sampling frequencies.

Sampling frequency (Hz)   50    500   5000   10000
Formfactor                1.4   1.6   1.6    1.6

The formfactor of each speaker was determined. The results are shown in Table I-A-2.
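As a check on the definition, the formfactor can be computed from the two running sums in a few lines. The sketch below reads the formfactor as RMS divided by the mean rectified value, which is consistent with the relation f = LRMS / LMV and with the sine-wave value of about 1 dB quoted in the text; the function name `formfactor` and the conversion to decibels as 20 log10 of the ratio are assumptions made here.

```python
import math

def formfactor(samples):
    """Return (ratio, dB) of RMS over mean rectified value.

    Only two running sums are needed, sum of x^2 and sum of |x|,
    so the samples can be processed one at a time.
    """
    n = 0
    sum_sq = 0.0
    sum_abs = 0.0
    for x in samples:
        sum_sq += x * x
        sum_abs += abs(x)
        n += 1
    if n == 0 or sum_abs == 0.0:
        raise ValueError("need at least one non-zero sample")
    ratio = math.sqrt(sum_sq / n) / (sum_abs / n)
    return ratio, 20.0 * math.log10(ratio)

# Usage: one full period of a sine wave.
sine = [math.sin(2.0 * math.pi * k / 1000) for k in range(1000)]
ratio, db = formfactor(sine)
print(round(ratio, 3), round(db, 2))   # ratio close to pi / (2 * sqrt(2)), i.e. about 1.11 or 0.9 dB
```

On the same reading, the ratio of 1.6 in Table I-A-1 corresponds to 20 log10(1.6), or about 4.1 dB.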
Table I-A-2. Distribution of formfactors. Columns: formfactor 3 dB, 4 dB, 5 dB. Rows: number of male subjects, number of female subjects, total number. Cell values as they survive in the source: 1 1 7 1 2 1 6 13 / 0.

The formfactor varies between 3 and 5 dB with a mean value of about 4 dB. This corresponds well to measurements by Fant, who found a value of 3.5 dB.

Peakfactor

The peakfactor, defined as

$\text{peakfactor} = \dfrac{\text{largest value during the observed time}}{\mathrm{RMS}}$

has been measured for 1 min long parts of speech and 10 sec long sections of the same speech. The results give a peakfactor of 16.5 ± 1.0 dB for the 1 min long parts, while the corresponding value for the sections of 10 sec was 17.0 ± 1.5 dB. The small difference indicates that the peakfactor does not vary much with durations longer than 10 seconds. Fant and Sjöholm (5) have measured a peakfactor of 17 dB, which conforms quite well with these results.

Long-time average speech spectrum

The 51-channel filterbank was used for measurements of the long-time average speech spectrum, including separate statistics for voiced and voiceless segments. The computer program involved a calculation of the mean value of sampled spectrum sections. The total range of the filterbank was 0-9.6 kHz. Constant filter bandwidths of B = 250 Hz and a mel-scale frequency display were chosen.

The detection of voiced and voiceless segments of speech employed one channel with normal high-frequency pre-emphasis (6 dB/oct, 200-5000 Hz) and one with low-pass filtering (330 Hz) of the signal, see Fig. I-A-6. Sound is detected when the output level of the pre-emphasized signal is higher than a threshold level, located ca 30 dB below the maximum intensity during the measured interval.
This is not exactly comparable, though, to the pause-maximum amplitude level mentioned above, as the weaker voiceless fricatives will be stronger because of the pre-emphasis. If the output of the LP-filtered signal is higher than an adjusted level, the sound is judged as voiced; if it is lower, the sound is judged as voiceless. When the conditions for the sound class that is to be measured are met, the filters of the filterbank are synchronously sampled, and the spectrum sections are added to an accumulative area in the computer storage to form the average spectrum. Maximum sampling rate is about 160 Hz.

The effects of the low cross-channel attenuation were eliminated by means of sound detection on the other channel (channel 2) than the one measured. When sound is detected there, no measurement is made. Otherwise, voiced segments of the speaker on this channel might be detected as voiceless on channel 1 and give a wrong average spectrum of voiceless segments.

Average spectra with no voiced-voiceless separation for one male and one female speaker that have been computed by this method are shown in Fig. I-A-7 and I-A-8. They are compared with spectra after Dunn & White (7) and French & Steinberg (8). There is no great difference to be found.

Separate spectra of voiced and voiceless segments of the same speakers are shown in Figs. I-A-9 and I-A-10. Voiced sounds, being more frequent and stronger than the unvoiced, dominate the spectrum of the total speech wave. This is especially the case for frequencies below 3 kHz. Over 3 kHz the spectrum falls more rapidly.
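The gating logic behind the separate voiced and voiceless averages can be summarized in a small sketch. It assumes each frame already carries three levels in dB (pre-emphasized channel, low-pass channel, and the other speaker's channel) plus one sampled spectrum section; the function name `average_spectrum`, the frame layout, the use of the same sound threshold on both channels, and the plain arithmetic averaging of the spectrum values are all assumptions made for illustration rather than details confirmed by the report.

```python
def average_spectrum(frames, target, sound_thr, voiced_thr):
    """Average the spectrum sections of frames that belong to `target`.

    frames     : iterable of (pre_level, lp_level, other_level, spectrum)
    target     : "voiced" or "voiceless"
    sound_thr  : level the pre-emphasized channel must exceed to count as sound
    voiced_thr : level the low-pass channel must exceed to count as voiced
    """
    acc = None                                  # accumulating area for the spectrum sections
    n = 0
    for pre, lp, other, spectrum in frames:
        if pre <= sound_thr:
            continue                            # no sound on the measured channel
        if other > sound_thr:
            continue                            # other speaker active: skip to avoid cross-talk
        label = "voiced" if lp > voiced_thr else "voiceless"
        if label != target:
            continue
        if acc is None:
            acc = [0.0] * len(spectrum)
        for i, value in enumerate(spectrum):
            acc[i] += value
        n += 1
    return [a / n for a in acc] if n else []

# Usage with two-channel toy spectra (levels in dB).
frames = [
    (-10, -5, -60, [1.0, 3.0]),    # voiced
    (-10, -40, -60, [5.0, 5.0]),   # voiceless
    (-10, -5, -60, [3.0, 5.0]),    # voiced
    (-50, -5, -60, [9.0, 9.0]),    # below sound threshold: ignored
    (-10, -5, -10, [9.0, 9.0]),    # other speaker active: ignored
]
print(average_spectrum(frames, "voiced", sound_thr=-30, voiced_thr=-20))
```

Skipping every frame in which the other channel is active trades measurement time for cleaner voiceless statistics, which matches the motivation given above.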
Fig. I-A-7. Long-time average spectrum. Male subject, M1. Curves: measured spectrum, M1; average spectrum after Dunn & White (male); idealized average spectrum after French & Steinberg. Frequency axis in kHz.

Fig. I-A-8. Long-time average spectrum. Female subject, K4. Curve: measured spectrum, K4. Frequency axis in kHz.

Fig. I-A-10. Long-time average spectrum. Separation voiced-voiceless. Voiced spectrum attenuated 10 dB. Female subject, K4. Frequency axis in kHz.

The voiceless sounds, which are characterized by lack of glottal pulses, include not only voiceless fricatives but also aspiratives, bursts after stops, and, more or less desirable, breathing sounds before sentences. These sounds and dark /sj/ will contribute to a spectrum shape deviating from the expected high-frequency emphasis. Fig. I-A-11 shows an average spectrum for the aspirative, whispered vowels a1, o1, u1, i1, y1, a3 and o1. The energy is mainly concentrated below 2 kHz. This finding is in agreement with the average spectrum of voiceless sounds in this region, because the aspiratives constitute a great part of these.

Spectra of voiceless sounds have an energy minimum between 3 and 4 kHz. Most probably, this depends on the fact that neither the aspiratives nor the fricatives contain any larger amount of energy at this frequency. The fricatives, where /s/ is the most frequent, have the main part of their energy above 4 kHz and the aspiratives below 2 kHz.

Average spectrum at different voice intensities

Two subjects made a 20-sec long utterance, the same for both, 4 times, varying the voice intensity in 10 dB steps between -10 and +20 dB rel. normal speech as measured by a VU-meter. Average spectra without voiced-voiceless separation were measured. The recording and playback levels were adjusted so that low-voice effort would give about the same output signal as high. This was done in order to get the same precision for all intensities.

The spectra are shown in Fig. I-A-12 and I-A-13. High-voice effort differs from low mainly by the spectrum slope below 1 kHz. For speaker J.S., the intensity falls from 300 to 1000 Hz ca 18 dB for -10 dB voice intensity and only ca 5 dB for +20 dB. The corresponding values for speaker K.E. are 35 and 10 dB, respectively. This effect is due to the voice source. At low-voice effort, the source spectrum falls more rapidly in this frequency region. The level in the fundamental frequency region is higher for low-voice intensity, which is an apparent feature of the figures. A contributing effect is also the increase of F0 at higher voice intensities.

For medium and high-voice effort, there is a distinct peak between 3 and 4 kHz. This is the so-called "singing formant". It is related to the sinus piriformis, a cavity around the upper part of the larynx. It is enlarged when the voice intensity increases, which probably causes new formants to be formed compared with normal speech.

Fig. I-A-11. Average spectrum. Whispered vowels. Speaker K.E.

Fig. I-A-12. Average spectrum at different voice effort. Male speaker J.S. Curves for -10, 0, +10 and +20 dB rel. normal speech level.

Fig. I-A-13. Average spectrum at different voice effort. Male speaker K.E. Curves for -10, 0, +10 and +20 dB rel. normal speech level.
A more thorough explanation is given by J. Sundberg in this issue of STL-QPSR.

References:

(1) G. Fant: "Acoustic Analysis and Synthesis of Speech with Applications to Swedish", Ericsson Technics, No. 1 (1959).
(2) G. Fant: "Speech Analysis and Synthesis", Royal Institute of Technology, Speech Transmission Laboratory Report No. 26 (1962), pp. 24-25.
(3) G. Fant: "Undersökning av 10 sek standardfras", L M Ericsson protokoll nr H/P-1051.
(4) O. Tosi: "Detection and Measurement of Pauses in Speech", American Speech and Hearing Association (1966).
(5) C. Sjöholm: "Amplitudfördelning för tal", Televerkets Provningsanstalt protokoll nr 243/67.
(6) M. Blomberg and K. Elenius: "Statistisk analys av talsignaler", Thesis work at the Department of Speech Communication, KTH, Stockholm (1970).
(7) H. K. Dunn and S. D. White: "Statistical Measurements on Conversational Speech", J. Acoust. Soc. Am. 11 (1940), pp. 278-288.
(8) N. R. French and J. C. Steinberg: "Factors Governing the Intelligibility of Speech Sounds", J. Acoust. Soc. Am. 19 (1947), pp. 90-119.
