Statistical analysis of speech signals

advertisement
Dept. for Speech, Music and Hearing
Quarterly Progress and
Status Report
Statistical analysis of speech
signals
Blomberg, M. and Elenius, K. O. E.
journal:
volume:
number:
year:
pages:
STL-QPSR
11
4
1970
001-008
http://www.speech.kth.se/qpsr
STL-QPSR 4/1970
I,
A'
SPEECH ANALYSIS
STATISTICAL ANALYSIS O F SPEECH SIGNALS
lif, Blomberg and I<. E l e n i u s
This i s a condensed r e p o r t of a t h e s i s study c a r r i e d out a t the D e p a r t m e n t of Speech Communication in 1970,
The purpose was to d e t e r m i n e ,
f o r continuous speech, peakfactor, f o r m f a c t o r , long-time a v e r a g e s p c c t r u m of voiced and v o i c e l e s s sections s e p a r a t e l y , s p e c t r a l density at diff e r e n t voice intensity l e v e l s r d i s t r i b ~ l t i o nof the speech-wave amplitude,
s t a t i s t i c s on pause lengths and long-time a v e r a g e RMS of the speech wave.
All t a s k s have been solved using the CDC computer of the Department.
Speech m a t e r i a l
Sixteen 1-min long sections of conversational speech w e r e r e c o r d e d with
the p a r t i c i p a n t s on s e p a r a t e channels of a twin-traclr t a p e - r e c o r d c r .
The
recording w a s m a d e by the Swedish Board of Telecommunications ( T e l e verket), F a r s t a .
A limitation of this m a t e r i a l was the relative low d e g r e e
of c r o s s - c h a n n e l attenuation, 25-30 dB.
This distortion was found to im-
p a i r the s t a t i s t i c s on pause lengths only.
F o r this p a r t i c u l a r study we r e -
corded another m a t e r i a l with higher intcr channel
attenuation.
Distribution of s ~ e e c h - w a v ea m ~ l i t u d c
The amplitude distribution of speech was m e a s u r e d with the equipment
shown in Fig. I-A- 1.
The computer s a m p l e s in r e a l time the rectified and
s h o r t - time (13 m s e c ) integrated speech wave and a s s e m b l e s the distribution
by adding 1 to the word in s t o r a g e that c o r r e s p o n d s to the quantized l e v e l of
the sample.
of 2 dB.
The distribution i s displayed a s a h i s t o g r a m with a resolution
Fig. I-A-2 shovrs a typical h i s t o g r a m of onc of the s a m p l e s of con-
v e r s a t i o n a l specch, e a r l i e r described.
The amplitude of the curve i s a
m e a s u r e of the probability that the signal level will have
the value given by
the x - a x i s .
The c u r v e shows two distinct maxima.
The one f o r higher signal l e v e l s
depends on the s p e a k e r that i s to be m e a s u r e d .
The oilc f o r lower l e v e l s i s
due to noise and the voice of the s p e a k e r on the o t h c r channel.
The distance
between the m a x i m a gives a m e a s u r e of the signal-to-noise r a t i o , o r a s in
t h i s c a s e , the attenuation between the channels which evidently i s 25-30 dB.
An exact m e a s u r e i s not possible, a s thc long-time RMS of the s p e a k e r s car,not be d e r i v e d f r o m the histogram.
b
7
RECOW
DER
a
'
k
LOG,
RECT.
INTEGR.
'ti=13ms
CONV.
>
COMPUTER
CDC 1700
A
Fig. I-A- I .
The equipment used to measure the distribution of the speech-wave amplitude.
SPEAKER K 2
Fig. I-A-2.
Histogram of the distribution of the speech-wave amplitude. Conversational speech.
STL-QPSR 4/1970
F o r a p p r o p r i a t e m e a s u r e m e n t s on pause lengths, it i s of importance
where the l e v e l of the pause threshold i s situated.
It should be s e t a s low
as possible without allowing noise to excced the threshold.
Fig. I-A-3 shows
a n amplitude h i s t o g r a m where only one spcal-,er i s talking.
The noise den-
sity m a x i m u m i s about 30 dB under the speech density maximum.
The
choice of the pause threshold l e v e l should be ca 15-20 dB undcr the speech
density maximuizl, \*/l~erc
the h i s t o g r a m h a s a minimum.
length i s then l e a s t dependent of the threshold level.
The total pause
This valuc: i s about
35 dB l e s s than the highest spcech l e v e l during the m e a s u r e d interval.
P a u s e -length s t a t i s t i c s
T h e r e a r c two c s sential p a r a m e t e r s that define the acoustic boundaries
of a p a u s e , namely, maximum amplitude within the pause and i t s minimum
duration.
The f o r m e r puts an upper l i x i t to the range of speech-wave ampli-
tudes which might c o m p r i s e a pause.
The l a t t e r fixe s the s m a l l e s t amount
of t i m e during which thcse acoustic amplitudes m u s t flow in o r d e r to continue
a pause.
O'iherwise inter-wave " g a p s t t , that i s , m o m e n t s of relatively l e s s
amplitude that usually a r e p r e s e n t in a complex speech wave, might be d e tected a s pauses.
The duration of the inter-wave "gap" i s a v e r y s h o r t one
and depends on the fundamental frequency, the p a r t i c u l a r shape of the wave
and the l e v e l of the pause-maximum amplitude.
It i s reasonable to s t a t e
that f o r complex waves of fundamental frequency of 100 Hz the duration of
the inter-wave "gaps" i s n e c e s s a r i l y s h o r t e r than the period, 10 m s e c .
P a u s e s , o r m o s t p a u s e s , l a s t m o r e than these durations(().
r a t e of the rectified spcech wave was I600 Hz.
The sampling
This should be enough to
exclude i n t e r -wave "gaps".
Automatic pause-length a n a l y s i s i s of c o u r s e effected by clicks, gurgitations, a i r - s t r e a m noise, and other a s s o r t e d sounds.
These sounds will s o m e -
tiriles excced the amplitude threshold and v ~ i l l ,a t w o r s t , ripple just about the
threshold level causing s e v e r a l s h o r t p a u s e s (usually 10 to 30 m s e c ) in a
speech segment, which otherwise would be considered a s being one pause,
since i t adds nothing to the speech intelligibility.
R a i s i n g the p a u s e - m a x i m u m
amplitude to eliminate t h e s e effects will cause the mealcest consonants to be
considered a s pauses.
However, f o r pausc lengths longer than approximately
30 m s e c the pause s t a t i s t i c s m o s t probably differ little f r o m those m e a s u r e d
with the effects above eliminated.
SPEAKER I.K.
Fig. I-A-3.
Histogram of the distribution of the speech-wave amplitude. Only one speaker.
0
0 100
Fig. I-A-4.
Pause-length statistics.
0200
12.5
- 250 msec.
sec
I
0
*
I
I
02
Fig. I-A-5.
I
.4
I
I
I
.8
-6
Pause-length statistics.
I
12.5
- 1000 msec.
I
sec
STL-QPSR 4/1970
Long-time m e a n value
The long-time m e a n value of the rectified speech wave (LMV) was m e a s u r e d by averaging the rectified and integrated specch signal. Together with
thc f o r m f a c t o r ( f ) it will give the long-time a v e r a g e
f =
RMS (LRMS), since
LRMS
Li\AV
In o r d e r to m e a s u r e the LMV f o r speech only and not f o r a l l p a u s e s , the following c r i t e r i o n h a s been used.
If the level of thc rectified and integrated
speech wave i s lower than a given threshold f o r a s h o r t e r t i m e than 100 m s e c
i t will be used in the calculation of LMV, while p a u s e s longer than 100 m s e c
will be neglected.
The boundary was chosen a t 100 m s e c a f t e r studying oc-
clusion lengths in continuous speech.
p a r a m e t e r was not c r i t i c a l .
L a t e r r n e a s u r e m c n t s showed that t h i s
The s a m e r e s u l t s w e r e obtained using the
boundaries 50 m s e c and 500 m s e c .
Likewise the o t h e r p a r a m e t e r , the
threshold level, was not c r i t i c a l e i t h e r .
Changing it by 20 dB did not ef -
f e c t the r e s u l t which, however, was quantized in s t e p s of 2 dB.
The value
of LRMS was l a t e r u s e 6 a s a r e f e r e n c e l e v e l when a speech controlled, loudspeaking telephone s y s t e m was simulated by the computer.
The r e s u l t s of
t h i s investigation will be d i s c u s s e d in a forthcoming r e p o r t .
Formfactor
The f o r m f a c t o r was derived in the following
The speech signal w a s
sampled and digitalized.
In r e a l time, the computer accumulates s a m p l e s
2
2nd c s l c u l a t e s the s u m s C xn and C xn. The f o r m f a c t o r i s then computed
x
n
f r o m the equation:
T1!ith this definition the f o r m f a c t o r of a sinc wave will be 1 dB.
P a u s e s in
speech w e r e t r e a t e d the s a m e way a s when calculating LMV ( s e e above).
The sampling frequency u s e d was 10 kHz.
securing a complete waveform reconstruction.
This i s somewhat too low f o r
However, it can be shown ( 6 )
that the f o r m f a c t o r can be derived f r o m the s t a t i s t i c a l distribution of the
signal level.
The problem i s then reduced to choosing a sampling density
that allows a sufficiently good approximation to the distribution.
dition i s e a s i e r to meet.
This con-
STL-QPSR 4/19'70
5.
M e a s u r e m e n t s on conversational speech of 1 m i n length varying the
sampling frequency between 50 and 10.000 H z have verified this. See
Table I-A- 1.
Table I-A- I. F o r m f a c t o r a t different sampling
frequencies. The f o r m f a c t o r of
e a c h s p e a k c r was determined. The
r e s u l t s a r c shown in Table I-A-2.
r
1
Sampling frequency
50
500
5000
Formfactor
1.4
1.6
1.6
10000
Hz
1.6
6
Table I-A-2.
/
I
Distribution of f o r m f a c t o r s
1
Formfactor
Number of m a l e subjects
"
"female
I
/
3dBI 4dB
1
1
7
1
2
1
6
13
/
0
I
,I
i
It
Total number
5dB
,
1
&
The f o r m f a c t o r v a r i e s between 3 and 5 d 3 with a m e a n value of about 4 dB.
This c o r r c s p o n d s well to measurements by
ant'^),
who found a value of
3 . 5 dB.
Peakfnctor
The peakfactor, defined a s
l a r g e s t value during the o b s e r v e d t i m e
= RMS
- 11 h a s been m e a s u r e d f o r 1 m i n long p a r t s of speech and 10 s e c long sections
of the s a m e speech.
The r e s u l t s give a peakfactor of 16.5 f 1.0 dB f o r the
1 m i n long p a r t s , while the corresponding value f o r the sections of 10 s e c
was 17.0
+-
1.5 dB.
The s m a l l difference indicates that the peakfactor d o e s
not v a r y much with durations longer than 10 seconds.
=ant(') and ~ ~ 6 h o l mhave
( ~ ) m c a s u r c d a pealcfactor of 17 dB, which conf o r m s quite well with these r e sults.
STL-QPSR 4/1970
Long-time a v e r a g e speech s p e c t r u m
The 5 1-channel filterbank w a s u s e d f o r mea s u r e m e n t s of long- t i m e
a v e r a g e speech s p e c t r u m , including s e p a r a t e s t a t i s t i c s f o r voiced and
v o i c e l e s s se gmcnts.
The computer p r o g r a m involved a calculation of thc:
m e a n value of sampled s p c c t r u m sections.
The total range of the filterbank was 0 -9.6 kHz.
Constant f i l t e r band-
widths of B = 250 Hz and a m e l - s c a l e frequency display was choscx.
The
detection of voiced and voiceleas segments of speech employed one channel
with n o r m a l high-frequency p r e - e m p h a s i s (6 d ~ / o c t , 200-5000 HZ) and one
with l o w - p a s s filtering (330 Hz) of the signal, s e e F i g . I-A-6.
Sound i s
detected when the output level of the p r e - e i n p h a s i z e d signal is higher than
a threshold l e v e l , located c a 30 dB bciow the m a x i m u m intensity during
the m e a s u r e d interval.
This i s not exactly comparable, though, to the
pause-maximum amplitude level mentioned above, a s the weaker v o i c c l e s s
f r i c a t i v e s will be s t r o n g e r because of the p r e - e m p h a s i s .
If the output of the
LP - f i l t e r e d signal i s higher r e s p . lower than a n adjusted level, the sound i s
judged a s voiced r e s p . voiceless.
TTThenthe condition f o r the sound c l a s s
that i s to be m e a s u r e d a r e m e t , the f i l t e r s of the filterbanlc a r e synchronousl y sampled, and the s p e c t r u m sections a r e added to a n accumulative a r e a in
the computer s t o r a g e to f o r m the a v e r a g e s p c c t r u m .
lMaximum sampling
r a t e i s about 160 Hz.
The effects of the low c r o s s - c h a n n e l attenuation w e r e eliminated by
m e a n s of sound detection o:l the o t h e r channel (channel 2 ) than the one m e a s ured.
V h c n sound i s detected no m e a s u r e m e n t i s made.
Otherwise, voiced
segments of the spealcer on this channel might be detected a s v o i c e l e s s on
channel 1 and give a wrong a v e r a g e s p e c t r u m of v o i c e l e s s segments.
Average s p e c t r a with no s e p a r a t i o n voiced-voiceless f o r one m a l e and
one female s p e a k e r that have been computed by t h i s method a r e shown in
Fig. I-A-7 and I-A-8.
They a r e c o m p a r e d with s p e c t r a a f t e r Dunn & White (7
and F r e n c h & steinberg(').
T h e r e i s no g r e a t difference to be found.
Separate s p e c t r a of voiced and v o i c c l e s s s e g m e n t s of the s a m e s p e a k e r s
a r e shown in F i g s . I-A-9 and I-A-10.
Voiced sounds, being m o r e frcqucnt
and s t r o n g e r than the unvoiced, dominate the s p e c t r u m of the total speech
wave. Especially, t h i s i s the c a s e f o r f r e q u e n c i e s below 3 kHz. Over 3 kHz
s p e c t r u m f a l l s m o r e rapidly.
-------.-.-.-
Fig. I-A-7.
MEASURED SPECTRUM. M I
AVERAGE SPECTRUM AFTER DUNN & WHITE (MALE)
IDEALIZED AVERAGE SPECTRUM AFTER FRENCH & STEINBERG
I
I
I
I
3
4
6
8 kHz
Long-time average spectrum.
Male subject, M I .
40 -
MEASURED SPECTRUM, K4
Fig. I-A-8.
I
I
I
I
3
4
6
8 kHz
Long-time average spectrum.
Female subject, K4.
-
--.-..
'7
SPEAKER
-
Fig. I-A- 10.
K4
I
I
I
I
3
4
6
8
Long-time average spectrum. Separation voiced-voiceles~.
Voiced spectrum attenuated 10 dB. Female subject, K4.
kHz
The v o i c e l e s s sounds, yvhich a r e c h a r a c t e r i z e d by l a c k of glottal pulses,
include not only v o i c c l c s s f r i c a t i v e s but a l s o a s p i r a t i v e s , b u r s t s a f t e r stops,
and, m o r e o r l e s s d e s i r a b l e , brca'hing
sounds before sentences.
These
sounds and d a r k / s j / will contrihutc to a s p e c t r u m shape deviating f r o m the
expected high-frequency e m p h a s i s .
F i g . I-A- 1 I shows a n a v e r a g e s p e c t r u m f o r the a s p i r a t i v e , whispered
and o 1' The e n e r g y i s mainly concentrated below 2 kHz. This finding j s in a g r e e m e n t with the a v e r a g e s p e c -
vowels a I , O1,
U i ~
i!,
Y i , a3
9
t r u m of v o i c e l e s s sounds In this r e s i o n , because the a s p i r a t i v e s constitute
a g r e a t p a r t of these.
S p e c t r a of voiceless sounds have a n e n e r g y minimum between 3 and 4 kHz.
Most probably, this depends on the fact that neithcr the a s p i r a t i v e s n o r the
f r i c a t i v e s contain any l a r g e r amount of c n e r g y a t t h i s frequency.
The f r i c a -
t i v e s , where / s / i s the m o , r t f r c c j ~ ~ c have
~ t , the m a i n p a r t of t h e i r e n e r g y
above 4 kHz and the a ~ p i ~ a t i v belaw
cs
2 kHz.
Average s p e c t r u m a t different voice intensities
Two subjects made a 2 0 - s e c long u t t e r a n c e , the s a m e f o r both, 4 t i m e s ,
varying the voice intensity in 10 clB s t e p s between
speech a s m e a s u r e d by a VU-meter.
l e s s separation were measured.
- 10 and
f 2 0 dB r e l . n o r m a l
Average s p e c t r a without voiced-voice-
The r e c o r d i n g and playback l e v e l s w e r e
adjusted s o that low-voicc effort would give about the s a m e output signal a s
high.
This was done in o r d e r to get the s a m e p r e c i s i o n f o r a l l intensities.
The s p e c t r a a r e shown in Fig. I-A- 12 and I-A- 13.
High-voice effort
differs f r o m iow rilainly by the s p e c t r u m slope below 1 kHz.
J. S . , the intensity f a l l r f r o m 300 to 1000 Hz ca 18 dB f o r
sity and only c a 5 dB f o r t 2 0 dB.
a r e 35 r e s p . 10 dB.
F o r speaker
- 10 dB voice
inten-
The corresponding values f o r s p e a k e r K.E.
This effect i s due to the voice source.
At low-voice
effort, s o u r c e s p e c t r u m f a l l s m o r e rapidly in this frequency region(').
The
level in the fundamental frcqucncy region j.s higher f o r low-voice intensity,
which i s a n a p p a r e n t f e a t u r e of the f i g u r e s .
A combining effect i s a l s o the
i n c r e a s e of F O a t higher voice intensities.
F o r medium and h i g h - v ~ i c eeffort, t h e r e i s a distinct peak between 3 and
4 kHz.
This i s the s o - c a l l e d "singing f o r m a n t t ' . It i s r e l a t e d to the simcls
p i r i f o r m i s , a cavity around the upper p a r t of larynx.
It i s e n l a r g e d when the
SPEAKER K.E.
Fig. I-A- II. Average spectrum. Whispered vowels.
I
I
6
8
kHz
-. ...- -#)dB
------. _ . _
REL. NORMAL SPEECH LEVEL
-"OdB
*1OdB
"
-
--
*2OdB
*a
SPEAKER J.S.
10
#
I
2
Fig. I-A- 12.
4
Average spectrum at different voice effort.
6
8 kHz
Male speaker J. S.
-.. -..-..-
--------
SPEECH LEVEL
-10dB REL. NORMAL
OdB
8
*10dB
8-
+2OdB
I
.3.-.-
I
S
-
-
SPEAKER K.E.
8 kHz
Fig. I-A-13.
Average spectrum at different voice effort.
Male speaker K. E.
C.
STL-QPSR 4/i970
voice intensity i n c r c a s c s , which probably c a u s c s ncw for=ants to be f o r m c d
compared with n o r m a l spccch.
A marc thorough explanation i s m a d e by
J . Sundbcrg in this i s s u c of STL-QPSR.
Rcfcrcnccs:
( i ) G. F a n t : "Acoustic Analysis and Synthesis of Spcech with Applications
to Swedish", E r i c s s o n Tcchnics, No. 1 (1959).
(2) G. F a n t : "Spccch Analysis and Synthc sisI1, R s y z l Llstitutc of Tc chnology,
Speech Transrilission L a b o r a t o r y R c p o r t No. 26(1962), pp. 24-25.
(3)
G. F a n t : "Undersokning a v 10 scli s t a n d a r d f r a s " , L M E r i c s s o n
p r o t o l ~ o l ln r H/P- 105 1.
(4) 0. Tosi: "Detection and M e a s u r c m c n t of P a u s c s in Spcech", A m e r i c a n
Spccch and Hearing Association ( 1966).
(5) C. SjZSholm: "Arxplitudfordelning f o r tallt, T c l c v c r k c t s Provnings a n s t a l t protokoll n r 243/67.
(6) hl. Blomberg and K. Elenius: "Statistisk analys zv taisignaler",
T h c s i s v~orl:a t the Departmcnt of Spccch Communication,
liTH, Stoclcholm (1970).
(7) H. IS. Dunn and S. D. Vhitc: "Statistical Measurements on C o n v c r s a tional Speech1', J . Acoust. Soc. Am. 11 (1940), pp. 278-288.
(8)
N. R. F r e n c h and J . C. Stcinberg: " F a c t o r s Governing thc Intclligibility
of Spc c c h Sounds", J. ~ c o u s t : ~ o c~.1 1 1 . 19 ( 1 9 4 7 ) ~pp. 90 - 119.
Download