
SPEECH ANALYSIS
ChenYan Li
University of Oslo
The project was carried out in the DSB-lab at IFI.
Several people speak into a microphone in turn. Their voices are recorded, converted to digital signals, and saved in MATLAB as sample waveforms. Features of each voice are found by detecting the pitch, and these features are then used to decide the gender of the speaker.
There are two ways to analyze the speech signal: time-domain analysis and frequency-domain analysis. First, I use time-domain analysis, measuring acoustic parameters that vary over the duration of the utterance; the relevant parameters are the short-time energy and the short-time autocorrelation:

E_i = Σ_{n=N1}^{N2} s(n)^2        R_i(k) = Σ_{n=N1}^{N2-k} s(n) s(n+k)

where N1 and N2 are the frame boundaries and i is a frame index. The basic idea of short-time processing is that short segments of the speech signal are isolated and processed as if they were segments of a sustained sound with fixed properties; these short segments are called frames. The autocorrelation estimates the fundamental frequency directly from the waveform: computed over a section of signal, it shows how well the waveform correlates with itself at a range of different delays, and the pitch period is estimated from the delay at which the peak amplitude occurs. The frame is shifted through the signal with a frame size of 30 ms, and subsequent frames overlap by 50%.
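As an illustration of this framing and autocorrelation peak-picking, here is a minimal sketch in Python (the project itself was implemented in MATLAB); the function names, the use of NumPy, and the 50-400 Hz pitch search range are my own assumptions for the example, not part of the original code.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=30, overlap=0.5):
    """Split the signal into 30 ms frames with 50% overlap between subsequent frames."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def autocorr_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Estimate Fx of one frame from the peak of its short-time autocorrelation."""
    frame = frame - frame.mean()
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0 .. N-1
    kmin, kmax = int(fs / fmax), int(fs / fmin)   # lag range for the assumed 50-400 Hz pitch range
    k = kmin + np.argmax(r[kmin:kmax])            # lag of the autocorrelation peak = pitch period
    return fs / k, r[k] / r[0]                    # Fx estimate and normalised voicing measure
```

Applied frame by frame, this yields one Fx candidate and one voicing value per 15 ms hop.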
For the frequency-domain method, a short-duration spectrum of the speech signal is produced with the Fast Fourier Transform; taking the logarithm of the absolute value of the spectrum and then the inverse FFT of this log spectrum gives the cepstrum. To obtain an estimate of the fundamental frequency from the cepstrum, I look for a peak in the quefrency range corresponding to typical speech fundamental frequencies.
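A corresponding cepstral sketch, under the same assumptions (Python, illustrative 50-400 Hz range), could look as follows: the log-magnitude spectrum of a windowed frame is inverse-transformed and the peak is picked in the quefrency range of typical speech fundamentals.

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Estimate Fx of one frame from the peak of its real cepstrum."""
    spectrum = np.fft.fft(frame * np.hamming(len(frame)))   # short-duration spectrum via FFT
    log_mag = np.log(np.abs(spectrum) + 1e-12)              # log of the absolute spectrum
    cepstrum = np.real(np.fft.ifft(log_mag))                # inverse FFT of the log spectrum
    qmin, qmax = int(fs / fmax), int(fs / fmin)             # quefrencies for 50-400 Hz
    q = qmin + np.argmax(cepstrum[qmin:qmax])               # peak quefrency = pitch period in samples
    return fs / q
```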
For each frame I calculate the energy Ei of the speech signal and the short-time autocorrelation, which also serves as a voicing parameter. Where the signal is voiced, the autocorrelation coefficient is large at a time lag corresponding to one period of the glottal vibration. Where Ei is greater than an energy threshold and the autocorrelation is greater than a voicing threshold, I determine Fx for that frame from the time lag k at which the autocorrelation maximum occurs.
The upper two figures show the time-domain analysis of the words "one, two, three, four…" spoken by two male speakers. The first panel shows the waveform of the speech signal, and the second panel shows the fundamental frequency for those frames where both the energy and the autocorrelation are above their respective thresholds. The third panel shows the short-time energy and the fourth panel shows the autocorrelation. The left figure shows that the Fx trace contains a lot of noise; adjusting the threshold of the voicing parameter did not improve it, so I tried the other approach, the frequency-domain method, which gives the better plot shown in the lower-left figure.
I remove the Fx values from the frames that fail these criteria and plot the remaining traces in one figure (upper right) as a function of time. The difference in pitch frequency between the male and female speakers is then easy to see: the pitch frequency of the male speakers is around 100 Hz, while the pitch frequency of the female speakers is around 250 Hz.
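Given such an Fx track, a simple gender decision can be sketched from the observed ranges (around 100 Hz for male and around 250 Hz for female voices); the 165 Hz boundary below is an illustrative midpoint of my own, not a value taken from the project.

```python
import numpy as np

def decide_gender(fx_values, boundary_hz=165.0):
    """Classify a speaker from the median Fx of the voiced frames."""
    voiced = fx_values[~np.isnan(fx_values)]
    if voiced.size == 0:
        return "unknown"                 # no voiced frames were found
    return "male" if np.median(voiced) < boundary_hz else "female"
```

For example, decide_gender(fx_track(x, fs)) would report "male" for a recording whose voiced frames cluster near 100 Hz.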
This system can detect the pitch frequency and make a decision about the gender of the speaker, but it cannot distinguish falsetto from true (modal) voice, and in that case it will make a wrong decision. The time-domain method makes a good decision about the gender of the speakers, but some of the frequency results contain noise. The frequency-domain method removes some of this noise, but it is more useful for high pitch frequencies. In the future, the two methods could be combined to reach a better decision about the gender of the speakers.