Basic Features of Audio Signals
( 音訊的基本特徵 )
Jyh-Shing Roger Jang ( 張智星 ) http://mirlab.org/jang
MIR Lab, CSIE Dept
National Taiwan Univ., Taiwan
Four commonly used audio features
Volume, pitch, timbre, zero crossing rate
Our goal
These features can be perceived (more or less) subjectively.
Our goal is to compute them quantitatively (and objectively) for further processing and recognition.
1. Frame blocking
Frame duration of 20~40 ms or so
2. Frame-based feature extraction
Volume, zero-crossing rate, pitch, MFCC, etc.
3. Frame-based analysis
Pitch vector for QBSH comparison
MFCC for HMM evaluation
…
[Figure: waveform (amplitude vs. sample index) divided into overlapping frames, with the overlap between neighboring frames marked]
Quiz candidate!
Sample rate = 16 kHz
Frame size = 512 samples
[Figure: zoomed-in view of a single frame]
Frame duration = 512/16000 = 0.032 s = 32 ms
Overlap = 192 samples
Hop size = frame size – overlap = 512-192 = 320 samples
Frame rate = 16000/320 = 50 frames/sec
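The arithmetic above can be checked with a minimal sketch. It is written in Python rather than the MATLAB used elsewhere in these slides, and the helper names `frame_params` and `split_into_frames` are hypothetical, not part of any toolbox. Dropping the trailing samples that do not fill a whole frame is an assumption of this sketch.

```python
def frame_params(fs, frame_size, overlap):
    """Return frame duration (s), hop size (samples), and frame rate (frames/s)."""
    hop_size = frame_size - overlap
    return {
        "frame_duration": frame_size / fs,
        "hop_size": hop_size,
        "frame_rate": fs / hop_size,
    }


def split_into_frames(signal, frame_size, overlap):
    """Split a list of samples into overlapping frames.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    hop_size = frame_size - overlap
    if hop_size <= 0:
        raise ValueError("overlap must be smaller than frame_size")
    last_start = len(signal) - frame_size
    return [signal[start:start + frame_size]
            for start in range(0, last_start + 1, hop_size)]


params = frame_params(fs=16000, frame_size=512, overlap=192)
frames = split_into_frames(list(range(1152)), frame_size=512, overlap=192)
```

With the slide's numbers, `params` holds a frame duration of 0.032 s, a hop size of 320 samples, and a frame rate of 50 frames/sec.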
Three of the most prominent time-domain audio features in a frame (also known as an analysis window)
Quiz candidate!
[Figure: waveform of taiwan.wav (amplitude vs. sample index) and a zoomed-in frame (amplitude vs. sample index within the frame)]
Intensity
Timbre: Waveform within a fundamental period (FP)
Frequency-domain audio features in a frame
Energy: Sum of power spectrum
Pitch: Distance between harmonics
Timbre: Smoothed spectrum
[Figure: spectrum of a frame, annotated with energy, pitch frequency, first formant (F1), and second formant (F2)]
For simplicity, we usually pack frames into a matrix for easy manipulation in MATLAB:
% Read the samples y and the sample rate fs
[y, fs] = audioread('file.wav');
% Split y into overlapping frames
frameMat = enframe(y, frameSize, overlap);
frameMat = …
Loudness of audio signals
Visual cue: Amplitude of vibration
Also known as energy or intensity
Two major ways of computing volume:
Volume: $vol = \sum_{i=1}^{n} |s_i|$
Log energy (in decibels): $energy = 10 \log_{10} \sum_{i=1}^{n} s_i^2$
Quiz candidate!
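The two volume definitions can be sketched as follows. This is Python rather than the MATLAB used in these slides, and the function names are hypothetical. Returning negative infinity for an all-zero frame is an assumption of this sketch, made to avoid taking the logarithm of zero.

```python
import math


def abs_sum_volume(frame):
    """Volume as the sum of absolute sample values in the frame."""
    return sum(abs(s) for s in frame)


def log_energy_db(frame):
    """Log energy in decibels: 10*log10 of the sum of squared samples.

    Assumption: an all-zero frame returns -inf instead of raising an error.
    """
    energy = sum(s * s for s in frame)
    return 10 * math.log10(energy) if energy > 0 else float("-inf")


example_frame = [0.5, -0.5, 0.25]
vol = abs_sum_volume(example_frame)
# Ten samples of +1 or -1 have a summed energy of 10
energy_db = log_energy_db([1.0, -1.0] * 5)
```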
Perceived volume is influenced by
Frequency (example shown later)
Timbre (example shown later)
Computed volume is influenced by
Microphone types
Microphone setups
To avoid DC bias (or DC drifting)
DC bias: The vibration is not around zero
Computation:
Volume: $vol = \sum_{i=1}^{n} |s_i - \mathrm{median}(s)|$
Log energy (in decibels): $energy = 10 \log_{10} \sum_{i=1}^{n} (s_i - \mathrm{mean}(s))^2$
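A minimal Python sketch of why zero-justification matters: a constant offset inflates the raw volume, while subtracting the median (for the absolute sum) or the mean (for the log energy) removes it. The function names, the example frame, and the 0.5 offset are all hypothetical.

```python
import math
from statistics import mean, median


def volume_zero_justified(frame):
    """Sum of absolute deviations from the frame median."""
    m = median(frame)
    return sum(abs(s - m) for s in frame)


def log_energy_zero_justified(frame):
    """10*log10 of the sum of squared deviations from the frame mean.

    Assumption: a constant frame returns -inf instead of raising an error.
    """
    mu = mean(frame)
    energy = sum((s - mu) ** 2 for s in frame)
    return 10 * math.log10(energy) if energy > 0 else float("-inf")


clean = [0.1, -0.1, 0.2, -0.2, 0.0]
biased = [s + 0.5 for s in clean]  # same vibration with a DC offset of 0.5

raw_biased_volume = sum(abs(s) for s in biased)  # inflated by the offset
fixed_biased_volume = volume_zero_justified(biased)
fixed_clean_volume = volume_zero_justified(clean)
```

Here the raw volume of the biased frame is about 2.5, while both zero-justified volumes are about 0.6.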
Theoretical background (How to prove them?)
$\mathrm{median}(s_1, s_2, \ldots, s_n) = \arg\min_x \sum_{i=1}^{n} |s_i - x|$
$\mathrm{mean}(s_1, s_2, \ldots, s_n) = \arg\min_x \sum_{i=1}^{n} (s_i - x)^2$
Quiz candidate!
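One possible line of argument for the two minimization properties, given as a sketch rather than a complete proof. The function names $f$ and $g$ are introduced here only for illustration.

```latex
% Mean: minimize the sum of squared deviations
f(x) = \sum_{i=1}^{n} (s_i - x)^2, \qquad
f'(x) = -2 \sum_{i=1}^{n} (s_i - x) = 0
\;\Longrightarrow\;
x = \frac{1}{n} \sum_{i=1}^{n} s_i, \qquad
f''(x) = 2n > 0 .

% Median: minimize the sum of absolute deviations
g(x) = \sum_{i=1}^{n} |s_i - x|, \qquad
g'(x) = \#\{i : s_i < x\} - \#\{i : s_i > x\}
\quad \text{for } x \notin \{s_1, \ldots, s_n\} .
% g is convex and piecewise linear, so it is minimized where g' changes sign,
% i.e. where as many samples lie below x as above it: the median.
% For even n, every point between the two middle samples is a minimizer.
```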
Functions for computing volume
Example: volume01
Example: volume02
Example: volume03
Volume depends on…
Frequency
Equal loudness test
Timbre
Example: volume04
Zero crossing rate (ZCR)
The number of zero crossings in a frame.
Characteristics:
ZCR is higher for noise and unvoiced sounds, lower for voiced sounds.
Zero-justification is required before computing ZCR.
Usage
For endpoint detection, especially in detecting the start and end of unvoiced sounds.
To distinguish noise from unvoiced sounds, we usually add a shift before computing ZCR.
Quiz candidate!
Two types of ZCR definitions
If a sample with zero value is counted as a zero crossing, then the ZCR is higher.
Otherwise it is lower.
The distinction diminishes when using a higher bit resolution.
Other considerations
ZCR with shift can be used to distinguish between unvoiced sounds and silence.
But it is hard to set up the right shift amount.
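The two ZCR definitions and the optional shift can be sketched as below. This is Python rather than MATLAB, and the function name and parameters are hypothetical. Zero-justifying by subtracting the frame mean and the direction in which the shift is applied are assumptions of this sketch.

```python
def zero_crossing_rate(frame, count_zero_samples=False, shift=0.0):
    """Count zero crossings between neighboring samples of a frame.

    Assumptions: the frame is zero-justified by subtracting its mean,
    and the shift is then subtracted from every sample.
    """
    mean_value = sum(frame) / len(frame)
    shifted = [s - mean_value - shift for s in frame]
    crossings = 0
    for prev, cur in zip(shifted, shifted[1:]):
        product = prev * cur
        # Strict definition: only a sign change counts.
        # Inclusive definition: a pair containing a zero sample also counts.
        if product < 0 or (count_zero_samples and product == 0):
            crossings += 1
    return crossings


strict = zero_crossing_rate([1.0, 0.0, -1.0])
inclusive = zero_crossing_rate([1.0, 0.0, -1.0], count_zero_samples=True)

low_level_noise = [0.01, -0.01, 0.01, -0.01]
without_shift = zero_crossing_rate(low_level_noise)
with_shift = zero_crossing_rate(low_level_noise, shift=0.05)
```

The inclusive definition yields the higher count on the first frame (2 vs. 0). On the second frame, a shift larger than the noise amplitude drops the count from 3 to 0. The shift value 0.05 is arbitrary, which reflects the difficulty of choosing a suitable shift.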
ZCR computing
Example: zcr01
Example: zcr02
To use ZCR to distinguish between unvoiced sounds and environmental noise
Example: zcrWithShift
Definition
Pitch is also known as fundamental frequency, which is equal to the number of fundamental periods per second. The unit used here is Hertz (Hz).
Unit
More commonly, pitch is expressed in semitones, which can be converted from pitch in Hertz: $semitone = 69 + 12 \log_2(Hz/440)$
Quiz candidate!
Piano roll via HTML5
Pitch of tuning forks ( code )
Quiz candidate!
$fp = (189 - 7)/5/16000 = 0.002275$ sec
$ff = 1/fp = 439.56$ Hz
$pitch = 69 + 12 \log_2(ff/440) = 68.9827$ semitones
Pitch of speech ( code )
Quiz candidate!
$fp = (477 - 75)/3/16000 = 0.008375$ sec
$ff = 1/fp = 119.403$ Hz
$pitch = 69 + 12 \log_2(ff/440) = 46.42$ semitones
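Both worked examples follow the same steps: measure the sample span of a few fundamental periods, divide by the number of periods and by the sample rate, invert to get frequency, then convert to semitones. A minimal Python sketch, with hypothetical function names:

```python
import math


def hz_to_semitone(freq_hz):
    """Convert frequency in Hz to semitones, with 440 Hz mapped to 69."""
    return 69 + 12 * math.log2(freq_hz / 440)


def pitch_from_periods(start_index, end_index, num_periods, fs):
    """Estimate pitch from the sample span covered by num_periods periods.

    Returns (fundamental period in s, fundamental frequency in Hz, semitones).
    """
    fp = (end_index - start_index) / num_periods / fs
    ff = 1 / fp
    return fp, ff, hz_to_semitone(ff)


# Tuning fork: 5 periods between sample 7 and sample 189 at 16 kHz
fork_fp, fork_ff, fork_pitch = pitch_from_periods(7, 189, 5, 16000)
# Speech: 3 periods between sample 75 and sample 477 at 16 kHz
speech_fp, speech_ff, speech_pitch = pitch_from_periods(75, 477, 3, 16000)
```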
Some statistics about Mandarin Chinese
5401 characters, each character is associated with at least one base syllable and one tone
411 base syllables, and most syllables have 4 tones, so we have 1501 tonal syllables
Syllables with 3 or fewer tones
媽麻馬罵 (mā má mǎ mà)、當檔蕩 (dāng dǎng dàng)、嗲 (diǎ)
More examples
1234: 三民主義、三國演義、優柔寡斷
?????: 美麗大教堂、滷蛋有夠鹹 (Taiwanese)
Tone sandhi: 勇猛果敢
Tone is characterized by the pitch curves:
Tone 1: high-high
Tone 2: low-high
Tone 3: high-low-high
Tone 4: high-low
Quiz candidate!
(Put your hand on your throat and you can feel it…)
Tone recognition is mostly based on features obtained from pitch and volume
TTS: Text to speech ( demo )
Tone sandhi: a phonological change occurring in tonal languages
3+3 → 2+3: 總統、總統府、李總統、母老虎、膽小鬼
不: 不好、不難 vs. 不對、不妙
一: 一個、一次、一半 vs. 一般、一毛、一會兒
Sandhi combinations in disyllabic words
Tone Sandhi of 3+3
請老李給我買五把好雨傘
老李買好酒請馬小姐買幾百把小雨傘
總統府裏的李總統有點想請我買酒
北海只有兩里遠,水也很淺
展覽館北館有好幾百種展覽品
你早晚打掃,我啃水果咬水餃
我很了解你,我倆永遠友好
水管可以點火,趕緊買保險
Quiz candidate!
If audio is played at a higher sample rate…
Pitch is higher
Duration is shorter
Pitch change due to sample rate change at playback
Sample rate: fs → k*fs (at playback)
Duration: d → d/k
Fundamental frequency: ff → k*ff
Pitch: pitch → $pitch + 12 \log_2(k)$
Quiz candidate!
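The effect of changing the playback sample rate can be sketched in a few lines of Python. The function name `playback_at_rate` and the example values are hypothetical.

```python
import math


def playback_at_rate(duration, ff, k):
    """Effect of playing audio recorded at fs back at k*fs.

    Returns (new duration, new fundamental frequency, pitch shift in semitones).
    """
    return duration / k, k * ff, 12 * math.log2(k)


# Playing a 3-second, 220 Hz recording at twice its sample rate
new_duration, new_ff, semitone_shift = playback_at_rate(3.0, 220.0, 2)
```

Doubling the playback rate halves the duration to 1.5 s, doubles the fundamental frequency to 440 Hz, and raises the pitch by 12 semitones (one octave).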
Age-related hearing loss
As one grows older, the audible frequency bandwidth narrows
Mosquito ringtone
Low to high, high to low
Applications
Frequencies vs. ages
[Figure: plot of frequency (kHz) against age; labeled values include 21k, 17.4k, 15k, 12k, and 8k]
Some interesting phenomena about pitch
Beat
Quiz candidate!
Doppler effect
Shepard tone: an auditory illusion of a tone that continually ascends or descends in pitch
Quiz candidate!
How to create these effects in MATLAB?
Overtone singing
Timbre is represented by
Waveform within a fundamental period
Frame-based energy distribution over frequencies
Power spectrum (over a single frame)
Spectrogram (over many frames)
Frame-based MFCC (mel-frequency cepstral coefficients)
Simulink model for real-time display of spectrogram
dspstfft_audio (Before MATLAB R2011a)
dspstfft_audioInput (R2012a or later)
[Figures: spectrum and spectrogram]