Basic Features of Audio Signals

Jyh-Shing Roger Jang (張智星)
http://www.cs.nthu.edu.tw/~jang
MIR Lab, CS Dept, Tsing Hua Univ. Hsinchu, Taiwan

Audio Features
- Four commonly used audio features
  - Volume
  - Pitch
  - Zero crossing rate
  - Timbre
- Our goal
  - These features can be perceived subjectively.
  - But we need to compute them quantitatively for further processing and recognition.

Audio Features in Time Domain
- Audio features presented in the time domain
  - Fundamental period
  - Intensity
  - Timbre: waveform within a fundamental period (FP)

Audio Features in Frequency Domain
- Volume: magnitude of the spectrum
- Pitch: distance between harmonics
- Timbre: smoothed spectrum
- (Figure: spectrum annotated with intensity, pitch frequency, first formant F1, and second formant F2.)

Demo: Real-time Spectrogram
- Try “dspstfft_audio” under MATLAB. (Figures: spectrum display and spectrogram display.)

Steps for Audio Feature Extraction
- Frame blocking
  - Frame duration of 20 ms or so
- Feature extraction
  - Volume, zero-crossing rate, pitch, MFCC, etc.
- Endpoint detection
  - Usually based on volume and zero-crossing rate

Frame Blocking
- (Figure: waveform with overlapping frames marked, and a zoomed-in view of a single frame.)
- Sample rate = 11025 Hz
- Frame size = 256 samples
- Overlap = 84 samples (hop size = 256 - 84 = 172 samples)
- Frame rate = 11025/(256 - 84) ≈ 64 frames/sec

Intensity (I)
- Intensity
  - Visual cue: amplitude of vibration
  - Computation:
    - Volume: $vol = \sum_{i=1}^{n} |s_i|$
    - Log energy (in decibels): $energy = 10 \log_{10} \sum_{i=1}^{n} s_i^2$
- Characteristics
  - Influenced by microphone types
  - Influenced by microphone setups
  - Perceived volume is influenced by frequency and timbre

Intensity (II)
- To avoid DC drifting
  - DC drifting: the vibration is not centered around zero
  - Computation:
    - Volume: $vol = \sum_{i=1}^{n} |s_i - \mathrm{median}(\mathbf{s})|$
    - Log energy (in decibels): $energy = 10 \log_{10} \sum_{i=1}^{n} (s_i - \mathrm{mean}(\mathbf{s}))^2$
- Theoretical background (How to prove?)
  - $\mathbf{s} = [s_1, s_2, \ldots, s_n] \;\Rightarrow\; \arg\min_x \sum_{i=1}^{n} |s_i - x| = \mathrm{median}(\mathbf{s})$
  - $\mathbf{s} = [s_1, s_2, \ldots, s_n] \;\Rightarrow\; \arg\min_x \sum_{i=1}^{n} (s_i - x)^2 = \mathrm{mean}(\mathbf{s})$

Intensity (III)
- Examples
  - Please refer to the online tutorial.

Pitch
- Definition
  - Pitch is known as the fundamental frequency, which equals the number of fundamental periods per second.
  - The unit used here is Hertz (Hz).
  - More commonly, pitch is expressed in semitones, which can be converted from pitch in Hertz:
    $semitone = 69 + 12 \log_2 \left( \frac{\mathrm{Hz}}{440} \right)$

Pitch Computation (I)
- Pitch of tuning forks
  - $ff = \frac{16000}{(189 - 7)/5} = 439.56$ Hz
  - $pitch = 69 + 12 \log_2 \left( \frac{ff}{440} \right) = 68.98$ semitones

Pitch Computation (II)
- Pitch of speech
  - $ff = \frac{16000}{(477 - 75)/3} = 119.403$ Hz
  - $pitch = 69 + 12 \log_2 \left( \frac{ff}{440} \right) = 46.42$ semitones

Statistics of Mandarin Chinese
- 5401 characters, each associated with at least a base syllable and a tone
- 411 base syllables, and most syllables have 4 tones, so we have 1501 tonal syllables
- Tone is characterized by its pitch curve:
  - Tone 1: high-high
  - Tone 2: low-high
  - Tone 3: high-low-high
  - Tone 4: high-low
- Some examples of tones:
  - 1242: 清華大學 (Tsing Hua University)
  - 1234: 三民主義 (Three Principles of the People), 優柔寡斷 (indecisive), 搭達打大 (da in tones 1 to 4), 依宜以易 (yi in tones 1 to 4), 夫福府負 (fu in tones 1 to 4)
  - ?????: 美麗大教堂 (beautiful cathedral), 滷蛋有夠鹹 (Taiwanese: "the braised egg is really salty")

Sinusoidal Signals
- How to generate a stream of sinusoidal signals

    fs = 16000;                             % sample rate (Hz)
    duration = 3;                           % duration (sec)
    f = 440;                                % frequency of the sinusoid (Hz)
    t = (1:fs*duration)/fs;                 % time vector (sec)
    y = 0.8*sin(2*pi*f*t);                  % sinusoid with amplitude 0.8
    plot(t, y); axis([0.6, 0.65, -1, 1]);   % show a 50 ms excerpt
    sound(y, fs);                           % play the signal

Zero Crossing Rate
- Zero crossing rate (ZCR)
  - The number of zero crossings in a frame.
- Characteristics:
  - Noise and unvoiced sounds have high ZCR.
  - ZCR is commonly used in endpoint detection, especially in detecting the start and end of unvoiced sounds.
  - To distinguish noise/silence from unvoiced sound, we usually add a bias before computing ZCR.

ZCR Computations
- Two types of ZCR definition
  - If a sample with zero value is counted as a zero crossing, the ZCR is higher; otherwise it is lower.
  - The choice affects the ZCR, especially when the sample rate is low.
- Other considerations
  - Zero-justification is required.
  - ZCR with shift can be used to distinguish between unvoiced sounds and silence. (How to determine the shift amount?)
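
The two ZCR definitions and the shifted ZCR described above can be sketched in MATLAB as follows. This is a minimal illustration, not the tutorial's own code: the helper names zcrStrict and zcrLoose, the toy frames, and the shift value 0.05 are all assumptions chosen for the example, and zero-justification is taken here to mean subtracting the frame mean.

```matlab
% Minimal sketch of the two ZCR definitions (names and data are assumptions).
% A crossing is detected when the product of two adjacent samples is negative;
% also accepting a zero product counts zero-valued samples as crossings.
zcrStrict = @(s) sum(s(1:end-1) .* s(2:end) <  0);   % zero samples not counted
zcrLoose  = @(s) sum(s(1:end-1) .* s(2:end) <= 0);   % zero samples counted

% Toy integer-valued frame containing exact zeros (assumed data).
frame = [2 1 0 -1 -3 0 0 2 4 -5];
frame = frame - mean(frame);   % zero-justification, assumed to mean removing the mean (already 0 here)

fprintf('strict ZCR = %d, loose ZCR = %d\n', zcrStrict(frame), zcrLoose(frame));
% prints: strict ZCR = 1, loose ZCR = 6

% ZCR with shift: a small bias suppresses crossings caused by low-level noise.
shift    = 0.05;                 % assumed value; how to choose it is the open question on the slide
silence  = 0.01 * (-1).^(0:7);   % low-level alternating signal standing in for background noise
unvoiced = 0.20 * (-1).^(0:7);   % larger alternating signal standing in for an unvoiced sound

zcrSilence  = zcrStrict(silence  + shift);   % 0: every sample is now positive
zcrUnvoiced = zcrStrict(unvoiced + shift);   % 7: the signal still alternates in sign
```

Without the shift, both toy signals give the same ZCR of 7, which is why a bias helps separate silence from unvoiced sound in this sketch.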

ZCR Examples
- Please refer to the online tutorial.
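
As a small stand-in for a worked example, the sketch below applies the frame-blocking parameters from the Frame Blocking slide and computes the volume, log energy, and ZCR of each frame. The test signal is an assumption made up for illustration (a faint 4 kHz sinusoid as a deterministic substitute for hiss-like sound, followed by a 440 Hz tone, both with a DC offset), and the variable names are hypothetical.

```matlab
% Sketch: frame blocking followed by per-frame volume, log energy, and ZCR.
% Frame parameters come from the Frame Blocking slide; the test signal is assumed.
fs        = 11025;                 % sample rate (Hz)
frameSize = 256;                   % samples per frame
overlap   = 84;                    % samples shared by adjacent frames
hop       = frameSize - overlap;   % 172 samples between frame starts

% Assumed signal with a DC offset of 0.1: one second of a faint 4 kHz sinusoid,
% then one second of a 440 Hz tone.
t = (0:fs-1) / fs;
y = [0.01*sin(2*pi*4000*t), 0.8*sin(2*pi*440*t)] + 0.1;

frameCount = floor((length(y) - frameSize) / hop) + 1;   % only complete frames
vol        = zeros(1, frameCount);
logEnergy  = zeros(1, frameCount);
zcr        = zeros(1, frameCount);

for k = 1:frameCount
    startIdx = (k-1)*hop + 1;
    frame    = y(startIdx : startIdx + frameSize - 1);

    vol(k)       = sum(abs(frame - median(frame)));    % volume with the median removed
    centered     = frame - mean(frame);                % remove the DC drift
    logEnergy(k) = 10*log10(sum(centered.^2));         % log energy (dB)
    zcr(k)       = sum(centered(1:end-1) .* centered(2:end) < 0);   % zero samples not counted
end

fprintf('%d frames at %.1f frames/sec\n', frameCount, fs/hop);
% prints: 127 frames at 64.1 frames/sec
```

In this toy signal the faint high-frequency segment has low volume but high ZCR, while the tone has high volume and low ZCR, which is the kind of contrast that endpoint detection based on volume and ZCR relies on.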