SGN–14006 Audio and Speech Processing

Introduction 1
SGN-14006 / A.K.
SGN–14006
Audio and Speech Processing
Lectures, Fall 2015
Pasi Pertilä
Tampere University of Technology
(slides by Anssi Klapuri)
Introduction 2
Course goals
Learn the basics of audio signal processing
–  Basic operations and their underlying ideas and principles
–  Provides basic skills, although not all of the latest cutting-edge algorithms can be covered
Learn the fundamentals of speech processing
–  Speech production and its computational modeling
–  Acoustic features to represent speech signals
–  Some applications: speech coding, synthesis
Learn the basics of acoustics and human hearing
–  These form the foundation for technical applications
Introduction 3
Lecture timeline (some changes may still take place)
Sound, audio signals, acoustics
Hearing
Basic audio signal processing operations
–  AD/DA-conversion, filters and filter banks, dynamic control, etc.
Sound synthesis
Audio coding
Speech production anatomy, phonetics
Linear prediction, MFCCs, and cepstrum
Speech coding
Speech synthesis
Introduction 4
What is not covered by this course
Speech recognition, audio content analysis, and acoustic pattern recognition
–  See the course SGN-24006 ”Analysis of Audio, Speech and Music Signals” (period 4)
Analog audio
–  Electroacoustics, microphone and loudspeaker design
–  See the course ”Akustiikan mittaukset” (Acoustic Measurements)
Hardware implementations
Introduction 5
Practical arrangements
Course homepage: http://www.cs.tut.fi/~sgn14006
Lectures
–  Mondays 12-14 in TB219
–  Thursdays 14-16 in TB222
–  Pasi Pertilä, pasi.pertila @ tut.fi
Exercises
–  Tuesday 10-12 in TC303 (updated!)
–  Friday 12-14 in TC303
–  Register for either group online at 14:00 today: www.tut.fi/pop
Lecture slides will be available as pdf on the course page
–  The course is not based on any individual textbook. Lectures, lecture notes and exercises will be sufficient to take the exam.
–  Some recommended textbooks are mentioned at the end of this introduction
Requirements: exam and project work
5 cr
Introduction 6
Exercises start one week after the lectures (2.9.2015)
Assistants: Shriram Nandakumar, Emre Cakir
Contents: math and Matlab exercises related to the lectures
Two alternative groups
Math problems are to be solved in advance; Matlab exercises are done during the exercise sessions
Active completion of the exercises and participation in the exercise sessions is credited with up to 3 points in the exam (equivalent to one mark)
Project work will be discussed at the exercises too
Introduction 7
Project work
Implementing an audio signal processing algorithm in Matlab
–  In two-person groups
Topic(s) will be introduced later during the lectures
Requirements:
–  Choosing the topic
–  Implementing the algorithm
–  Final report by 28.10.
More detailed instructions will appear on the course home page
Introduction 8
Reference material
Gold, Morgan, Ellis. ”Speech and audio signal processing,” Wiley, 2011.
Zölzer. ”Digital audio signal processing,” Wiley & Sons, 2nd ed., 2008.
–  Including AD/DA-conversion, dynamic control, equalization, filter banks
Quatieri. ”Discrete-time speech signal processing: principles and practice,” Prentice Hall PTR, 2002.
Rossing. ”The science of sound,” Addison-Wesley, 1990.
–  Acoustics, hearing
Brandenburg, Kahrs. ”Applications of digital signal processing to audio and acoustics,” Kluwer Academic Publishers, 1998.
–  Chapter on perceptual audio coding
Pulkki, Karjalainen. ”Communication acoustics,” Wiley, 2015.
Introduction 9
Introduction to audio signals and their representation
Introduction 10
Audio signals
Audio = related to sound or hearing
The word sound may mean
1.  a sensation perceived by the auditory system, or
2.  longitudinal pressure waves in a material medium (such as air) that may cause a hearing sensation
–  Due to the limits of human hearing, we usually consider the frequency range 20 Hz – 20 kHz and air as the medium (although hearing also works underwater, for example)
Sound signal – audio signal
–  Numerical representation of sound
–  Sound pressure as a function of time, measured using a microphone for example
Note: ”audio signal” is often understood to mean a non-speech audio signal, although speech signals are audio too
Introduction 11
Audio and speech processing
Where is audio and speech processing needed?
Examples:
–  Convert a musical piece into compressed mp3 format and store it on a hard disk for later playback (audio coding)
–  Encode a speech signal on a mobile phone before transmission
–  Add reverberation to a sound, correct the pitch of a singer (studio technology)
–  Enhance the quality of a speech signal (denoising, echo cancellation)
–  Compensate for loudspeaker non-idealities by digital equalization
Typical digital signal processing system:
1. Digitize a signal (sampling, quantization)
2. Process in digital form (store, manipulate, etc.)
–  Digital representation enables a variety of algorithms
3. Convert back to an analog signal
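Step 1 of the system above can be sketched in a few lines of Python. This is only a minimal illustration: the 1 kHz tone, the 8 kHz sampling rate, the 8-bit word length, and the helper names are assumptions made for the example, not values from the lecture.

```python
import math

def sample_sine(freq_hz, fs_hz, n_samples):
    """Sample a unit-amplitude sine wave at the rate fs_hz (sampling)."""
    return [math.sin(2 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples)]

def quantize(x, n_bits):
    """Uniformly quantize values in [-1, 1] to signed integer codes (quantization)."""
    max_level = 2 ** (n_bits - 1) - 1          # e.g. 127 for 8 bits
    return [round(v * max_level) for v in x]

def dequantize(q, n_bits):
    """Map integer codes back to the range [-1, 1]."""
    max_level = 2 ** (n_bits - 1) - 1
    return [v / max_level for v in q]

# Illustrative values (assumptions): 1 kHz tone, 8 kHz sampling rate, 8 bits
x = sample_sine(1000.0, 8000.0, 16)
q = quantize(x, 8)
x_hat = dequantize(q, 8)

# The quantization error is bounded by half a quantization step
max_error = max(abs(a - b) for a, b in zip(x, x_hat))
print(q[:8])
print(max_error <= 0.5 / 127)
```

Increasing `n_bits` makes the quantization step, and therefore the worst-case error, smaller.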
Introduction 12
Audio signal representations
Different applications employ different representations
–  Time domain representation
–  Frequency domain representation
–  Time-frequency domain representation
In this course we mainly consider music and speech
–  Music signals involve a wide variety of sounds; billions of people listen to music worldwide
–  Speech signals are an important special category of sound signals due to their importance for communication
Introduction 13
Time domain signal
Air pressure as a function of time (zero level = normal air pressure) is a natural representation for audio
–  An analog signal is easy to record using a microphone and play back using a loudspeaker
An analog signal (solid line) can be represented with discrete samples (dots) without loss of information, if the sampling frequency ≥ 2 * highest frequency component in the signal
–  Remember from introductory signal processing courses
Introduction 14
Time domain signal (1)
For music, typical sampling rates are 44.1 or 48 kHz
–  Allows for representing the frequency range of human hearing (approximately 20 Hz – 20 kHz)
For speech
–  8 kHz: Narrowband
•  the conventional telephone rate (fricatives such as /s/ and /f/ are distorted)
–  16 kHz: Wideband
•  voice over IP, bandwidth extension
Other rates are also widely used: 96, 32, 22.05 kHz etc.
Most of the energy (and information) of natural sounds is at low frequencies (around 200 Hz – 5 kHz)
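The sampling condition can be illustrated with a small Python sketch. The function names are hypothetical and the 5 kHz / 8 kHz example is an assumption chosen for illustration: it shows what happens when a component lies above half the sampling rate, namely that it becomes indistinguishable from a lower frequency (aliasing).

```python
import math

def alias_frequency(f_hz, fs_hz):
    """Apparent frequency of a sinusoid of f_hz after sampling at fs_hz."""
    # Fold the frequency into the range [0, fs/2]
    k = math.floor(f_hz / fs_hz + 0.5)        # nearest integer multiple of fs
    return abs(f_hz - k * fs_hz)

def is_sampling_rate_sufficient(f_max_hz, fs_hz):
    """Condition from the slide: fs >= 2 * highest frequency component."""
    return fs_hz >= 2 * f_max_hz

# 20 kHz (upper limit of hearing) fits within a 44.1 kHz sampling rate
print(is_sampling_rate_sufficient(20000, 44100))
# A 5 kHz component sampled at 8 kHz shows up at 3 kHz
print(alias_frequency(5000, 8000))

# The samples of the 5 kHz and 3 kHz cosines are indistinguishable at 8 kHz
fs = 8000
same = all(
    math.isclose(math.cos(2 * math.pi * 5000 * n / fs),
                 math.cos(2 * math.pi * 3000 * n / fs), abs_tol=1e-9)
    for n in range(64)
)
print(same)
```

This is one way to see why narrowband (8 kHz) speech cannot carry the high-frequency content of sounds such as /s/.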
Introduction 15
Time domain signal (2)
A large time scale illustrates the sound amplitude envelope
Example signal: one note from the oboe
–  Amplitude is zero before the sound starts
–  The oboe has continuous excitation, therefore the sound’s amplitude envelope remains nearly constant throughout its duration
Introduction 16
Time domain signal (3)
Zoom-in of the same oboe signal at time t = 0.45 s
A 90 ms frame illustrates the periodic waveform
–  Many sounds are periodic, for example most musical instrument sounds and vowels in speech
Introduction 17
Frequency domain representation – spectrum
Obtained by computing the discrete Fourier transform (for example) of the time-domain signal, usually in a short frame
Many perceptually important properties are more clearly visible in the frequency domain
Decibel scale for amplitude is useful from the viewpoint of human hearing and the dynamics of natural sounds
–  Due to Fechner’s law (subjective sensation is proportional to the logarithm of the stimulus intensity)
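As a rough sketch of how such a spectrum could be computed, the Python snippet below applies a naive discrete Fourier transform to one short frame and converts the magnitudes to decibels. The 64-sample test frame and the function names are assumptions for illustration; practical implementations would use an FFT and usually a window function.

```python
import cmath
import math

def dft(frame):
    """Naive O(N^2) discrete Fourier transform of a list of samples."""
    n_total = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_total)
            for n in range(n_total))
        for k in range(n_total)
    ]

def magnitude_db(spectrum, floor=1e-12):
    """Magnitude spectrum in decibels; the floor avoids log10(0)."""
    return [20 * math.log10(max(abs(c), floor)) for c in spectrum]

# Illustrative frame (assumption): 64 samples of a cosine that completes
# exactly 8 cycles in the frame, so its energy falls on DFT bin 8
N = 64
frame = [math.cos(2 * math.pi * 8 * n / N) for n in range(N)]
mag_db = magnitude_db(dft(frame))

# Only the first half of the bins is unique for a real-valued signal
half = mag_db[: N // 2 + 1]
peak_bin = max(range(len(half)), key=lambda k: half[k])
print(peak_bin, round(half[peak_bin], 1))
```

On the decibel scale the single sinusoid stands far above the numerical noise in the other bins.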
Introduction 18
Consider log-frequency and dB-magnitude
Linear scale
–  usually hard to ”see” anything
Log-frequency
–  each octave is approximately equally important perceptually
Log-magnitude
–  perceived change from 50 dB to 60 dB is about the same as from 60 dB to 70 dB
Phases are perceptually less important – often omitted
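The two logarithmic scales can be made concrete with a short Python sketch. It assumes the common convention of 20·log10 for amplitude (pressure) ratios; the helper names are hypothetical.

```python
import math

def amplitude_ratio_to_db(ratio):
    """Convert an amplitude ratio to decibels."""
    return 20 * math.log10(ratio)

def db_to_amplitude_ratio(db):
    """Convert a decibel difference back to an amplitude ratio."""
    return 10 ** (db / 20)

def octaves_between(f1_hz, f2_hz):
    """Distance between two frequencies in octaves (one octave = doubling)."""
    return math.log2(f2_hz / f1_hz)

# A 10 dB step always corresponds to the same amplitude ratio,
# whether it is from 50 dB to 60 dB or from 60 dB to 70 dB
step_a = db_to_amplitude_ratio(60) / db_to_amplitude_ratio(50)
step_b = db_to_amplitude_ratio(70) / db_to_amplitude_ratio(60)
print(round(step_a, 2), round(step_b, 2))

# On a log-frequency axis, 100-200 Hz and 5-10 kHz both span one octave
print(octaves_between(100, 200), octaves_between(5000, 10000))

# Number of octaves covered by the 20 Hz - 20 kHz range of hearing
print(round(octaves_between(20, 20000), 2))
```

Equal steps on either axis thus correspond to equal ratios, not equal differences, of the underlying quantity.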
Introduction 19
Time-frequency representation – spectrogram
Shows sound intensity as a function of time and frequency
Obtained by blocking the signal into short analysis frames and by computing their spectra
For audio, the frame size is typically 10–100 ms: sound spectra are often nearly stationary at that time scale
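The framing procedure described above might be sketched as follows in Python. This is a simplified illustration under stated assumptions: a synthetic two-tone test signal, non-overlapping 10 ms frames, no window function, and a naive DFT; the function names are hypothetical.

```python
import cmath
import math

def frame_signal(x, frame_len, hop):
    """Block the signal into (possibly overlapping) frames of frame_len samples."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 of one frame (naive DFT, no window)."""
    n_total = len(frame)
    return [
        abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_total)
                for n in range(n_total)))
        for k in range(n_total // 2 + 1)
    ]

def spectrogram(x, frame_len, hop):
    """List of magnitude spectra, one per analysis frame."""
    return [magnitude_spectrum(f) for f in frame_signal(x, frame_len, hop)]

# Illustrative test signal (assumption): 1 kHz tone followed by a 2 kHz tone
fs = 8000
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(400)]
x += [math.sin(2 * math.pi * 2000 * n / fs) for n in range(400)]

frame_len = 80                      # 10 ms at 8 kHz
spec = spectrogram(x, frame_len, hop=frame_len)

# Frequency of the strongest bin in each frame
bin_hz = fs / frame_len
peaks_hz = [max(range(len(s)), key=lambda k: s[k]) * bin_hz for s in spec]
print(peaks_hz)
```

The strongest frequency changes from frame to frame, which is the kind of time-varying information a single spectrum of the whole signal would not show.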
Introduction 20
Example audio signals: guitar
Sound decays gradually after the onset
Instantaneous excitation: string is plucked at onset
Periodic sound (vibrating string, covered in the Acoustics lecture)
Introduction 21
Example audio signal: snare drum
Instantaneous excitation, exponentially decaying amplitude envelope
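An exponentially decaying envelope can be imitated with a simple synthetic signal. The Python sketch below is not a model of a snare drum: it uses a single decaying sinusoid, and the 200 Hz frequency, 0.1 s decay time, and function names are assumptions for illustration only.

```python
import math

def decaying_sinusoid(freq_hz, decay_time_s, fs_hz, duration_s):
    """Sinusoid whose amplitude envelope is exp(-t / decay_time_s)."""
    n_samples = int(fs_hz * duration_s)
    return [
        math.exp(-(n / fs_hz) / decay_time_s)
        * math.sin(2 * math.pi * freq_hz * n / fs_hz)
        for n in range(n_samples)
    ]

def frame_peaks(x, frame_len):
    """Peak absolute amplitude in consecutive non-overlapping frames."""
    return [max(abs(v) for v in x[i:i + frame_len])
            for i in range(0, len(x) - frame_len + 1, frame_len)]

# Illustrative values (assumptions): 200 Hz, 0.1 s decay time, 0.5 s duration
fs = 8000
x = decaying_sinusoid(200.0, 0.1, fs, 0.5)
peaks = frame_peaks(x, 400)          # 50 ms frames

# Each 50 ms frame is weaker than the previous one by the same factor,
# i.e. by a constant number of decibels
ratios = [b / a for a, b in zip(peaks, peaks[1:])]
print(round(ratios[0], 4), round(20 * math.log10(ratios[0]), 2))
```

A constant ratio between successive frames means the envelope falls along a straight line when plotted in decibels.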
Introduction 22
Example audio signals: snare drum (2)
Zoom-in of the snare drum waveform
The signal also contains non-periodic components
Introduction 23
Example audio signals: snare drum (3)
Spectrum is noise-like too: the structure is not as clear as in the oboe’s spectrum
Introduction 24
Example audio signals: snare drum (4)
Spectrogram
Introduction 25
Polyphonic music (1)
Polyphonic music consists of a mix of several sound sources (linear superposition)
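Linear superposition simply means that the mixture is the sample-wise sum of the source signals. A minimal Python sketch, using two sinusoids as hypothetical stand-ins for real sound sources:

```python
import math

def tone(freq_hz, amplitude, fs_hz, n_samples):
    """Sinusoidal stand-in for a single sound source."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / fs_hz)
            for n in range(n_samples)]

def mix(*sources):
    """Linear superposition: the mixture is the sample-wise sum of the sources."""
    return [sum(samples) for samples in zip(*sources)]

# Illustrative sources (assumptions): 440 Hz and 660 Hz tones at 8 kHz
fs = 8000
a = tone(440.0, 0.5, fs, 800)
b = tone(660.0, 0.3, fs, 800)
mixture = mix(a, b)

# Because the mixing is linear, subtracting one source recovers the other
recovered_b = [m - s for m, s in zip(mixture, a)]
max_diff = max(abs(r - s) for r, s in zip(recovered_b, b))
print(max_diff < 1e-12)
```

In practice the individual sources are not known, which is what makes analysing polyphonic music difficult.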
Introduction 26
Polyphonic music (2)
Spectrogram reveals e.g. the rhythmic structure
Introduction 27
Speech: time domain signal (1)
One sentence (”He knew what taboos he was violating.”)
Speech can be viewed as a sequence of phonemes
Introduction 28
Speech: time domain (2)
Zooming in to different phonemes
–  Left: vowel ”e” in ”He” (voiced: periodic)
–  Right: ”t” in ”taboos” (unvoiced: ”noisy”)
Introduction 29
Speech spectrogram
Each phoneme has its characteristic spectral shape
Transitions between phonemes are continuous rather than step-like
Introduction 30
AD#1
“NERDS MEET ARTISTS”
2015-2016 Joint Course Module of Signal Processing, School of Architecture and Civil Engineering
GOAL:
Help signal processing engineers to understand the needs of urban design and help architects and civil engineers to understand the potential of modern ICT in quantitative analysis of urban spaces. With the help of camera and microphone systems, automatic analysis is provided for quantitative urban space monitoring. The quantitative data is used for boosting architectural and civil engineering design of future urban spaces.
COURSES (depends on your home department):
ARK-53806 Sustainable Design Studio
RAK-13106 Sustainable Development Studio
SGN-81006 Signal Processing Innovation Project
PARTICIPATION:
This course module invites students from signal processing, architecture and civil engineering. Enroll in one of the above courses and come to the Opening Session, August 25 2015, 10:00-12:00, RO104, where the overall description is given and the project groups will be formed. The work will be supervised by researchers from the Department of Signal Processing, the School of Architecture and the Department of Civil Engineering.
FOR MORE INFORMATION:
Harry Edelman (School of Architecture / Dept. of Civil Engineering)
Joni Kämäräinen (Dept. of Signal Processing - video processing)
Tuomas Virtanen (Dept. of Signal Processing - audio processing)
Introduction 31
AD#2, Participate in a study, get a movie ticket!
Invitation to Data Collection Campaign
A project in the Department of Signal Processing needs speech data for research purposes.
Your task is to read out simple English sentences from a script. Takes 25 minutes.
Reward: a movie ticket.
How to participate?
We need two persons per recording, so come with a friend. If you are alone, we could try to pair you.
Sign up via email: aleksandr.diment@tut.fi
The sessions take place on 24-28.8 during office hours, or at a different time upon agreement.