
The Beatbox
ABSTRACT
The Beatbox is a real-time voice-to-drum
synthesizer intended primarily for the
entertainment of small children. It accepts
speech input limited to a small dictionary
of sounds the system is pre-trained to
recognize. Each sound in the dictionary
has a pre-determined corresponding
drumbeat, which is then played back to
the user. In this manner, someone without
knowledge of the drums can effectively
“play” the instrument with his/her mouth.
Voice-to-Drum Synthesizer
The Beatbox system consists of three main components:

The DSP unit accepts voice input, then cleans and
analyzes the incoming signal

The Pattern Recognition subsystem uses frequency
characteristics to probabilistically determine the most
likely match for the input data

The Demonstration system is a GUI that controls the
audio and visual feedback given to the user
[Diagram label: Digital Sampling by sound card]
MOTIVATION
Speech recognition is a key tool in the
design of next-generation user-friendly
computer applications. A major
obstacle remaining in the way of this goal
is the detection of stop consonants,
sounds created by stopping the flow of air
in the mouth and then releasing it in a burst
(e.g., b, d, g, k, p, and t). Telling stop
consonants apart is a difficult problem
because they sound so similar. Building a
system to distinguish stop consonants may help
bring speech recognition one step closer
to reality.
AUTHORS: PRIYADARSHINI ROUTH
ARYEH LEVINE
[Diagram labels: Digital Signal Processing; Input Signal]
DIGITAL SIGNAL PROCESSING UNIT
Windowing
When a user’s voice input triggers the underlying engine, it is
converted into a digital signal and passed on to the DSP unit. The
signal is divided into short windows about 25 ms long, and each
window is multiplied by a Hanning window before its FFT is taken.
Our system is modeled as a filter bank, like the human ear, which
allows the information contained in the signal to be compressed.
Further redundancies are eliminated through cepstral analysis
before the processed signal is handed over to the Pattern
Recognition subsystem.
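The front end described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the poster does not specify the hop size, FFT length, number of filters, or number of cepstral coefficients, so the values below are assumptions, and the mel-spaced triangular filters are a common stand-in for critical bands rather than the filter bank actually used. All function names are hypothetical.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames (hop_ms is an assumption)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced on the mel scale (assumed stand-in for critical bands)."""
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):    # rising edge of the triangle
            bank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):   # falling edge of the triangle
            bank[i, k] = (right - k) / max(right - center, 1)
    return bank

def cepstral_features(signal, sample_rate, n_filters=20, n_ceps=12, n_fft=None):
    """Window -> FFT -> band integration -> log -> cepstrum, one row per frame."""
    frames = frame_signal(np.asarray(signal, dtype=float), sample_rate)
    frame_len = frames.shape[1]
    if n_fft is None:
        n_fft = 1 << (frame_len - 1).bit_length()  # next power of two >= frame_len
    windowed = frames * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(windowed, n=n_fft, axis=1)) ** 2
    band_energy = power @ mel_filter_bank(n_filters, n_fft, sample_rate).T
    log_energy = np.log(band_energy + 1e-10)  # small offset avoids log(0)
    # Unnormalized DCT-II of the log band energies gives the cepstral coefficients.
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n_filters) + 0.5) / n_filters)
    return log_energy @ basis.T
```

Calling `cepstral_features(x, 16000)` on a one-second signal yields one feature vector per 25 ms window, which is the kind of observation sequence a pattern recognition stage could consume.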
SYSTEM FLOW DIAGRAM
[Diagram labels: Voice Input; Pattern Recognition using Hidden Markov Models; Audio Playback and Visual Feedback]
DSP FLOWCHART
[Diagram labels: Fast Fourier Transform; Logarithmic Compression; Critical Band Integration]
DEMONSTRATION
90% accuracy on dictionary sounds
when trained on the same user.
80% accuracy when a pre-existing
training set is used.
RAPHAEL LEVY
ADVISOR: PROF. LAWRENCE SAUL
PATTERN RECOGNITION UNIT
Training: The recognition/classification system is based on the
theory of Hidden Markov Models (HMMs). Given the observation
sequence, we infer the most likely underlying “hidden state”
sequence using the Viterbi algorithm. We then iteratively
estimate the parameters of the HMM until convergence is achieved.
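The training loop described above (decode state sequences with Viterbi, re-estimate parameters from those sequences, repeat) can be sketched as below. This is an illustrative sketch under simplifying assumptions: observations are discrete symbol indices, whereas the poster's system works on cepstral features; the add-one smoothing, the random initialization, and the "paths stopped changing" convergence test are also assumptions. The names `viterbi` and `viterbi_train` are hypothetical.

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Most likely state path for a discrete-observation HMM, in the log domain."""
    n_states, T = log_A.shape[0], len(obs)
    delta = np.empty((T, n_states))          # best log score ending in each state
    back = np.zeros((T, n_states), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: move from i to j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(n_states)] + log_B[:, obs[t]]
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 1, 0, -1):            # trace the back-pointers
        path[t - 1] = back[t, path[t]]
    return path, float(np.max(delta[-1]))

def viterbi_train(sequences, n_states, n_symbols, n_iter=20, seed=0):
    """Alternate Viterbi decoding and count-based re-estimation until paths stabilize."""
    rng = np.random.default_rng(seed)
    pi = np.full(n_states, 1.0 / n_states)
    A = rng.dirichlet(np.ones(n_states), size=n_states)
    B = rng.dirichlet(np.ones(n_symbols), size=n_states)
    prev_paths = None
    for _ in range(n_iter):
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
        paths = [viterbi(seq, log_pi, log_A, log_B)[0] for seq in sequences]
        # Start counts at 1 (add-one smoothing, an assumption) so no probability is zero.
        pi_c = np.ones(n_states)
        A_c = np.ones((n_states, n_states))
        B_c = np.ones((n_states, n_symbols))
        for seq, path in zip(sequences, paths):
            pi_c[path[0]] += 1
            for t in range(len(seq)):
                B_c[path[t], seq[t]] += 1
                if t > 0:
                    A_c[path[t - 1], path[t]] += 1
        pi = pi_c / pi_c.sum()
        A = A_c / A_c.sum(axis=1, keepdims=True)
        B = B_c / B_c.sum(axis=1, keepdims=True)
        if prev_paths is not None and all(
            np.array_equal(p, q) for p, q in zip(paths, prev_paths)
        ):
            break                             # decoded paths unchanged: treat as converged
        prev_paths = paths
    return pi, A, B
```

In this sketch one model would be trained per dictionary sound, for example by passing all recorded examples of /k/ as `sequences`.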
TRAINING THE /K/ HMM
[Flowchart labels: N samples of /k/ received from DSP unit; Viterbi algorithm infers N state sequences; Estimate parameters of /k/ HMM; Convergence; Iterate till convergence]
HOW THE SYSTEM HEARS YOU!
[Figure: Waveforms and Spectrograms of /k/, /pff/, /t/]
[Diagram label: Cepstrum]
Testing: Each dictionary sound has a pre-trained HMM
corresponding to it. The signal is passed through each HMM;
the input is classified in favor of the HMM with the highest
estimated likelihood.
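The testing step can be illustrated with a short sketch that scores an input against every trained model and keeps the best one. The poster does not say how the likelihood is estimated; the scaled forward algorithm used here is one standard choice and is an assumption, as are the discrete observations and the requirement that all model parameters be strictly positive. The names `log_likelihood` and `classify` are hypothetical.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: returns log P(obs | model) for a discrete HMM."""
    alpha = pi * B[:, obs[0]]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]   # propagate, then weight by emission
        scale = alpha.sum()
        log_prob += np.log(scale)                # accumulate the log of each scale factor
        alpha = alpha / scale                    # renormalize to avoid underflow
    return log_prob

def classify(obs, models):
    """Score obs under every model and return the best label with all scores.

    models maps a label such as "/k/" to a (pi, A, B) tuple of NumPy arrays.
    """
    scores = {label: log_likelihood(obs, *params) for label, params in models.items()}
    return max(scores, key=scores.get), scores
```

With one `(pi, A, B)` entry per dictionary sound, `classify(obs, models)` returns the label whose model assigns the input the highest log-likelihood, and that label would then select the drumbeat to play.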