The Beatbox
Voice-to-Drum Synthesizer

AUTHORS: PRIYADARSHINI ROUTH, ARYEH LEVINE

ABSTRACT
The Beatbox is a real-time voice-to-drum synthesizer intended primarily for the entertainment of small children. It accepts speech input limited to a small dictionary of sounds that the system is pre-trained to recognize. Each sound in the dictionary has a predetermined corresponding drumbeat, which is played back to the user when that sound is recognized. In this manner, someone without knowledge of the drums can effectively “play” the instrument with his or her mouth.

The Beatbox system comprises three main components:
- The DSP unit accepts voice input, then cleans and analyzes the incoming signal.
- The Pattern Recognition subsystem uses frequency characteristics to probabilistically determine the most likely match for the input data.
- The Demonstration system is a GUI that controls the audio and visual feedback given to the user.

MOTIVATION
Speech recognition is a key tool in the design of the next generation of user-friendly computer applications. A major obstacle that remains is the detection of stop consonants: sounds created by stopping the flow of air in the mouth and then releasing it in a burst (e.g. b, d, g, k, p and t). Telling stop consonants apart is difficult because they sound so similar. Building a system that distinguishes stop consonants may help bring speech recognition one step closer to reality.

[Diagram labels: Digital Sampling by sound card; Digital Signal Processing; Input Signal; Windowing]

DIGITAL SIGNAL PROCESSING UNIT
When a user’s voice input triggers the underlying engine, it is converted into a digital signal and passed on to the DSP unit. The signal is divided into short windows about 25 ms long, and each window is multiplied by a Hanning window before its FFT is taken. Our system is modeled as a filter bank, like the human ear, which allows the information contained in the signals to be compressed.
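The windowing, FFT, and filter-bank steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 10 ms hop, the choice of 20 bands, the mel-scale spacing used as a stand-in for the ear's critical bands, the rectangular band shape, and all function names are assumptions made for the example.

```python
import numpy as np


def frame_signal(signal, sample_rate, window_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames of ~25 ms (hop_ms is assumed)."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * window_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one window")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i holds samples [i * hop_len, i * hop_len + frame_len).
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]


def power_spectra(frames, sample_rate):
    """Apply a Hanning window to each frame and return (power, bin frequencies)."""
    frame_len = frames.shape[1]
    window = np.hanning(frame_len)
    spectrum = np.fft.rfft(frames * window, axis=1)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return np.abs(spectrum) ** 2, freqs


def band_edges_hz(sample_rate, n_bands):
    """Band edges equally spaced on the mel scale (assumed approximation)."""
    max_mel = 2595.0 * np.log10(1.0 + (sample_rate / 2.0) / 700.0)
    mels = np.linspace(0.0, max_mel, n_bands + 1)
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)


def band_energies(power, freqs, sample_rate, n_bands=20):
    """Sum the power of all FFT bins that fall inside each band."""
    edges = band_edges_hz(sample_rate, n_bands)
    band_of_bin = np.searchsorted(edges, freqs, side="right") - 1
    band_of_bin = np.clip(band_of_bin, 0, n_bands - 1)
    energies = np.zeros((power.shape[0], n_bands))
    for band in range(n_bands):
        energies[:, band] = power[:, band_of_bin == band].sum(axis=1)
    return energies
```

Calling `frame_signal`, then `power_spectra`, then `band_energies` on a recording reduces each 25 ms window from a few hundred samples to a short vector of band energies, which is the kind of compression the filter-bank model is meant to provide.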
Further redundancies are eliminated through cepstral analysis before the processed signal is handed over to the Pattern Recognition subsystem.

[SYSTEM FLOW DIAGRAM labels: Voice Input; Pattern Recognition using Hidden Markov Models; Audio Playback and Visual Feedback]
[DSP FLOWCHART labels: Fast Fourier Transform; Logarithmic Compression; Critical Band Integration; Cepstrum]

DEMONSTRATION
90% accuracy on dictionary sounds if the system is trained on the same user.
80% accuracy if a pre-existing training set is used.

RAPHAEL LEVY
ADVISOR: PROF. LAWRENCE SAUL

[HOW THE SYSTEM HEARS YOU! Waveforms and Spectrograms of /k/, /pff/, /t/]

PATTERN RECOGNITION UNIT
Training: The recognition/classification system is based on the theory of Hidden Markov Models (HMMs). Given an observation sequence, we infer the most likely underlying “hidden state” sequence using the Viterbi algorithm. We then re-estimate the parameters of the HMM from the inferred state sequences, and repeat both steps until convergence is achieved.

[TRAINING THE /K/ HMM labels: N samples of /k/ received from DSP unit; Viterbi algorithm infers N state sequences; Estimate parameters of /k/ HMM; Convergence; Iterate till convergence]

Testing: Each dictionary sound has a corresponding pre-trained HMM. The input signal is passed through each HMM and is classified in favor of the HMM with the highest estimated likelihood.
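The testing step can be illustrated with a short sketch that scores one feature sequence against several HMMs and picks the best. Several things here are assumptions rather than details from the poster: the diagonal-Gaussian emission model, the left-to-right topology, the use of the Viterbi best-path log-likelihood as the "estimated likelihood", and every function name and parameter value.

```python
import numpy as np


def log_gaussian(obs, means, variances):
    """Log density of each frame under each state's diagonal Gaussian.

    obs: (T, D); means and variances: (S, D); returns an array of shape (T, S).
    """
    diff = obs[:, None, :] - means[None, :, :]
    terms = np.log(2.0 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :]
    return -0.5 * terms.sum(axis=2)


def viterbi(obs, log_start, log_trans, means, variances):
    """Return the most likely state path and the log-likelihood of that path."""
    log_emit = log_gaussian(obs, means, variances)
    n_frames, n_states = log_emit.shape
    score = log_start + log_emit[0]
    backptr = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        # cand[i, j]: best score ending in state i at t-1, then moving to j.
        cand = score[:, None] + log_trans
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(score.max())


def left_to_right_hmm(means, variance=1.0, stay_prob=0.6):
    """Build toy HMM parameters (hypothetical helper, values are assumed)."""
    means = np.asarray(means, dtype=float)
    n_states = means.shape[0]
    start = np.full(n_states, 1e-12)  # tiny floor avoids log(0)
    start[0] = 1.0
    trans = np.full((n_states, n_states), 1e-12)
    for s in range(n_states):
        if s + 1 < n_states:
            trans[s, s] = stay_prob
            trans[s, s + 1] = 1.0 - stay_prob
        else:
            trans[s, s] = 1.0
    return np.log(start), np.log(trans), means, np.full_like(means, variance)


def classify(obs, models):
    """Score obs against every HMM and return (best label, all scores)."""
    scores = {label: viterbi(obs, *params)[1] for label, params in models.items()}
    return max(scores, key=scores.get), scores
```

In use, `models` would map each dictionary sound (for example `"/k/"` or `"/t/"`) to its trained parameters, and `classify` would be called on the cepstral feature frames of each incoming sound. The state path returned by `viterbi` is also the quantity that the training loop re-estimates parameters from.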