to an Automatic Speech Recognition System?

Jiang Wu
jiang.wu@binghamton.edu
Electrical Engineering Department

Suppose you are allowed only 39 values to represent a 1-second speech segment…

What are good features?
◦ Discriminative
◦ Low-dimensional (the “curse of dimensionality”)

For each frame of prerecorded speech, we extract features that compress its spectrum.

[Figure: basis vectors BV0, BV1, BV2; amplitude vs. frequency (kHz)]

Recent research has shown that spectral trajectories over time also play an important role in ASR.

Thus, we also want the computer to see what happens over time, around the center of each static feature.

[Figure: basis vectors BV0, BV1, BV2; amplitude vs. time (ms)]

[Figure: original spectrogram and rebuilt spectrogram; frequency (Hz) vs. time (s)]

Pitch Contour:
◦ Strongly related to “tones”
◦ Very popular feature type for tonal languages: Mandarin, Cantonese, some Korean dialects, etc.

[Figure: speech waveform (amplitude vs. time in seconds), pitch (frequency in Hz vs. time), and spectrogram (frequency in Hz vs. time)]

“Perceptual” Features:
◦ Analyze the speech signal the way the human auditory system “perceptually” processes sound
◦ Frequency resolution and time resolution both depend on frequency and time

Project 1: Create an open-source multilanguage audio database for spoken language processing applications.
Project 2: Understand tonal languages.

Dr. Montri Karnjanadecha, Chandra Vootkuri (Ph.D.), Brian Wong (M.S.), Andrew Hwang (M.S.)
(Forced Alignment), (Landmark Theory), (Freq. Non-linearity), (Feature Transform)

◦ Signal processing (A/D)
◦ Probability theory, pattern recognition, and machine learning
◦ An understanding of the human auditory system, linguistics, or musicality is a bonus!

Tons of interesting applications!!
◦ Pronunciation therapy
◦ Hearing aids
◦ Singing voice processing
◦ Not just the Microsoft speech-to-text software on your PC…

Questions?
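
Backup sketch: the two feature-extraction ideas from the earlier slides (compress each frame's spectrum with a few basis vectors, then describe how those static features move over time) might look roughly like the Python below. This is only an illustration under stated assumptions, not the method used in the talk: the cosine (DCT-II style) basis, the 16 kHz sampling rate, the 25 ms frame, the 10 ms hop, the ±60 ms time window, and the 13 × 3 = 39 split are all guesses made for the example, and the function names are hypothetical. It also produces 39 values per frame rather than per 1-second segment.

```python
import numpy as np

def cosine_basis(n_vectors, length):
    """Rows are cosine basis vectors BV0, BV1, ... (DCT-II style, assumed here)."""
    n = np.arange(length)
    k = np.arange(n_vectors)[:, None]
    return np.cos(np.pi * k * (n + 0.5) / length)

def static_features(signal, frame_len=400, hop=160, n_static=13):
    """Compress each frame's log-magnitude spectrum into n_static coefficients.

    Assumes 16 kHz audio, so 400 samples = 25 ms and 160 samples = 10 ms.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    log_spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    basis = cosine_basis(n_static, log_spec.shape[1])
    return log_spec @ basis.T                      # (n_frames, n_static)

def trajectory_features(static, n_time=3, half_width=6):
    """Project a window of frames centred on each frame onto basis vectors over time.

    half_width=6 frames at a 10 ms hop covers roughly -60 ms to +60 ms.
    """
    width = 2 * half_width + 1
    # Repeat the edge frames so every frame has a full window around it.
    padded = np.pad(static, ((half_width, half_width), (0, 0)), mode="edge")
    basis = cosine_basis(n_time, width)            # (n_time, width)
    return np.stack([(basis @ padded[t : t + width]).ravel()
                     for t in range(static.shape[0])])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    one_second = rng.standard_normal(16000)        # stand-in for 1 s of speech
    feats = trajectory_features(static_features(one_second))
    print(feats.shape)                             # (frames, 39)
```

Using the same kind of basis along both frequency and time keeps the sketch short; a real front end would typically add a perceptual frequency warping before the projection, which is omitted here.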