Speech Recognition and Assessment
Tomer Meshorer

Agenda
This presentation describes the use of speech recognition for:
• HCI for spastic dysarthria patients: "HMM-Based and SVM-Based Recognition of the Speech of Talkers with Spastic Dysarthria" [M. Hasegawa-Johnson]
• Identifying the progression of Parkinson's disease from the speech signal: "Enhanced Classical Dysphonia Measures and Sparse Regression for Telemonitoring of Parkinson's Disease Progression" [A. Tsanas]
• An auditory microswitch: "Extending the evaluation of a computer system used as a microswitch for word utterances of persons with multiple disabilities" [G. E. Lancioni]

Motivation
• Dysarthria; the most common form is spastic dysarthria.
• Adults with cerebral palsy often find it hard to type.
• Idea: replace the keyboard with ASR.
• The paper studies three talkers and one control subject. All three talkers have spastic dysarthria due to cerebral palsy.
• One subject tends to delete word-initial consonants; another exhibits a slow stutter.
• Two algorithms: digit recognition using HMMs and digit recognition using SVMs.

Experiment
• Array of 8 microphones, 7 of which were used.
• Four types of speech data:
• Isolated digits
• The letters of the international radio alphabet
• Nineteen computer commands
• A read balanced text passage (129 words) and 56 TIMIT sentences
• Total training data: 541 words (395 distinct words).
• Intelligibility tests were performed using 40 different words selected from the TIMIT sentences.
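The seven retained microphone channels are later recognized both independently and in combination, with the per-channel recognizers voting on the final output. A minimal majority-vote sketch (the tie-breaking rule, first occurrence wins, is my assumption; the paper does not specify it):

```python
from collections import Counter

def microphone_vote(per_mic_labels):
    """Combine the labels produced by independent per-microphone recognizers
    by majority vote (ties broken by first occurrence, an assumption here)."""
    return Counter(per_mic_labels).most_common(1)[0][0]

# Seven microphones recognize one utterance of the digit "five";
# two noisy channels disagree, but the vote recovers the right label:
labels = ["five", "five", "nine", "five", "one", "five", "five"]
print(microphone_vote(labels))  # five
```

Voting of this kind is what separates the H/HV and WF/WFV conditions in the results reported later.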
Listeners were the author and two students.

Results
Per-listener intelligibility (percent of words correctly heard) for each talker:

Listener   F01     M01     M02     M03
L1         22.5%   22.5%   90%     30%
L2         17.5%   20%     90%     27.5%
L3         17.5%   15%     97.5%   30%
Avg        19.2%   19.2%   92.5%   29.2%

Listener Errors
• Errors were analyzed by consonant position: word-initial, word-medial, and word-final.
• Three types of consonant errors:
• Deletion ("sport" heard as "port")
• Insertion ("on" heard as "coin")
• Substitution ("for" heard as "bore")
• Other errors:
• Vowel substitution ("and" heard as "end")
• The number of syllables can change
• The entire word can be deleted

Listener Error Analysis

ASR
Four experiments: two speaker-dependent HMM recognizers and two speaker-dependent SVM recognizers.

HMM:
• First test: test data: 19 command words + 26 letters + 10 digits; training data: TIMIT sentences + the grandfather passage + one utterance per digit.
• Second test: test data: digits only; training data: same as the first test.

HMM ASR Results
• H: word recognition accuracy (WRA) when each microphone is recognized independently
• HV: WRA when the microphones vote to determine the final system output
• Word: accuracy of one SVM trained to distinguish isolated digits
• WF: adds the outputs of 170 binary word-feature SVMs
• WFV: like WF, but the single-microphone recognizers vote to determine the system output

SVM-Based ASR
• Fixed-length isolated word recognition; only digits were tested.
• Two SVMs were used: a 10-ary SVM and binary word-feature SVMs.

Conclusion
• ASR can be used to recognize digits for talkers with very low intelligibility.
• The HMM was successful for two subjects but failed for the subject who deletes consonants.
• The SVM was successful for two subjects but failed for the subject with the stutter.
• Hence, HMMs should be used when word length fluctuates, and SVMs when consonants are deleted.
• But: a 10-word vocabulary is too small for HCI.

Motivation
• Parkinson's disease (PD) is the second most common neurodegenerative disorder after Alzheimer's.
• Strong evidence has emerged linking speech degradation with PD progression.
• Current PD progression monitoring relies on empirical tests and physical exams, which is time consuming and costly.
• Results are mapped to the Unified Parkinson's Disease Rating Scale (UPDRS): motor-UPDRS ranges 0-108; total-UPDRS ranges 0-176, with 176 denoting total disability.
• Goal: use speech signal processing to map voice disorder to UPDRS scores.

Data
• Sustained-vowel speech recordings from 52 subjects with an idiopathic PD diagnosis.
• Subjects were physically assessed and given UPDRS scores at baseline, three months, and six months into the trial.
• Subjects took tests at home weekly using the Intel At Home Testing Device (AHTD), sustaining "ahh" for as long and as steadily as possible.
• Total of 5,875 signals, processed in MATLAB. 42 subjects; mean age 64; motor-UPDRS 20.84; total-UPDRS 11.52.

Features
• Dysphonia measures were calculated using Praat: frequency perturbations and amplitude perturbations.
• The log of each measure was also added.

Linear Regression
• UPDRS values were obtained at 0, 3, and 6 months, but recordings were weekly; linear interpolation was therefore used to obtain weekly UPDRS targets.
• Map the feature vector x to the UPDRS output y.
• Lasso regression was ultimately used.

Results
• Mapping performance was analyzed by training on 5,287 phonations and testing on 588.
• Performance metric: MAE (mean absolute error), MAE = (1/N) * sum_i |U_i - Uhat_i|, where U_i is the true UPDRS value and Uhat_i is the predicted value.

Conclusion
• Overall success in prediction: MAE of 6.6 for motor-UPDRS and 8.4 for total-UPDRS.
• The authors discovered during the work that better methods exist to measure dysphonia, reducing the need for the log transformation.
• Still, the Lasso regression here clearly shows that log-transformed classical dysphonia measures convey superior clinical information compared to the raw measures.
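The pipeline above (interpolate the 0/3/6-month clinical scores to weekly targets, fit a sparse Lasso regression from dysphonia features to UPDRS, score with MAE) can be sketched on synthetic data. The feature values and the alpha setting below are illustrative stand-ins, not the paper's real measures or hyperparameters:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Clinical motor-UPDRS assessments at weeks 0, 13, and 26 are linearly
# interpolated to one target per weekly recording session.
visit_weeks, visit_updrs = [0, 13, 26], [20.0, 22.5, 26.0]
weeks = np.arange(27)
weekly_updrs = np.interp(weeks, visit_weeks, visit_updrs)

# Synthetic stand-ins for log-transformed dysphonia measures (one row per week).
X = rng.normal(size=(27, 8))
y = weekly_updrs + 0.1 * X[:, 0]          # weak, made-up feature dependence

# Train/test split and sparse (L1-regularized) regression.
model = Lasso(alpha=0.1).fit(X[:20], y[:20])
pred = model.predict(X[20:])

# MAE = (1/N) * sum |U_i - Uhat_i|
mae = np.mean(np.abs(y[20:] - pred))
print(round(float(mae), 2))
```

The L1 penalty in Lasso drives uninformative feature weights to exactly zero, which is why the paper can use it both for prediction and to argue about which dysphonia measures carry clinical information.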
G. E. Lancioni et al.

Motivation
• Students with multiple disabilities are often unable to engage in constructive activity or to play a positive role in their daily context.
• Goal: explore the use of verbal utterances to exert control over environmental events.
• Microswitches are technical tools that may help such students improve their status.
• Main idea: build an utterance-based microswitch and test it with students.

Participants
• Tania, Alex, and Dennis, aged 18, 27, and 26.
• All three have severe intellectual disability.
• Alex and Dennis are totally blind, while Tania can discriminate light.
• All have normal hearing and can produce a number of words and short sentences.

Device: Auditory Microswitch
• A regular PC with an audio output device.
• Commercially available ASR (Dragon NaturallySpeaking).
• A proprietary control program linked each target utterance emitted by a participant with the words and phrases that the commercial software matched to it over different occurrences.
• The emitted words and phrases were grouped into categories based on phonetics, structure, and length.
• The categories served as recognition targets and as triggers for the activation of stimuli.

Selection of Stimuli
• Each stimulus is tied to a participant's target utterance:
• Tania: a funny story, a special song
• Alex: a singer's hit song, a person whistling
• Dennis: a pet's voice, local music
• Recognition of an utterance by the computer system produced the matching stimulus for 10-20 seconds.

Experiment
• Baseline: participants spoke samples of their target utterances, with no stimulus sounds, 70 times over several days. Recordings were made of the words/phrases and reference categories were built.
• Intervention: three groups of utterances per participant: Tania (3, 2, 2 words), Alex and Dennis (4, 4, 4 words). Each group was introduced after its own baseline (first group, baseline; second group, baseline; third group, baseline). Sessions lasted 10-20 minutes and recognition was recorded.
• Post-intervention: two months after the intervention,
19-22 sessions like those occurring during the intervention were conducted.

Results Summary
• About 80% of the utterances were correctly recognized by the computer system.
• Some utterances occurred at rates significantly higher (p < 0.01) than expected by chance.
• The computer system was an adequate microswitch for the participants' word utterances.
• Use of the system can be considered a valuable strategy for increasing the participants' constructive verbal engagement and for allowing self-determination in seeking positive environmental stimulation.
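The microswitch logic described above (ASR output is matched to an utterance category, and the category triggers that participant's stimulus for 10-20 seconds) can be sketched as a lookup. This is an illustrative sketch only: the paper used Dragon NaturallySpeaking with a proprietary control program, and the phrase variants, category names, filenames, and the playback stub below are all hypothetical:

```python
# Hypothetical mapping from recognized phrase variants to utterance categories.
CATEGORIES = {
    "music": "music", "music please": "music",
    "story": "story", "tell story": "story",
}

# Hypothetical mapping from utterance category to the stimulus presented.
STIMULI = {
    "music": "local_music.wav",
    "story": "funny_story.wav",
}

def on_recognized(phrase, duration_s=10):
    """Map an ASR result to its category and return the matching stimulus;
    non-target utterances trigger nothing."""
    category = CATEGORIES.get(phrase.lower().strip())
    if category is None:
        return None
    stimulus = STIMULI[category]
    # play_stimulus(stimulus, duration_s)  # playback stub, not implemented here
    return stimulus

print(on_recognized("Music please"))  # local_music.wav
```

Collapsing many recognized variants onto one category is what lets the switch tolerate the commercial ASR matching a participant's utterance to different words across occurrences.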