Speech Recognition and Assessment
Tomer Meshorer
Agenda
This presentation describes the use of speech recognition for:
• HCI for spastic dysarthria patients
  “HMM-Based and SVM-Based Recognition of the Speech of Talkers with Spastic Dysarthria” [M. Hasegawa-Johnson]
• Identify progression of Parkinson's disease using speech signal
  “Enhanced Classical Dysphonia Measures and Sparse Regression for Telemonitoring of Parkinson's Disease Progression” [A. Tsanas]
• Auditory microswitch
  “Extending the evaluation of a computer system used as a microswitch for word utterances of persons with multiple disabilities” [G. E. Lancioni]
MOTIVATION
• Dysarthria.
• Most common: spastic dysarthria. Adults with cerebral palsy who find it hard to type.
• Idea: replace the keyboard with ASR.
• The paper studies three talkers and one control subject. All three talkers have spastic dysarthria due to cerebral palsy.
• The subjects tend to delete word-initial consonants. One subject exhibits a slow stutter.
• Two algorithms:
  - Digit recognition using HMM
  - Digit recognition using SVM
Experiment
• Array of 8 mics, of which 7 were used.
• Four types of speech data:
  - Isolated digits
  - The letters in the international radio alphabet
  - Nineteen computer commands
  - Read balanced text passage (129 words) and 56 sentences (TIMIT)
• Total train data: 541 words, 395 distinct words
• Performed intelligibility tests using 40 different words selected from TIMIT sentences.
• Listeners are the author and two students
Results
Listener   F01     M01     M02     M03
L1         22.5%   22.5%   90%     30%
L2         17.5%   20%     90%     27.5%
L3         17.5%   15%     97.5%   30%
Avg        19.2%   19.2%   92.5%   29.2%
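The "Avg" row can be re-derived from the three listener scores. The sketch below is only an arithmetic check; the dictionary layout and the function name are illustrative choices, and the numbers are copied from the results above.

```python
# Intelligibility scores in percent for listeners L1, L2, L3,
# copied from the results above (one list per talker).
scores = {
    "F01": [22.5, 17.5, 17.5],
    "M01": [22.5, 20.0, 15.0],
    "M02": [90.0, 90.0, 97.5],
    "M03": [30.0, 27.5, 30.0],
}


def average_intelligibility(per_listener):
    """Mean intelligibility over listeners, in percent."""
    return sum(per_listener) / len(per_listener)


# Round to one decimal place, as on the slide.
averages = {talker: round(average_intelligibility(v), 1)
            for talker, v in scores.items()}

for talker, avg in averages.items():
    print(f"{talker}: {avg}%")
```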
Listener errors
• Look at consonant position
• Three consonant positions: word-initial, word-medial, and word-final
• Three types of consonant errors:
  - Deletion (“sport” heard as “port”)
  - Insertion (“on” heard as “coin”)
  - Substitution (“for” heard as “bore”)
• Other errors:
  - Vowel substitution (“and” heard as “end”)
  - The number of syllables could change
  - The entire word can be deleted
Listener errors analysis
ASR
• Four experiments: two speaker-dependent HMM and two speaker-dependent SVM
• HMM:
  - First test:
    * Test data: 19 command words + 26 letters + 10 digits.
    * Train data: TIMIT sentences + grandfather passage + utterance for each digit
  - Second test:
    * Test data: digits only
    * Train data: same as in the first test.
HMM ASR Results
(WRA = word recognition accuracy)
• H: WRA if all microphones are independently recognized
• HV: WRA if microphones vote to determine the final system output
• Word: accuracy of one SVM trained to distinguish isolated digits
• WF: adds outputs of 170 binary word-feature SVMs
• WFV: like WF, but single-microphone recognizers vote to determine system output
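The difference between scoring microphones independently and letting them vote can be sketched as follows. This is a minimal illustration, not the paper's implementation: plain majority voting, the tie-breaking rule, and the example recognizer outputs are all assumptions, and the function names are hypothetical.

```python
from collections import Counter


def vote(hypotheses):
    """Return the word proposed by the most single-microphone recognizers.

    Assumption: ties go to the tied word that appears first in the list.
    """
    counts = Counter(hypotheses)
    top = max(counts.values())
    for word in hypotheses:
        if counts[word] == top:
            return word


def word_recognition_accuracy(predicted, reference):
    """Fraction of predictions that equal the reference word."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


# Hypothetical outputs of 7 single-microphone recognizers for two spoken digits.
mic_outputs = [
    ["five", "nine", "five", "five", "nine", "four", "five"],
    ["two", "two", "three", "two", "two", "two", "three"],
]
reference = ["five", "two"]

# Voting case: one decision per utterance.
voted = [vote(outputs) for outputs in mic_outputs]
hv_score = word_recognition_accuracy(voted, reference)

# Independent case: every microphone decision is scored on its own.
flat_pred = [w for outputs in mic_outputs for w in outputs]
flat_ref = [r for outputs, r in zip(mic_outputs, reference) for _ in outputs]
h_score = word_recognition_accuracy(flat_pred, flat_ref)
```

In this made-up example the voted output is correct for both utterances even though several individual microphones are wrong, which is the motivation for the voting variants.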
SVM-based ASR
• Fixed-length isolated word recognition
• Tested only on digits
• Two SVMs were used: a 10-ary SVM and binary feature SVMs.
Conclusion
• ASR can be used to recognize digits for talkers with very low intelligibility.
• HMM was successful for two subjects but failed for the subject who deletes consonants.
• SVM was successful for two subjects but failed for the subject with the stutter.
• Hence, HMM should be used when word length fluctuates. SVM should be used against deletion of consonants.
• But: a 10-word vocabulary is too small for HCI.
MOTIVATION
• Parkinson’s Disease (PD) is the second most common neurodegenerative disorder after Alzheimer’s
• Strong evidence has emerged linking speech degradation with PD progression.
• Current PD progress monitoring relies on empirical tests and physical exams, which are time-consuming and costly.
• Results are mapped to the Unified Parkinson’s Disease Rating Scale (UPDRS).
  - Motor UPDRS: 0-108
  - Total UPDRS: 0-176 (176 denotes total disability)
• Goal: use speech signal processing to map voice disorder to UPDRS scores
Data
• Sustained vowel speech recordings from 52 subjects with idiopathic PD diagnosis
• Subjects were physically assessed and given UPDRS scores at baseline, three months, and six months into the trial
• Subjects took tests at home weekly using the Intel At-Home Testing Device (AHTD).
• The subjects were required to sustain “ahh” for as long and as steadily as possible.
• Total of 5875 signals. Signals were processed in MATLAB.
• 42 subjects. Mean age: 64. Motor UPDRS: 20.84, Total UPDRS: 11.52
Intel @ home
Features
• Dysphonia measures were calculated using Praat:
  - Frequency perturbations
  - Amplitude perturbations
• Also added the log of each measure
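As a rough illustration of what a perturbation measure is, the sketch below computes a "local" relative perturbation: the mean absolute difference between consecutive cycles divided by the mean value. Applied to pitch periods this corresponds to a jitter-style measure, and applied to peak amplitudes to a shimmer-style measure. The slides do not say which Praat variants were used, so this particular definition, the function names, and the sample values are assumptions.

```python
import math


def relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def with_log_features(measures):
    """Return the measures plus a 'log_<name>' entry for each one.

    Assumes every measure is strictly positive; math.log raises otherwise.
    """
    out = dict(measures)
    for name, value in measures.items():
        out["log_" + name] = math.log(value)
    return out


# Hypothetical glottal cycle data from one sustained vowel.
periods_s = [0.0100, 0.0102, 0.0099, 0.0101]   # cycle lengths in seconds
amplitudes = [0.80, 0.78, 0.83, 0.79]           # peak amplitude per cycle

features = with_log_features({
    "jitter": relative_perturbation(periods_s),
    "shimmer": relative_perturbation(amplitudes),
})
```

Each raw measure and its logarithm would then form part of the feature vector for one phonation.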
Linear regression
• UPDRS values were obtained at 0, 3, and 6 months, but recordings were weekly. Hence linear interpolation was used to get weekly UPDRS values.
• Map the feature vector x to the UPDRS output y
• But the study ended up using Lasso regression.
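The two steps on this slide can be sketched in a few lines. The first function interpolates UPDRS linearly between clinic visits; the second fits a Lasso model by coordinate descent with soft thresholding. The slides do not describe the solver or the scaling of the penalty, so the objective (1/(2n))·||y − Xw||² + alpha·||w||₁, the absence of an intercept (data assumed centered), the visit weeks 0/13/26, and all names and numbers below are assumptions.

```python
def interpolate_updrs(week, visits):
    """Piecewise-linear UPDRS estimate at `week` from (week, score) visits."""
    visits = sorted(visits)
    if len(visits) < 2:
        raise ValueError("need at least two visits")
    if not visits[0][0] <= week <= visits[-1][0]:
        raise ValueError("week lies outside the assessed interval")
    for (t0, u0), (t1, u1) in zip(visits, visits[1:]):
        if t0 <= week <= t1:
            return u0 + (u1 - u0) * (week - t0) / (t1 - t0)


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(X, y, alpha, n_iter=200):
    """Coordinate descent for (1/(2n))*||y - Xw||^2 + alpha*||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for row, target in zip(X, y):
                pred_without_j = sum(w[k] * row[k] for k in range(d) if k != j)
                rho += row[j] * (target - pred_without_j)
                z += row[j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return w


# Hypothetical visits at roughly 0, 3 and 6 months (weeks 0, 13, 26).
visits = [(0, 20.0), (13, 23.0), (26, 22.0)]
updrs_week_6_5 = interpolate_updrs(6.5, visits)

# Toy data: y depends strongly on feature 0 and only weakly on feature 1.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.05, 1.95, -1.95, -2.05]
weights = lasso_fit(X, y, alpha=0.1)
```

On the toy data the weak feature's weight is driven exactly to zero, which is the sparsity property that makes Lasso useful for selecting among many correlated dysphonia measures.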
Results
• Mapping performance was analyzed by training on 5287 phonations and testing on 588.
• Used MAE (mean absolute error): MAE = (1/N) Σ |U_i − Û_i|, summed over the N test phonations.
  - U_i is the real UPDRS value
  - Û_i is the predicted UPDRS value
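The mean absolute error is the average of the absolute differences between the real and the predicted UPDRS values. A minimal sketch, with made-up scores and a hypothetical function name:

```python
def mean_absolute_error(actual, predicted):
    """Average of |U_i - U_hat_i| over all test samples."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(u - u_hat) for u, u_hat in zip(actual, predicted)) / len(actual)


# Hypothetical real and predicted UPDRS scores for four test phonations.
real = [20.0, 30.0, 25.0, 18.0]
pred = [22.0, 27.0, 25.0, 21.0]
error = mean_absolute_error(real, pred)
```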
Conclusion
• Overall success in prediction (error of 6.6 for motor UPDRS and 8.4 for total UPDRS)
• Discovered during the paper that a better method exists to measure dysphonia, and hence there is no need for the log transformation.
• LASSO regression here clearly shows that log-transformed classical dysphonia measures convey superior clinical information compared to the raw measures
G. E. Lancioni et al.
Motivation
• Students with multiple disabilities are unable to engage in constructive activity or play a positive role in their daily context
• Want to explore the use of verbal utterances to exert control over environmental events
• Microswitches are technical tools that may help them improve their status
• Main idea: build an utterance-based microswitch and test it with students.
Participants
• Tania, Alex, and Dennis.
• Aged 18, 27, and 26 years
• Severe intellectual disability
• Alex and Dennis are totally blind, while Tania can discriminate light.
• All of them have normal hearing and can produce a number of words / short sentences
Device: Auditory Microswitch
• Regular PC with audio output device
• Commercially available ASR (Dragon NaturallySpeaking).
• Proprietary control program that links each target utterance emitted by a participant with the words and phrases that the commercial software matched to it over different occurrences
• The words and phrases emitted are categorized based on phonetic, structure, and length rules.
• The categories served as recognition targets and as triggers for the activation of stimuli.
Selection of stimulus
• Each stimulus is connected to a participant's target utterance.
• Tania: funny story, special song
• Alex: singer hit song, person whistling
• Dennis: pet voice, local music
• The recognition of an utterance by the computer system produced the stimulus matching that utterance for 10-20 secs
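The linking idea, where whatever text the ASR software tends to produce for a target utterance is treated as a hit for that target, can be sketched as a lookup table. This is not the proprietary control program: the class, its methods, the example ASR variants, and the fixed 15-second duration (an arbitrary value inside the 10-20 second range) are all hypothetical.

```python
class UtteranceMicroswitch:
    """Maps ASR output text to a target utterance and its stimulus."""

    def __init__(self, duration_seconds=15):
        # Arbitrary choice within the 10-20 s range; the real rule is unknown.
        self.duration_seconds = duration_seconds
        self.stimulus_for = {}   # target utterance -> stimulus name
        self.target_for = {}     # normalized ASR text -> target utterance

    def add_target(self, target, stimulus, asr_variants):
        """Register a target and the ASR outputs observed for it at baseline."""
        self.stimulus_for[target] = stimulus
        for text in [target, *asr_variants]:
            self.target_for[text.strip().lower()] = target

    def on_recognition(self, asr_text):
        """Return (stimulus, seconds) if the text matches a target, else None."""
        target = self.target_for.get(asr_text.strip().lower())
        if target is None:
            return None
        return (self.stimulus_for[target], self.duration_seconds)


# Hypothetical setup: the spoken target "song" is often transcribed as
# "son" or "sung" by the commercial recognizer.
switch = UtteranceMicroswitch()
switch.add_target("song", "special song", ["son", "sung"])

hit = switch.on_recognition("Sung")
miss = switch.on_recognition("hello")
```

A lookup by exact text is the simplest possible category rule; matching on phonetic, structure, or length rules would need a richer comparison than shown here.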
Experiment
• Baseline
  - Participants speak samples of their target utterances. No stimuli sound.
  - 70 times over several days
  - Recordings were made of the words / phrases, and reference categories were built
• Intervention
  - Three groups of utterances: Tania (3, 2, 2 words), Alex and Dennis (4, 4, 4).
  - First group, baseline; second group, baseline; third group, baseline.
  - 10-20 min sessions. Record recognition.
• Post-intervention
  - 2 months after intervention.
  - 19-22 sessions such as those occurring during intervention
Result
Summary
• About 80% of the utterances were correctly recognized by the computer system
• Some of the utterances had a level of occurrence significantly higher (P < 0.01) than that expected by chance.
• The computer system was an adequate microswitch for the participants’ word utterances
• The use of the system can be considered a valuable strategy to increase the participants’ constructive verbal engagement and to allow their self-determination in seeking positive environmental stimulation
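One way to check whether an utterance occurs more often than chance is a one-sided binomial test. The slides do not say which test the study used, so the choice of test, the chance level of 1/3 (three utterances in a group), and the counts below are assumptions for illustration only.

```python
from math import comb


def binomial_p_value(successes, trials, chance):
    """One-sided P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical: one of three utterances accounts for 40 of 70 occurrences.
p = binomial_p_value(40, 70, 1 / 3)
significant = p < 0.01
```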