Background: The paper-based FDA

The type and severity of a given instance of dysarthria (disordered speech arising from impaired articulator control) can be diagnosed by an assessment procedure known as the Frenchay Dysarthria Assessment (FDA). Two of the three FDA intelligibility tests are concerned with the measurement of intelligibility… but what exactly is intelligibility anyway?

“The degree of success in establishing communication between the sender and intended recipient of a message”

Intelligibility, a very variable percept

Are both of these speech samples equally intelligible? (from the ABI Corpus, Birmingham Uni.)

Initially, a listener will find it more difficult to understand a newly encountered accent than a familiar one. Nonetheless, increased exposure to the initially unfamiliar speaking style will usually invoke a subconscious adaptation, a learning effect, making that speech easier to understand. This holds true even for dysarthric speech.

[Figure: Learning effect from repeated exposure to dysarthric speech data: mean score improvement (%) between round 1 and rounds 2–5 for ten judges, naïve listeners vs. expert listeners.]

Modelling the Naïve Listener

• If the learning effect alters a listener’s perception of a particular individual’s speaking style, is that listener’s judgement still representative of the naïve listener?
• If the learning effect introduces an inevitable bias, can a computer model be built which behaves like an “eternal” naïve listener (i.e. never adapting to an unfamiliar speaking style and therefore always consistent in its assessments)?

Possible Solution: Using HMMs to Emulate the Naïve Listener

• A hidden Markov model (HMM) is, essentially, a statistical representation of a speech unit at the phone/word/utterance level.
• HMMs are “trained” by analysing the acoustic features of multiple utterances (multiple speech samples from multiple speakers) representing the specified speech unit.

Goodness of Fit

Once trained, an HMM word model can be used to estimate the likelihood that a given speech sound could actually have been produced by that word model. This likelihood is called a goodness of fit (GOF) and can be expressed as a log likelihood, e.g. 10^-35 (or simply -35). The more acoustically dissimilar an utterance is from the speech the model has been trained on, the lower the GOF score. (“Mr. HMM Model, could you have been my daddy?” “Hmm, with a log likelihood of 10^-55, I’m not so sure…”)

Using Forced-Alignment GOF Scoring to Measure Intelligibility

Since two of the FDA intelligibility tests require the repetition of words/phrases from a pre-selected vocabulary, HMM utterance models can be built for these words/phrases, and the incoming speech can be matched to the corresponding utterance model to determine the goodness of fit. Matching a speech sample against one specific utterance model, and only that model, is called forced alignment. We hypothesise that force-aligning a speech sample with its corresponding “everyman” word model will yield GOF scores which are systematically related to that speech sample’s intelligibility. When HMMs are used in this way, we call them intelligibility estimators (IEs).

…so, how does it work in practice?

IE utterance models are trained on normal speech from a variety of speakers, and a range of GOF scores for normal-speech test data is established: typically between -5 and -10. Ranges have also been established for moderate- and low-intelligibility speech (which, in an FDA diagnostic context, means dysarthric speech): typically between -11 and -20 for moderately intelligible speech, and below -20 for low intelligibility. These scores are relative to the maximum likelihood utterance (MLU), i.e. the speech file with the highest GOF score in the IE’s training set.
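The pipeline just described (score an utterance against its force-aligned word model, normalise relative to the maximum likelihood utterance, then band the result) can be sketched as follows. Everything here is illustrative: the toy two-state discrete HMM, its parameters, and the observation sequences are invented, and a real IE would use acoustic models trained on continuous speech features.

```python
import math

def log_likelihood(obs, log_start, log_trans, log_emit):
    """Forward algorithm in log space: log P(obs | word model)."""
    states = range(len(log_start))
    # Initialise with the first observation.
    alpha = [log_start[s] + log_emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        alpha = [
            math.log(sum(math.exp(alpha[p] + log_trans[p][s]) for p in states))
            + log_emit[s][o]
            for s in states
        ]
    return math.log(sum(math.exp(a) for a in alpha))

def intelligibility_band(gof, mlu_gof):
    """Band a GOF score relative to the maximum likelihood utterance (MLU),
    using the ranges quoted in the text (normal down to -10, moderate
    down to -20, low below that)."""
    rel = gof - mlu_gof
    if rel >= -10:
        return "normal"
    if rel >= -20:
        return "moderate"
    return "low"

# Invented toy 2-state word model over a discretised feature alphabet {0, 1}.
log = math.log
start = [log(0.9), log(0.1)]
trans = [[log(0.7), log(0.3)], [log(0.2), log(0.8)]]
emit = [[log(0.8), log(0.2)], [log(0.3), log(0.7)]]

# "Forced alignment" here simply means every sample is scored against
# this one model and no other.
samples = [[0, 0, 1, 1], [1, 1, 1, 1], [1, 0, 1, 0]]
gofs = [log_likelihood(s, start, trans, emit) for s in samples]
mlu = max(gofs)
bands = [intelligibility_band(g, mlu) for g in gofs]
```

Note the design choice of working entirely in log space: raw likelihoods such as 10^-35 underflow floating-point arithmetic long before an utterance of realistic length has been scored.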
[Figure: Sample GOF scores. Normalised GOF scores (relative to the MLU) for isolated single words and for short sentence utterances, across ten normal speakers and three dysarthric speakers.]

Problem: How do we make IEs truly naïve?

“Everyman” HMM utterance models are not really “everyman”: it is not feasible to train them on speech data representing all the world’s anglophone accents. In this experiment, the utterance models have been trained on speech principally from the South Yorkshire region, so accents not represented in the HMM training data could receive GOF scores which do not truly reflect the speech sample’s intelligibility as perceived by a naïve listener.

A non-trivial problem: certain anglophone accents, owing to their prestige, are more universally intelligible than others (e.g. Estuary English and RP), while others are far less intelligible internationally (e.g. the Glaswegian accent). What mix of accents should be used to train an HMM word model to make it truly representative of a “typical” naïve listener?

Objective #2: Overall Diagnosis

After collecting data from all 28 FDA sub-tests, how do we arrive at a dysarthria sub-type diagnosis? Usually by template matching and symptom categorisation (e.g. “At-rest tasks performed better than in-speech tasks? If so, spastic dysarthria is most likely”). Can these processes be automated? Yes, via a neural network combined with an expert system.
The neural network does the basic pattern matching, while the rule-based expert system attempts to disambiguate diagnostic information not directly represented in the FDA letter grades.

Example of CFDA expert-system rule-based data disambiguation:

• Uncontrollably rapid speech rate?
  • Yes → hypokinetic dysarthria most likely of the 5 types.
  • No → slow speech rate?
    • No → flaccid dysarthria most likely of the 5 types.
    • Yes → extrapyramidal dysarthria less likely than the other 4 types.

[Figure: Diagnostic accuracy of the hybrid system. Classification accuracy (%) for each dysarthria sub-type (ataxic, extrapyramidal, flaccid, mixed, spastic), comparing FDT classification correctness, MLP classification correctness, and hybrid-system classification correctness (1st choice, and 1st or 2nd choice).]

The automated diagnostic system will even tell you why it came to a given decision…

Future Work

• Acquisition of HMM technology which (for the intelligibility estimator) does not carry prohibitively high licence fees.
• Collection of dysarthric data to build an FDA-specific dysarthric speech database.
• More interviews with experienced speech therapists to expand the diagnostic expert system’s knowledge base.
• Results of NHS field trials of the CFDA application.
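The speech-rate decision tree above is straightforward to encode as expert-system rules; here is a minimal sketch. The function name and its boolean inputs are invented for illustration and are not the real CFDA interface, which holds many more rules and works over the full set of FDA letter grades.

```python
# Toy encoding of the speech-rate disambiguation tree described above.
# Inputs are hypothetical boolean observations about the patient's speech.
def disambiguate(uncontrollably_rapid_rate: bool, slow_rate: bool) -> str:
    if uncontrollably_rapid_rate:
        return "hypokinetic dysarthria most likely of the 5 types"
    if not slow_rate:
        return "flaccid dysarthria most likely of the 5 types"
    return "extrapyramidal dysarthria less likely than the other 4 types"

print(disambiguate(uncontrollably_rapid_rate=False, slow_rate=True))
```

Because each rule is an explicit, human-readable branch, a system built this way can report the chain of rules it fired, which is how an expert system can explain why it came to a given decision.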