Importance of glottis landmarks for the assessment of cleft lip
and palate speech intelligibility
Sishir Kalita,a) S. R. Mahadeva Prasanna,b) and S. Dandapat
Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, Assam 781039, India
a) Electronic mail: sishir@iitg.ac.in
b) Also at Department of Electrical Engineering, Indian Institute of Technology Dharwad, Dharwad, Karnataka 580011, India.
(Received 15 May 2018; revised 20 September 2018; accepted 22 September 2018; published
online 6 November 2018)
The present work explores the acoustic characteristics of articulatory deviations near g(lottis) landmarks to derive the correlates of cleft lip and palate speech intelligibility. The speech region around the g landmark is used to compute two different acoustic features, namely, two-dimensional discrete cosine transform based joint spectro-temporal features and Mel-frequency cepstral coefficients. Sentence-specific acoustic models are built using these features extracted from the normal speakers' group. The mean log-likelihood score for each test utterance is computed and tested as the acoustic correlate of intelligibility. The derived intelligibility measure shows a significant correlation (ρ = 0.78, p < 0.001) with the perceptual ratings. © 2018 Acoustical Society of America.
https://doi.org/10.1121/1.5062838
[BHS]
Pages: 2656–2661
I. INTRODUCTION
Speech intelligibility of individuals with cleft lip and palate (CLP) is degraded primarily due to (i) hypernasality, (ii) articulation errors, (iii) nasal air emission, and (iv) voice disorders.1 Intelligibility is considered an essential measure to evaluate the speech outcome of different interventions and to estimate the overall speech production capability of CLP children. In the clinical environment, speech-language pathologists (SLPs) assess speech intelligibility using subjective methods based on auditory-perceptual evaluation.2 Perceptual evaluation has several inherent shortcomings, such as biased judgment, intra-rater and inter-rater variabilities, the requirement of trained SLPs, and a time-consuming process.3 The acoustic analysis of speech, which utilizes production knowledge, may supplement the perceptual assessment by presenting consistent results across speakers.4
Researchers have recently shown the significance of automatic speech recognition (ASR) techniques for quantifying the intelligibility of CLP speech.3,5,6 In these approaches, the word error rate is used to quantify speech intelligibility, and a significant correlation with the SLPs' perceptual scores is observed. However, a large amount of annotated data is needed to build the acoustic and language models for ASR based systems. Recently, landmark (LM) based speech analysis has been gaining research interest for evaluating pathological speech.4,7,8 Researchers have shown the potential of LM based speech analysis to derive biomarkers for speech intelligibility by characterizing the expression of LMs.4,7 LMs may guide the extraction of important acoustic features around specific locations of the speech signal where correlates of articulatory deviations are more salient. Despite the encouraging findings, limited works have been reported
in the literature in this direction. Besides, no attempts have
been made to analyze the CLP speech using the LMs.
LMs are defined as the time locations of abrupt acoustic events, which are correlated with major articulatory movements.4,7,9,10 Three types of abrupt acoustic LMs are defined for consonants: g(lottis), b(urst), and s(onorant).9 However, this work explores only the g LMs, to show that they can be utilized to derive acoustic correlates of CLP speech intelligibility. The g LMs are denoted +g and −g, which represent the starting and ending locations of the vocal folds' free vibration, respectively.9 The g LMs distinguish obstruent consonants from vowels or sonorant consonants; the vocalic transitions from an obstruent to these sounds and vice versa are associated with the +g and −g LMs, respectively.11 Such abrupt vocalic transition regions contain important perceptual cues for identifying the consonants.11
In CLP speech, the obstruents exhibit highly deviant characteristics due to the inadequate buildup of intra-oral pressure.1,12,13 The production of glottal and pharyngeal consonants, velarization of labial and palatal obstruents, nasalized consonants, weak obstruents, and nasal fricatives are the primary maladaptive articulations compensating for the obstruents.1,3,5 These maladaptive articulation patterns are among the primary factors that deteriorate CLP speech intelligibility. The wrong articulation patterns result in distortion of the spectro-temporal dynamics in the vicinity of the g LMs. Therefore, the acoustic features around g LMs may carry information about the deviant speech production in CLP speakers. It is hypothesized that analysis of the speech region anchored around g LMs may predict the degree of intelligibility loss in CLP speech. The b(urst) LMs, which signify an affricate or aspirated stop burst and the offset of frication or aspiration noise due to a stop closure, may not be fired consistently in CLP speech because of the inadequate buildup of intra-oral pressure. The detection of the s(onorant) LMs is more complicated, and their detection rate is abysmal compared to that of the g LMs.9 Moreover, only a very small number of nasals and approximants are present in the speech stimuli of the database.
The primary objective of the present work is to analyze how the acoustic information extracted from the speech region around the g LMs can be used to derive correlates of CLP speech intelligibility. To investigate this, an acoustic feature that gives an explicit representation of the temporal dynamics between two sounds is computed in the vicinity of both g LMs. Two separate sentence-specific Gaussian mixture models (GMMs) are built for each sentence stimulus, using the features extracted around the +g and −g LMs. Speech utterances from the normal speakers' group are used to train the GMMs. The GMMs derived for each sentence-level stimulus are used to compute the log-likelihood scores of the respective test utterance. The average value of the log-likelihood scores of features extracted from the speech region around the g LMs is calculated. Since two separate GMMs are built for +g and −g for each sentence stimulus, two average log-likelihood scores are obtained for each test utterance. Both scores are studied as acoustic correlates of CLP speech intelligibility.
II. METHODS
A. Database and perceptual evaluation
Speech samples of both CLP and healthy groups were recorded at the All India Institute of Speech and Hearing (AIISH) in Mysore, India. All the children with cleft had undergone primary surgery and did not have other congenital disorders or developmental problems. Only CLP children with adequate language abilities were considered for the study. In this work, 41 children (16 girls and 25 boys) with CLP in the age range of 6–11 yrs were considered, whereas 40 normal children (20 girls and 20 boys) of matched age, with proper speech and language characteristics, served as controls. Before the recording, ethical consent was obtained from the parents of each group of speakers.
We used ten phonetically balanced sentences rich in obstruents, as given in Table I. These sentences were designed by the SLPs of AIISH especially for the intelligibility evaluation of Kannada-speaking CLP individuals. Speech samples were recorded in a soundproof room using a directional microphone (Brüel & Kjær, Nærum, Denmark) with a sampling frequency of 44 kHz and 16-bit resolution on a mono channel. The database consists of around 800 (2 sessions × 40 speakers × 10 sentences) CLP speech utterances and around 820 (2 sessions × 42 speakers × 10 sentences) normal speech utterances.

TABLE I. Description of sentence-level stimuli used for intelligibility assessment [written in the International Phonetic Alphabet (IPA)].

S1: ka+ge ka+lu kappu
S2: gi+t9a be+ga ho+gu
S3: d9ana d9a+ri t9appit9u
S4: ba+lu t9abala ba+risu
S5: be+˜a ka+˜ige o+˜id9a
S6: sarit9a kat9t9ari t9a+
S7: Sivana u+ru ka+Si
S8: Ta+Ta+ Tapa+t9i ko˜u
S9: paa paa bha+vua
S10: t9a+t9a t9abala t9a+
Three SLPs of AIISH, Mysore, each having around 5 years of experience in the field of CLP speech evaluation, assessed the sentence-level intelligibility by a perceptual evaluation method. The auditory-perceptual evaluation was conducted in a soundproof room. All three SLPs used the same computer setup to listen to the samples. The stimuli were presented through headphones, and the intensity was held consistent across all the raters. SLPs were allowed to listen to a sample as many times as needed before making a decision. All three SLPs rated every utterance in the database. The order of the samples presented for evaluation was randomized, at the sentence level as well as the speaker level. The evaluation was performed in three sessions and accomplished in two consecutive days. SLPs used a 4-point equal-appearing-interval scale to rate the intelligibility score of each sentence. The scale ranges from 0 to 3, where 0 = near to normal, 1 = mild, 2 = moderate, and 3 = severe. For reference purposes, speech files with different intelligibility levels (ILs) are provided at the link: https://drive.google.com/drive/folders/1m6VeY09IAuyuCw46e3sRhwJJowEeKH80?usp=sharing. These speech files are also used to generate Fig. 1.
FIG. 1. (Color online) Time waveforms with +g and −g LMs, spectrograms, and log-likelihood scores at the detected +g and −g LMs of target sentence S1 for normal [(a), (b), and (c)], CLP IL 0 [(d), (e), and (f)], CLP IL 1 [(g), (h), and (i)], CLP IL 2 [(j), (k), and (l)], and CLP IL 3 [(m), (n), and (o)], respectively. All vowels and lateral liquids of the CLP speech are nasalized. Upward dashed arrows and downward solid arrows represent the +g LMs and −g LMs, respectively. Dashed rectangles mark the locations of the target /g/ phoneme.

B. Detection of g LMs

The detection of the +g and −g LMs is based on the work proposed by Liu in Ref. 9. Initially, the wideband short-term Fourier transform (STFT) is computed from the speech signal
with a window size of 5 ms and a shift of 1 ms. From the
derived wideband STFT, the frequency band of range
0–400 Hz is extracted.9 The energy in the band is computed by averaging the squared magnitude of the STFT over the corresponding frequency band. The computed band energy is passed through a two-pass (coarse–fine) system to avoid spurious detections and to obtain high time resolution. A +g LM corresponds to an abrupt (6 dB or more) energy increase in the band energy, while a −g LM corresponds to a sharp energy decrease in the same band.9 The detected g LMs are treated as anchor points in the speech signal around which the acoustic features are extracted.
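To make the detection procedure concrete, the following is a minimal sketch of the band-energy logic described above, assuming NumPy and SciPy. Liu's detector uses a more elaborate multi-band, coarse-to-fine scheme with landmark pairing constraints; here only the 0–400 Hz band, a single smoothing pass, and simple peak picking are shown, and all names and settings besides the 6 dB criterion are illustrative.

```python
# Minimal sketch of g LM detection from low-band energy (after Liu, Ref. 9).
# Simplified: one band, one smoothing pass; the full detector is coarse-to-fine.
import numpy as np
from scipy.signal import stft, find_peaks

def detect_g_landmarks(x, fs=16000, thresh_db=6.0):
    # Wideband STFT: 5 ms window, 1 ms shift.
    nperseg = int(0.005 * fs)
    hop = int(0.001 * fs)
    f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    # Energy in the 0-400 Hz band, averaged over the band, in dB.
    band = f <= 400.0
    e = 10.0 * np.log10(np.mean(np.abs(Z[band]) ** 2, axis=0) + 1e-12)
    # Smooth, then measure the energy change across a 50 ms span
    # (an illustrative stand-in for the coarse/fine two-pass processing).
    e = np.convolve(e, np.ones(20) / 20.0, mode="same")
    span = 50  # frames = 50 ms at a 1 ms shift
    d = e[span:] - e[:-span]
    # +g: abrupt rise of >= 6 dB; -g: abrupt fall of >= 6 dB.
    up, _ = find_peaks(d, height=thresh_db)
    dn, _ = find_peaks(-d, height=thresh_db)
    return t[up + span // 2], t[dn + span // 2]  # landmark times in seconds
```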
C. Feature extraction and acoustic modeling

This work explores two features, namely, Mel-frequency cepstral coefficients (MFCCs) and two-dimensional discrete cosine transform (2D-DCT) based joint spectro-temporal features (JSTFs), to capture the acoustic deviations around the +g and −g LMs of CLP speech. Initially, all the speech samples are down-sampled to 16 kHz and pre-emphasized with a factor of 0.97. For the computation of MFCCs, the speech signal is short-term processed with a Hanning window of size 15 ms and a shift of 5 ms. The Fourier transform is computed for each short-term speech frame. The Fourier magnitude spectrum is then passed through a Mel-filter bank of 40 filters, and the DCT of the log magnitude of the Mel-filter bank output is computed. The first 13 dimensions, which represent the compressed cepstral representation of the speech frame, are termed the MFCCs. Along with the base 13-dimensional MFCCs (excluding the C0 coefficient), Δ and ΔΔ variants are also appended, which results in a 39-dimensional MFCC feature vector. See the supplemental material for a detailed procedure to extract MFCC features anchored around g LMs.14
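As a concrete illustration of this pipeline, the sketch below computes such 39-dimensional MFCC vectors for a landmark-anchored region, assuming librosa; the function name, the region arguments, and the short 3-frame delta window (needed because the anchored segments are only a handful of frames long) are choices of this sketch, not specified in the paper.

```python
# Sketch: 39-dim MFCCs (13 cepstra excluding C0, plus deltas and delta-deltas)
# over a g LM anchored region; 15 ms Hann window, 5 ms shift, 40 Mel filters.
import numpy as np
import librosa

def mfcc_around_lm(x, fs, lm_time, before=0.040, after=0.040):
    # Cut the landmark-anchored region (e.g., 40 ms on both sides for +g;
    # use after=0.0 for the 40 ms offset region before a -g LM).
    s = x[max(0, int((lm_time - before) * fs)):int((lm_time + after) * fs)]
    s = np.append(s[0], s[1:] - 0.97 * s[:-1])        # pre-emphasis, 0.97
    c = librosa.feature.mfcc(y=s, sr=fs, n_mfcc=14, n_mels=40,
                             n_fft=int(0.015 * fs),
                             hop_length=int(0.005 * fs),
                             window="hann")[1:]       # drop C0 -> 13 dims
    d1 = librosa.feature.delta(c, width=3)            # short segments, so a
    d2 = librosa.feature.delta(c, width=3, order=2)   # 3-frame delta window
    return np.vstack([c, d1, d2]).T                   # (frames, 39)
```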
The motivation for using the JSTFs is that they can capture the spectral and temporal modulations present in the transition region better than MFCCs, especially in the case of obstruents.15 The intelligibility of CLP speech degrades primarily due to the articulation problems of obstruents. Therefore, it is expected that JSTFs may model the articulation deviations more effectively than MFCCs.16 To compute the 2D-DCT based JSTFs, the Fourier magnitude spectrum is estimated from the windowed speech, which is then processed by the Mel-filter banks.16 The log magnitude of each Mel-spectral energy vector is computed, and such vectors are stacked temporally to form a matrix representation, termed the Mel time-frequency representation (Mel-TFR) in this paper. This Mel-TFR is used to extract the JSTFs. Overlapping 2D patches of the Mel-TFR are extracted and projected onto the 2D-DCT basis. For each 2D patch, the spectral and temporal extents are the number of Mel-filter banks and the number of frames corresponding to a 40 ms speech segment, respectively. The 2D-DCT coefficients encode the spectro-temporal dynamics embedded in the 2D patch. Later, the low-order 2D-DCT coefficients (13 horizontal and 3 vertical) are retained, which provide a compact representation of the 2D patch. The resultant matrix is converted to a 39-dimensional (13 × 3) vector, which is termed the Mel-2DDCT feature in the present work. See the supplemental material for a detailed procedure to extract Mel-2DDCT features anchored around g LMs.14
For each utterance, the +g and −g LMs are detected. Then, around each +g LM, a region of 80 ms (40 ms before and 40 ms after) is considered to derive the features, so that the obstruent characteristics and the formant-transition information in the adjacent sonorant sound can be captured, while around each −g LM, a segment of 40 ms (40 ms before) is considered to derive features from the offset transition region. For each sentence-level stimulus, the features derived from the regions around the +g and −g LMs are used to build two separate GMMs. These two GMMs represent the acoustic space of the corresponding sentence stimulus.
For testing an utterance, features are extracted from the speech regions around the +g and −g LMs. Then, the log-likelihood scores for the extracted features are derived from the respective GMM. Consider the features extracted from the speech segments around the +g LMs of a test S1 sentence, denoted $F^{S1}_{+g}$. The log-likelihood scores for the features $F^{S1}_{+g}$ are computed from the respective GMM ($\lambda^{S1}_{+g}$), i.e., the GMM built using the features extracted around the +g LMs from the normal versions of the S1 sentence. A similar process is applied to compute the log-likelihood scores for the features computed around the −g LMs. For each test utterance, two average likelihood scores are computed, one for the features around the +g LMs and another for the −g LMs. These two mean likelihood scores are considered the acoustic correlates of CLP speech intelligibility. The process of log-likelihood score computation is identical for all ten sentence-level stimuli used herein.
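In code, the modeling and scoring stage might look like the sketch below, assuming scikit-learn's GaussianMixture; the covariance type and initialization are unspecified in the paper and are assumptions here (the 64-component setting echoes the best S1 configuration reported in Sec. III B).

```python
# Sketch: one sentence-specific GMM per LM type, and the utterance-level
# mean log-likelihood used as the acoustic correlate of intelligibility.
from sklearn.mixture import GaussianMixture

def train_sentence_gmm(X_normal, n_components=64):
    # X_normal: (n_frames, dim) features pooled over the normal speakers'
    # regions around one LM type (+g or -g) of one sentence.
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",  # assumption
                           random_state=0).fit(X_normal)

def mean_log_likelihood(gmm, X_test):
    # score() returns the average per-sample log-likelihood, i.e., the
    # utterance-level score studied as the intelligibility correlate.
    return gmm.score(X_test)

# Two scores per test utterance of sentence S1:
#   s_plus  = mean_log_likelihood(gmm_S1_plus_g,  F_S1_plus_g)
#   s_minus = mean_log_likelihood(gmm_S1_minus_g, F_S1_minus_g)
```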
III. RESULTS AND DISCUSSION
In this section, we discuss the experimental results and the performance evaluation of the proposed g LM based intelligibility assessment technique. Initially, we assessed the inter-rater agreement of the intelligibility ratings. To compare the rating agreement, we computed Fleiss' kappa, which is used to assess the agreement among more than two raters. We found κ = 0.62 (p < 0.001), with a 95% confidence interval of (0.59, 0.66), which is sufficiently reliable to consider as the ground truth. The median value of the three raters' scores was taken as the ground truth for the current work. Table II provides a detailed description of the number of CLP individuals belonging to each IL for the respective sentence-level stimuli. We then analyzed the expression of g LMs in CLP speech and studied how the acoustic characteristics near these LMs deviate, considering one particular sentence stimulus as a case study. Finally, the performance evaluation of the proposed acoustic correlates of CLP speech intelligibility is discussed.

TABLE II. Details of speakers in each IL for each sentence-level stimulus.

Intelligibility rating   S1  S2  S3  S4  S5  S6  S7  S8  S9  S10
0                         8   9   8   7   8   7   7   6   7    7
1                        11  11  10  13  12   8  11  11   9   11
2                        15  15  14  13  14  15  13  13  14   15
3                         7   6   9   8   7  11  10  12  11    7
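For completeness, the agreement analysis above can be reproduced with a short sketch, assuming statsmodels; `ratings` is an illustrative (n_utterances × 3) array of the three SLPs' 0–3 scores.

```python
# Sketch: Fleiss' kappa across the three raters and the median ground truth.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def agreement_and_ground_truth(ratings):
    counts, _ = aggregate_raters(ratings)      # (n_utterances, n_categories)
    kappa = fleiss_kappa(counts)               # chance-corrected agreement
    ground_truth = np.median(ratings, axis=1)  # median of the three raters
    return kappa, ground_truth
```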
A. Analysis of g LMs in CLP speech
Figures 1(a)–1(b), 1(d)–1(e), 1(g)–1(h), 1(j)–1(k), and 1(m)–1(n) show the speech signal of the target S1 sentence (see Table I) with the +g and −g LMs and the spectrograms for normal, CLP 0, CLP 1, CLP 2, and CLP 3, respectively. In the normal speech signal, five +g and five −g LMs are detected, and all the +g and −g LMs are associated with the obstruent-to-vowel and the vowel-to-obstruent transition regions, respectively. However, the number of g LMs detected for the speech signal of CLP 0 is reduced, as the phoneme /g/ is heavily voiced [dashed rectangle in Fig. 1(d), around 0.2 s]. The acoustic characteristics near the +g and −g LMs of CLP 0 are almost similar to normal, which can be seen from the respective spectrogram in Fig. 1(e). Though the number of g LMs in CLP 1 and CLP 2 speech is the same as in normal speech, the acoustic characteristics near the LMs deviate from those of normal. These acoustic characteristics represent the static and dynamic cues related to obstruent production. In CLP 1 speech, the phoneme /g/ is replaced by an unvoiced sound [dashed rectangle in Fig. 1(g), around 0.3 s], while in CLP 2, it is compensated by a glottal stop [dashed rectangle in Fig. 1(j), around 0.3 s], which distorts the voice bar and the formant transitions in the adjacent vowels. In the speech signal of CLP 3, all the obstruents are replaced by nasal consonants, which results in only one +g and one −g LM [Fig. 1(m)]. Thus, apart from the deviations in the LM expression, the acoustic correlates of stop production, such as burst energy and formant transitions in the consonant-vowel and vowel-consonant transition regions, are also distorted. Therefore, the analysis of acoustic features around the +g and −g LMs may provide the degree of loss in intelligibility. Figures 1(c), 1(f), 1(i), 1(l), and 1(o) show the log-likelihood scores around the +g and −g locations in the case of sentence S1 for normal, CLP 0, CLP 1, CLP 2, and CLP 3, respectively. The log-likelihood scores are computed from the
Mel-2DDCT based +g and −g models around the +g and −g LMs, respectively. As the intelligibility degrades from mild to severe, the log-likelihood scores decrease accordingly. A careful observation of the log-likelihood scores around the g LMs of CLP speech provides some information about the localized articulatory deviations near them. However, the current work exploits only the global deviations of the log-likelihood scores for each g LM. Future exploration is required to investigate the localized log-likelihood scores.

TABLE III. Counts of the +g and −g LMs in normal and different ILs of CLP speech in the case of sentence S1.

           Normal  CLP 0  CLP 1  CLP 2  CLP 3
# +g LMs     29     28     30     31     28
# −g LMs     29     28     30     31     28
We have also examined how the counts of the +g and −g LMs vary with intelligibility degradation; this is done only for the sentence-level stimulus S1. The counts of the +g and −g LMs in normal and different ILs of CLP speech in the case of sentence S1 are tabulated in Table III. Here, the same number of speakers is considered in each IL. Apart from this, we also studied the difference in the overall counts of +g and −g LMs between normal and CLP speech. The counts of the (+g, −g) LMs in normal and CLP speech for the sentence S1 are (167, 167) and (174, 174), respectively. From Table III, it can be seen that the number of g LMs is increased in the case of CLP speech. In this experiment, the S1 sentence uttered by 40 healthy and 40 CLP speakers is considered.
B. Performance evaluation

The log-likelihood scores of the different intelligibility groups are shown in Fig. 2 for all the explored features. As the intelligibility degrades, the mean log-likelihood scores decrease for all the features used herein. It can be seen from Fig. 2 that the discrimination is greatest for the Mel-2DDCT based +g model [Fig. 2(c)], and the least discrimination is observed for the MFCC based −g model [Fig. 2(b)].

FIG. 2. (Color online) Box plots of the log-likelihood scores for different levels of intelligibility in the case of sentence S1. (a) MFCCs (+g model), (b) MFCCs (−g model), (c) Mel-2DDCT (+g model), and (d) Mel-2DDCT (−g model).

For quantitative evaluation, the Spearman's rank correlation coefficient (ρ) between the estimated scores and the perceptual ratings of intelligibility is computed. Since both variables, i.e., the perceptual ratings and the log-likelihood scores, are non-normally distributed, Spearman's rank correlation coefficient is appropriate.17 Since we have a limited amount of speech data, leave-one-speaker-out cross-validation (LOSO-CV) is carried out for the performance evaluation of each sentence-level stimulus. The acoustic-phonetic composition of each sentence stimulus is different; thus, the number of Gaussians that can properly model the acoustic space of each sentence stimulus will differ. Hence, we experimented with different numbers of component Gaussians to build the GMM, and the number of Gaussians giving the best result is considered for the evaluation.
In each fold of LOSO-CV, the utterances of all the CLP children except one speaker are used to build a linear regression model. To build the regression model, the log-likelihood scores and the respective perceptual ratings are taken as the independent and dependent variables, respectively. The derived regression model is used to estimate the intelligibility scores of the left-out CLP child's speech utterances. This step is repeated for 41 folds, as speech data of 41 CLP individuals are considered, and the intelligibility scores of the test utterances are computed in each fold. In the end, the Spearman rank correlation coefficient between the predicted and perceptual intelligibility ratings is calculated to determine the prediction accuracy. This cross-validation process is applied separately for each sentence stimulus used herein.
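A sketch of this evaluation loop, assuming scikit-learn and SciPy; the array names are illustrative, with `scores`, `ratings`, and `speakers` holding, for one sentence, the mean log-likelihood score, the perceptual rating, and the speaker identity of each CLP utterance.

```python
# Sketch: leave-one-speaker-out CV with a linear regression from the mean
# log-likelihood score to the perceptual rating, evaluated by Spearman's rho.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

def loso_spearman(scores, ratings, speakers):
    preds = np.empty_like(ratings, dtype=float)
    for spk in np.unique(speakers):       # 41 folds, one per CLP child
        test = speakers == spk
        reg = LinearRegression().fit(scores[~test].reshape(-1, 1),
                                     ratings[~test])
        preds[test] = reg.predict(scores[test].reshape(-1, 1))
    rho, p = spearmanr(preds, ratings)    # prediction accuracy
    return rho, p
```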
Initially, the correlation between the estimated objective intelligibility scores and the subjective intelligibility ratings is considered for the S1 sentence, and the results for both features are listed in Table IV. The best result for the S1 sentence is obtained for the GMM with 64 Gaussians. It can be clearly observed that the correlation values are higher for the +g model than for the −g model, for both MFCCs (ρ = 0.74 vs 0.63) and Mel-2DDCT features (ρ = 0.78 vs 0.67). This higher correlation in the case of the +g model is justified, as it captures the characteristics of the transition regions and, most of the time, the preceding obstruent regions. The lowest correlation value is observed in the case of the MFCC based −g model (ρ = 0.63). For both +g and −g, Mel-2DDCT gives a comparatively higher correlation than MFCCs. The higher correlation in the case of Mel-2DDCT features shows the significance of JSTFs in better representing the acoustic characteristics near the LMs. Correlations between the objective intelligibility scores and the perceptual ratings are then calculated for the remaining sentence-level stimuli, and the average of the ten individual sentence-level correlations is considered for the evaluation of the overall performance of the system. Table V shows the average correlation over all the sentence-level stimuli, and observations similar to those for the S1 sentence can be made from it.
To study the statistical significance of the differences in correlation values, the Williams pair-wise significance test18 is performed. In this case, we perform the significance test for each pair of acoustic correlates of CLP speech intelligibility. Table VI lists the outcomes of these tests. In the table, each p-value in cell (i, j) indicates whether row measure i is correlated significantly more highly with the perceptual ratings than column measure j. From Table VI, we can clearly see that the increase in correlation for the 2DDCT based +g model (2DDCT+g) relative to the 2DDCT based −g model (2DDCT−g) and the MFCC based −g model (MFCC−g) is statistically significant at p < 0.05. Also, the increase in correlation for the MFCC based +g model (MFCC+g) relative to MFCC−g is statistically significant. However, the increase in correlation for the other pairs of acoustic correlates is not statistically significant at p < 0.05.
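A sketch of the test itself, implementing the standard Williams t statistic for comparing two correlations that share one variable (here, the perceptual ratings); r12 and r13 are the two measures' correlations with the ratings, r23 the correlation between the two measures, and n the number of utterances. This formulation is my reading of the cited test, not code from the paper.

```python
# Sketch: one-tailed Williams test (Ref. 18) for dependent correlations.
import numpy as np
from scipy.stats import t as t_dist

def williams_test(r12, r13, r23, n):
    # H1: measure 1 (correlation r12) correlates more highly than measure 2.
    K = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2.0
    t = (r12 - r13) * np.sqrt(
        (n - 1) * (1 + r23)
        / (2 * K * (n - 1) / (n - 3) + rbar**2 * (1 - r23) ** 3))
    return 1.0 - t_dist.cdf(t, df=n - 3)   # p-value for the one-tailed test
```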
We have shown that it is indeed possible to derive acoustic correlates of CLP speech intelligibility by extracting acoustic features in the vicinity of the g LMs. The potential of each of the two g LMs in estimating the intelligibility is shown individually. Since the features are extracted in the vicinity of abrupt spectral changes, acoustic features that can capture these abrupt spectral discontinuities are required. The 2D-DCT based JSTFs provide a more explicit representation of the temporal dynamics present in the transition region between two sounds than MFCCs do. For both LMs, the Mel-2DDCT feature outperforms MFCCs, which signifies the importance of JSTFs in deriving the acoustic correlates of intelligibility. The JSTFs may retain the critical discriminatory information in the time-frequency plane about
TABLE IV. Spearman’s rank correlation (q) between subjective intelligibility scores and mean log-likelihood scores at LMs (þg and g) of the S1
sentence for different features.
TABLE VI. p-values of Williams significance test between pairs of acoustic
correlates of intelligibility.
þg location
#
g location
Features
q
p
q
P
0.74
0.78
<0.001
<0.001
0.63
0.67
<0.001
<0.001
1
2
MFCCs
Mel-2DDCT
2660
J. Acoust. Soc. Am. 144 (5), November 2018
TABLE V. Average of ten individual sentence-level correlations for overall
performance evaluation.
þg location
#
1
2
g location
Features
MFCCs
Mel-2DDCT
—
—
—
—
2DDCTþg
0.05233
—
—
—
MFCCsþg
q
p
q
p
0.73
0.78
<0.001
<0.001
0.61
0.65
<0.001
<0.001
0.03590
0.14739
—
—
2DDCTg
0.00355
0.04028
0.20004
—
MFCCsg
2DDCTþg
MFCCsþg
2DDCTg
MFCCsg
Kalita et al.
the articulatory deviations in the CLP speech, which helps to
better discriminate among the groups.
This work studies only the sentence-level intelligibility scores obtained by exploiting the acoustic features around the g LMs; the global intelligibility score for each speaker is not explored. Since a separate GMM is needed for each sentence-level stimulus, this may increase the complexity of the proposed algorithm. Moreover, if normal speakers produce heavily voiced obstruent consonants, the g LMs may not be detected properly in their speech, which may degrade the performance of the proposed algorithm. Further refinement of the g LM detection will be needed in the case of heavily voiced obstruent consonants.
IV. CONCLUSION AND FUTURE DIRECTIONS

In this work, the importance of g LMs in deriving the acoustic correlates of CLP speech intelligibility is studied. Two acoustic features, namely, MFCCs and Mel-2DDCT features, are extracted in the vicinity of the g LMs to characterize the acoustic deviation near those LMs. For each sentence stimulus, two separate sentence-specific GMMs are built, for the +g and −g LMs, using the extracted features. During testing, utterance-wise mean log-likelihood scores are computed from the respective GMMs, and these are considered the proposed acoustic correlates of CLP speech intelligibility. Results show that the Mel-2DDCT based +g GMM gives the highest correlation (ρ = 0.78), while the MFCC based −g model gives the lowest correlation (ρ = 0.61) with the perceptual ratings. Since the speech is analyzed at the abrupt transition regions, the Mel-2DDCT feature is found to outperform the MFCCs. The current study may help to define a set of acoustic measures correlated with intelligibility that can be used as biomarkers of speech progress during therapy. Unlike the ASR based methods, the proposed method exploits only the acoustic information around the LMs; no linguistic information is used. LMs are language invariant; therefore, it may be easy to configure the proposed system for other languages as well. However, proper validation is needed to justify its language independence.
Future work is planned to explore the usefulness of the derived log-likelihood scores around the g LMs for studying the correlation of different articulation errors with intelligibility degradation. The global intelligibility score of each speaker will be examined using the sentence-level scores in future works. In this work, the +g and −g log-likelihood scores are considered independently as acoustic correlates of intelligibility. The combination of the two log-likelihood scores into a single acoustic correlate of intelligibility will be analyzed in future work.
ACKNOWLEDGMENTS

The authors would like to thank Professor M. Pushpavathi, Professor Ajish K. Abraham, and the expert SLPs of AIISH, Mysore, India for their valuable contribution to the perceptual evaluation of speech and for their suggestions. This work is supported in part by grants for the projects entitled "NASOSPEECH: Development of Diagnostic System for Severity Assessment of the Disordered Speech," funded by the Department of Biotechnology (DBT), Government of India, and "ARTICULATE+: A system for automated assessment and rehabilitation of persons with articulation disorders," funded by the Ministry of Human Resource Development (MHRD), Government of India.
1. A. Kummer, Cleft Palate & Craniofacial Anomalies: Effects on Speech and Resonance (Delmar, Clifton Park, NY, 2013).
2. G. Henningsson, D. P. Kuehn, D. Sell, T. Sweeney, J. E. Trost-Cardamone, and T. L. Whitehill, "Universal parameters for reporting speech outcomes in individuals with cleft palate," Cleft Palate-Craniofacial J. 45(1), 1–17 (2008).
3. A. Maier, C. Hacker, E. Nöth, E. Nkenke, T. Haderlein, F. Rosanowski, and M. Schuster, "Intelligibility of children with cleft lip and palate: Evaluation by speech recognition techniques," in 18th International Conference on Pattern Recognition (ICPR'06), Vol. 4, pp. 274–277 (2006).
4. K. Ishikawa, J. MacAuslan, and S. Boyce, "Toward clinical application of landmark-based speech analysis: Landmark expression in normal adult speech," J. Acoust. Soc. Am. 142(5), EL441–EL447 (2017).
5. M. Scipioni, M. Gerosa, D. Giuliani, E. Nöth, and A. Maier, "Intelligibility assessment in children with cleft lip and palate in Italian and German," in Interspeech 2009 (2009).
6. L. He, J. Zhang, Q. Liu, H. Yin, and M. Lech, "Automatic evaluation of hypernasality and speech intelligibility for children with cleft palate," in 8th IEEE Conference on Industrial Electronics and Applications (ICIEA) (2013), pp. 220–223.
7. T. M. DiCicco and R. Patel, "Automatic landmark analysis of dysarthric speech," J. Med. Speech-Lang. Pathology 16(4), 213–220 (2008).
8. K. Chenausky, J. MacAuslan, and R. Goldhor, "Acoustic analysis of PD speech," Parkinson's Dis. 2011, 1–13.
9. S. A. Liu, "Landmark detection for distinctive feature-based speech recognition," J. Acoust. Soc. Am. 100(5), 3417–3430 (1996).
10. K. N. Stevens, "Toward a model for lexical access based on acoustic landmarks and distinctive features," J. Acoust. Soc. Am. 111(4), 1872–1891 (2002).
11. C. Park, "Consonant landmark detection for speech recognition," Ph.D. thesis, Massachusetts Institute of Technology, 2008.
12. S. J. Peterson-Falzone, M. A. Hardin-Jones, and M. P. Karnell, Cleft Palate Speech (Mosby, St. Louis, 2001).
13. B. J. Philips and R. D. Kent, "Acoustic-phonetic descriptions of speech production in speakers with cleft palate and other velopharyngeal disorders," in Speech and Language (Elsevier, New York, 1984), Vol. 11, pp. 113–168.
14. See supplementary material at https://doi.org/10.1121/1.5062838 for details of feature extraction from the speech region around g LMs.
15. J. Bouvrie, T. Ezzat, and T. Poggio, "Localized spectro-temporal cepstral analysis of speech," in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (2008), pp. 4733–4736.
16. S. Strömbergsson, G. Salvi, and D. House, "Acoustic and perceptual evaluation of category goodness of /t/ and /k/ in typical and misarticulated children's speech," J. Acoust. Soc. Am. 137(6), 3422–3435 (2015).
17. M. M. Mukaka, "A guide to appropriate use of correlation coefficient in medical research," Malawi Med. J. 24(3), 69–71 (2012).
18. E. J. Williams, Regression Analysis (Wiley, New York, 1959), Vol. 14.