Importance of glottis landmarks for the assessment of cleft lip and palate speech intelligibility

Sishir Kalita,a) S. R. Mahadeva Prasanna,b) and S. Dandapat
Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, Assam 781039, India

The Journal of the Acoustical Society of America 144, 2656–2661 (2018); https://doi.org/10.1121/1.5062838
(Received 15 May 2018; revised 20 September 2018; accepted 22 September 2018; published online 6 November 2018)

The present work explores the acoustic characteristics of articulatory deviations near g(lottis) landmarks to derive the correlates of cleft lip and palate speech intelligibility. The speech region around the g landmark is used to compute two different acoustic features, namely, two-dimensional discrete cosine transform based joint spectro-temporal features and Mel-frequency cepstral coefficients. Sentence-specific acoustic models are built using these features extracted from the normal speakers' group. The mean log-likelihood score for each test utterance is computed and tested as an acoustic correlate of intelligibility. The derived intelligibility measure shows a significant correlation (ρ = 0.78, p < 0.001) with the perceptual ratings. © 2018 Acoustical Society of America.

I. INTRODUCTION

Speech intelligibility of individuals with cleft lip and palate (CLP) is primarily degraded due to (i) hypernasality, (ii) articulation errors, (iii) nasal air emission, and (iv) voice disorders.1 Intelligibility is considered an essential measure to evaluate the speech outcome of different interventions and to estimate the overall speech production capability of CLP children.
In the clinical environment, speech-language pathologists (SLPs) assess speech intelligibility through subjective methods based on auditory-perceptual evaluation.2 Perceptual evaluation has several inherent shortcomings, such as biased judgment, intra-rater and inter-rater variability, the requirement of trained SLPs, and a time-consuming process.3 The acoustic analysis of speech, which utilizes production knowledge, may supplement the perceptual assessment by providing consistent results across speakers.4 Currently, researchers have shown the significance of automatic speech recognition (ASR) techniques for quantifying the intelligibility of CLP speech.3,5,6 In these approaches, the word error rate is used to quantify speech intelligibility, and a significant correlation with the SLPs' perceptual scores is observed. However, a large amount of annotated data is needed to build the acoustic and language models for ASR based systems. Recently, landmark (LM) based speech analysis has been gaining research interest for evaluating pathological speech.4,7,8 Researchers have shown the potential of LM based speech analysis to derive a biomarker for speech intelligibility by characterizing the expression of LMs.4,7 LMs may guide the extraction of important acoustic features around specific locations of the speech signal where the correlates of articulatory deviations are most salient. Despite the encouraging findings, limited work has been reported in the literature in this direction. Besides, no attempts have been made to analyze CLP speech using the LMs.

a) Electronic mail: sishir@iitg.ac.in
b) Also at Department of Electrical Engineering, Indian Institute of Technology Dharwad, Dharwad, Karnataka 580011, India.
LMs are defined as the time locations of abrupt acoustic events, which are correlated with major articulatory movements.4,7,9,10 Three types of abrupt acoustic LMs are defined for consonants: g(lottis), b(urst), and s(onorant).9 However, this work explores only the g LMs to show that they can be utilized to derive acoustic correlates of CLP speech intelligibility. The g LMs are denoted as +g and −g, which represent the starting and ending locations of the vocal folds' free vibration, respectively.9 The g LMs distinguish obstruent consonants from vowels or sonorant consonants, and the vocalic transitions from an obstruent to these sounds and vice versa are associated with the +g and −g LMs, respectively.11 Such abrupt vocalic transition regions contain important perceptual cues for identifying the consonants.11 In CLP speech, the obstruents exhibit highly deviant characteristics due to the inadequate buildup of intra-oral pressure.1,12,13 The production of glottal and pharyngeal consonants, velarization of labial and palatal obstruents, nasalized consonants, weak obstruents, and nasal fricatives are the primary maladaptive articulations substituted for the obstruents.1,3,5 These maladaptive articulation patterns are among the primary factors that deteriorate CLP speech intelligibility. Such misarticulations distort the spectro-temporal dynamics in the vicinity of the g LMs. Therefore, the acoustic features around g LMs may carry information about the deviant speech production in CLP speakers. It is hypothesized that analysis of the speech region anchored around g LMs may predict the degree of intelligibility loss in CLP speech. The b(urst) LMs, which signify an affricate or aspirated stop burst and the offset of frication or aspiration noise due to a stop closure, may not be fired consistently in CLP speech due to the inadequate buildup of intra-oral pressure.
The detection of the s(onorant) LMs is more complicated, and their detection rate is abysmal as compared to that of the g LMs.9 Moreover, only a very small number of nasals and approximants are present in the speech stimuli of the database. The primary objective of the present work is to analyze how the acoustic information extracted from the speech region around the g LMs can be used to derive correlates of CLP speech intelligibility. To investigate this, acoustic features which give an explicit representation of the temporal dynamics between two sounds are computed in the vicinity of both g LMs. Two separate sentence-specific Gaussian mixture models (GMMs) are built for each sentence stimulus, using the features extracted around the +g and −g LMs. Speech utterances from the normal speakers' group are used to train the GMMs. The GMMs derived for each sentence-level stimulus are used to compute the log-likelihood scores of the respective test utterances. The average of the log-likelihood scores of the features extracted from the speech region around the g LMs is calculated. Since two separate GMMs are built for +g and −g for each sentence stimulus, two average log-likelihood scores are obtained for each test utterance. Both scores are studied as acoustic correlates of CLP speech intelligibility.

II. METHODS

A. Database and perceptual evaluation

Speech samples of both CLP and healthy groups were recorded at the All India Institute of Speech and Hearing (AIISH) in Mysore, India. All the children with cleft had undergone primary surgery and did not have other congenital disorders or developmental problems. Only CLP children with adequate language abilities were considered for the study.
In this work, 41 children (16 girls and 25 boys) with CLP in the age range of 6–11 yrs were considered, whereas 40 normal children (20 girls and 20 boys) with matched age, having proper speech and language characteristics, served as controls. Before the recording, ethical consent was obtained from the parents of each group of speakers. We used ten phonetically balanced sentences rich in obstruents, as given in Table I. These sentences were designed by the SLPs of AIISH especially for the intelligibility evaluation of Kannada-speaking CLP individuals.

TABLE I. Description of sentence-level stimuli used for intelligibility assessment [written in International Phonetic Alphabet (IPA)].

S1: ka+ge ka+lu kappu
S2: gi+t9a be+ga ho+gu
S3: d9ana d9a+ri t9appit9u
S4: ba+lu t9abala ba+risu
S5: be+˜a ka+˜ige o+˜id9a
S6: sarit9a kat9t9ari t9a+
S7: Sivana u+ru ka+Si
S8: Ta+Ta+ Tapa+t9i ko˜u
S9: paa paa bha+vua
S10: t9a+t9a t9abala t9a+

Speech samples were recorded in a soundproof room using a directional microphone (Brüel & Kjær, Nærum, Denmark) with a sampling frequency of 44 kHz and 16-bit resolution on a mono channel. The database consists of around 800 (2 sessions × 40 speakers × 10 sentences) CLP speech utterances and around 820 (2 sessions × 42 speakers × 10 sentences) normal speech utterances. Three SLPs of AIISH, Mysore, each having around 5 years of experience in the field of CLP speech evaluation, assessed the sentence-level intelligibility by a perceptual evaluation method. The auditory-perceptual evaluation was conducted in a soundproof room. All three SLPs used the same computer setup to listen to the samples. The stimuli were presented through headphones, and the intensity was held consistent across all the raters. SLPs were allowed to listen to a sample as many times as needed before making a decision. All three SLPs rated every utterance in the database. The order of the samples presented for evaluation was randomized.
The randomization was done at the sentence level as well as the speaker level. The evaluation was performed in three sessions and accomplished over two consecutive days. SLPs used a 4-point equal-appearing-interval scale to rate the intelligibility score of each sentence. The scale ranges from 0 to 3, where 0 = near normal, 1 = mild, 2 = moderate, and 3 = severe. For reference purposes, speech files with different intelligibility levels (ILs) are provided at the link: https://drive.google.com/drive/folders/1m6VeY09IAuyuCw46e3sRhwJJowEeKH80?usp=sharing. These speech files are also used to generate Fig. 1.

FIG. 1. (Color online) Time waveforms with +g and −g LMs, spectrograms, and log-likelihood scores at detected +g and −g LMs of target sentence S1 for normal [(a), (b), and (c)], CLP IL 0 [(d), (e), and (f)], CLP IL 1 [(g), (h), and (i)], CLP IL 2 [(j), (k), and (l)], and CLP IL 3 [(m), (n), and (o)], respectively. All vowels and lateral liquids of CLP speech are nasalized. Upward dashed arrows and downward solid arrows represent the +g LMs and −g LMs, respectively. Dashed rectangles represent the locations of the target /g/ phoneme.

B. Detection of g LMs

The detection of the +g and −g LMs is based on the work proposed by Liu in Ref. 9. Initially, the wideband short-term Fourier transform (STFT) is computed from the speech signal with a window size of 5 ms and a shift of 1 ms. From the derived wideband STFT, the frequency band in the range 0–400 Hz is extracted.9 The energy in the band is computed by averaging the squared magnitude of the STFT over the corresponding frequency band. The computed band energy is passed through a two-pass (coarse and fine) system to suppress noise and to obtain high time resolution.
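As a rough illustration of this front end, the band-energy computation and a single-pass simplification of Liu's rise/fall rule can be sketched as follows. This is not the authors' code: the function names, the 512-point FFT size, and the 20 ms comparison span are illustrative assumptions, and Liu's coarse/fine two-pass smoothing is omitted.

```python
import numpy as np

def band_energy_db(x, fs, win_ms=5.0, hop_ms=1.0, f_lo=0.0, f_hi=400.0):
    """0-400 Hz band-energy contour in dB from a wideband STFT
    (5 ms Hanning window, 1 ms hop), following Liu (1996)."""
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    w = np.hanning(win)
    n_fft = 512                                   # assumed FFT size
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    energies = []
    for start in range(0, len(x) - win, hop):
        spec = np.fft.rfft(x[start:start + win] * w, n=n_fft)
        energies.append(np.mean(np.abs(spec[band]) ** 2))
    return 10.0 * np.log10(np.array(energies) + 1e-12)

def g_landmarks(e_db, lookahead=20, thresh_db=6.0):
    """Single-pass simplification: flag +g where the band energy rises by
    >= 6 dB over `lookahead` frames (20 ms at a 1 ms hop), and -g where
    it falls by >= 6 dB; the two-pass refinement is omitted."""
    rise = e_db[lookahead:] - e_db[:-lookahead]
    plus_g = np.where(rise >= thresh_db)[0]
    minus_g = np.where(rise <= -thresh_db)[0]
    return plus_g, minus_g
```

In practice, consecutive frames exceeding the threshold would be merged into a single landmark; that clean-up is part of what the coarse/fine passes accomplish.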
The +g LM corresponds to an abrupt (6 dB or more) increase in the band energy, while the −g LM corresponds to a sharp decrease in the same band.9 The detected g LMs are considered as anchor points in the speech signal around which acoustic features are extracted.

C. Feature extraction and acoustic modeling

This work explores two features, namely, Mel-frequency cepstral coefficients (MFCCs) and two-dimensional discrete cosine transform (2D-DCT) based joint spectro-temporal features (JSTFs), to capture the acoustic deviations around the +g and −g LMs of the CLP speech. Initially, all the speech samples are down-sampled to 16 kHz and pre-emphasized with a factor of 0.97. For the computation of MFCCs, the speech signal is short-term processed with a Hanning window of size 15 ms and a shift of 5 ms. The Fourier transform is computed for each short-term speech frame. Then, the Fourier magnitude spectrum is passed through a Mel-filter bank of 40 filters, and the DCT of the log magnitude of the Mel-filter bank output is computed. The first 13 dimensions, which represent a compressed cepstral representation of the speech frame, are termed the MFCCs. Along with the base 13-dimensional MFCCs (excluding the C0 coefficient), Δ and ΔΔ variants are also augmented, which results in a 39-dimensional MFCC feature vector. See the supplemental material for a detailed procedure to extract MFCC features anchored around g LMs.14 The motivation for using the JSTFs is that they can better capture the spectral and temporal modulations present in the transition region as compared to MFCCs, especially in the case of obstruents.15 The intelligibility of CLP speech degrades primarily due to the articulation problems of obstruents.
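The MFCC front end described above (pre-emphasis, 15 ms Hanning window, 5 ms shift, 40 Mel filters, 13 cepstra plus Δ and ΔΔ) can be sketched roughly as follows. This is an illustrative NumPy/SciPy implementation, not the authors' code; the 512-point FFT, the filterbank edge handling, and the gradient-based delta computation are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filt=40, n_fft=512, fs=16000):
    """Triangular Mel filterbank; rows are filters."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(x, fs=16000, win_ms=15, hop_ms=5, n_filt=40, n_ceps=13):
    """13 MFCCs (C1..C13, C0 excluded) per 15 ms frame, 5 ms shift."""
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])      # pre-emphasis
    win, hop, n_fft = fs * win_ms // 1000, fs * hop_ms // 1000, 512
    fb, w = mel_filterbank(n_filt, n_fft, fs), np.hanning(fs * win_ms // 1000)
    feats = []
    for s in range(0, len(x) - win, hop):
        mag = np.abs(np.fft.rfft(x[s:s + win] * w, n=n_fft))
        logmel = np.log(fb @ mag + 1e-10)
        feats.append(dct(logmel, type=2, norm='ortho')[1:n_ceps + 1])
    return np.array(feats)

def add_deltas(c):
    """Append first (Δ) and second (ΔΔ) differences -> 39 dimensions."""
    d = np.gradient(c, axis=0)
    return np.hstack([c, d, np.gradient(d, axis=0)])
```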
Therefore, it is expected that JSTFs may model the articulation deviations more effectively than MFCCs.16 To compute the 2D-DCT based JSTFs, the Fourier magnitude spectrum is estimated from the windowed speech, which is then processed by the Mel-filter banks.16 The log-magnitude of each Mel-spectral energy vector is computed, and such vectors are stacked temporally to form a matrix representation, termed the Mel time-frequency representation (Mel-TFR) in this paper. This Mel-TFR is used to extract the JSTFs. Overlapping 2D patches of the Mel-TFR are extracted and projected onto the 2D-DCT basis. For each 2D patch, the spectral and temporal extents are the number of Mel-filter banks and the number of frames corresponding to a 40 ms speech segment, respectively. The 2D-DCT coefficients encode the spectro-temporal dynamics embedded in the 2D patch. Later, the low-order 2D-DCT coefficients (13 horizontal and 3 vertical) are retained, which provide a compact representation of the 2D patch. The resultant matrix is converted to a 39-dimensional (13 × 3) vector, which is termed the Mel-2DDCT feature in the present work. See the supplemental material for a detailed procedure to extract Mel-2DDCT features anchored around g LMs.14 For each utterance, the +g and −g LMs are detected. Then, around each +g LM, a region of 80 ms (40 ms before and 40 ms after) is considered to derive the features, so that the obstruent characteristics and the formant transition information in the adjacent sonorant sound can be captured. Around each −g LM, a segment of 40 ms (40 ms before) is considered to derive features from the offset transition region. For each sentence-level stimulus, the features derived from the regions around the +g and −g LMs are used to build two separate GMMs. These two GMMs represent the acoustic space of the corresponding sentence stimulus. For testing an utterance, features are extracted from the speech region around its +g and −g LMs.
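A rough sketch of the landmark-anchored Mel-2DDCT extraction and the sentence-specific GMM modeling follows, assuming a Mel-TFR with a 5 ms frame shift (so 8 frames ≈ 40 ms). The helper names, the choice of which DCT axis is "horizontal," and the use of scikit-learn's `GaussianMixture` with diagonal covariances are illustrative assumptions, not details from the paper.

```python
import numpy as np
from scipy.fftpack import dct
from sklearn.mixture import GaussianMixture

def mel_2ddct(patch, n_spec=13, n_temp=3):
    """Project one Mel-TFR patch (n_mel x n_frames) onto the 2D-DCT basis
    and keep the low-order coefficients (13 x 3), flattened to 39 dims."""
    coef = dct(dct(patch, type=2, norm='ortho', axis=0),
               type=2, norm='ortho', axis=1)
    return coef[:n_spec, :n_temp].reshape(-1)

def landmark_patches(mel_tfr, lm_frames, before=8, after=8):
    """Cut patches around landmark frame indices. With a 5 ms shift,
    before=after=8 gives the 80 ms +g region; before=8, after=0 gives
    the 40 ms offset region at -g."""
    out = []
    for t in lm_frames:
        if t - before >= 0 and t + after <= mel_tfr.shape[1]:
            out.append(mel_2ddct(mel_tfr[:, t - before:t + after]))
    return np.array(out)

def train_sentence_gmm(normal_feats, n_components=64):
    """One sentence-specific GMM (e.g., the +g model for S1) trained on
    LM-anchored features pooled over all normal renditions."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag', random_state=0)
    return gmm.fit(np.vstack(normal_feats))

def mean_loglik(gmm, test_feats):
    """Mean per-vector log-likelihood of one test utterance's LM
    features: the proposed acoustic correlate of intelligibility."""
    return float(np.mean(gmm.score_samples(test_feats)))
```

In this setup, a sentence stimulus needs two such models, one trained on +g patches and one on −g patches, and each test utterance yields two mean log-likelihood scores.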
Then, the log-likelihood scores for the extracted features are derived from the respective GMM. Consider the features extracted from the speech segments around the +g LMs of a test S1 sentence, denoted F_{+g}^{S1}. The log-likelihood scores for the features F_{+g}^{S1} are computed from the respective GMM (λ_{+g}^{S1}), i.e., the GMM built using the features extracted around the +g LMs from the normal versions of the S1 sentence. A similar process is applied to compute the log-likelihood scores for the features computed around the −g LMs. For each test utterance, two average likelihood scores are computed: one for the features around the +g LMs and another for those around the −g LMs. These two mean likelihood scores are considered as the acoustic correlates of CLP speech intelligibility. The log-likelihood score computation process is the same for all ten sentence-level stimuli used herein.

III. RESULTS AND DISCUSSION

In this section, we discuss the experimental results and the performance evaluation of the proposed g LM based intelligibility assessment technique. Initially, we assessed the inter-rater agreement of the intelligibility ratings by computing Fleiss' kappa, which measures the agreement among more than two raters. We found κ = 0.62 (p < 0.001; 95% confidence interval: 0.59–0.66), which is reliable enough to serve as the ground truth. The median of the three raters' scores was taken as the ground truth for the current work. Table II provides a detailed description of the number of CLP individuals belonging to each IL for the respective sentence-level stimuli.

TABLE II. Details of speakers in each IL for each sentence-level stimulus.

IL    S1  S2  S3  S4  S5  S6  S7  S8  S9  S10
0      8   9   8   7   8   7   7   6   7   7
1     11  11  10  13  12   8  11  11   9  11
2     15  15  14  13  14  15  13  13  14  15
3      7   6   9   8   7  11  10  12  11   7

We then analyzed the expression of g LMs in CLP speech and studied how the acoustic
characteristics near these LMs deviate, using one particular sentence stimulus as a case study. Later, the performance evaluation of the proposed acoustic correlates of CLP speech intelligibility is discussed.

A. Analysis of g LMs in CLP speech

Figures 1(a)–1(b), 1(d)–1(e), 1(g)–1(h), 1(j)–1(k), and 1(m)–1(n) show the speech signals corresponding to the target S1 sentence (see Table I) with +g and −g LMs and spectrograms for normal, CLP 0, CLP 1, CLP 2, and CLP 3, respectively. In the normal speech signal, five +g and five −g LMs are detected, and all the +g and −g LMs are associated with the obstruent-to-vowel and vowel-to-obstruent transition regions, respectively. However, the number of g LMs detected for the speech signal of CLP 0 is reduced, as the phoneme /g/ is heavily voiced [dashed rectangle in Fig. 1(d), around 0.2 s]. The acoustic characteristics near the +g and −g LMs of CLP 0 are almost similar to normal, as can be seen from the respective spectrogram in Fig. 1(e). Though the number of g LMs in CLP 1 and CLP 2 speech is the same as in normal speech, the acoustic characteristics near the LMs deviate from those of normal. These acoustic characteristics represent the static and dynamic cues related to obstruent production. In CLP 1 speech, the phoneme /g/ is replaced by an unvoiced sound [dashed rectangle in Fig. 1(g), around 0.3 s], while in CLP 2, it is compensated by a glottal stop [dashed rectangle in Fig. 1(j), around 0.3 s], which distorts the voice bar and the formant transitions in the adjacent vowels. In the speech signal of CLP 3, all the obstruents are replaced by nasal consonants, which results in only one +g and one −g LM [Fig. 1(m)]. Thus, apart from the deviations in the LM expression, the acoustic correlates of stop production, such as burst energy and the formant transitions in the consonant-vowel and vowel-consonant transition regions, are also distorted.
Therefore, the analysis of acoustic features around the +g and −g LMs may indicate the degree of loss in intelligibility. Figures 1(c), 1(f), 1(i), 1(l), and 1(o) show the log-likelihood scores around the +g and −g locations in the case of sentence S1 for normal, CLP 0, CLP 1, CLP 2, and CLP 3, respectively. The log-likelihood scores are computed from the Mel-2DDCT based +g and −g models around the +g and −g LMs, respectively. As the intelligibility degrades from mild to severe, the log-likelihood scores decrease accordingly. A careful observation of the log-likelihood scores around the g LMs of CLP speech provides some information about the localized articulatory deviation near them. However, the current work exploits only the global deviations of the log-likelihood scores for each g LM. Future exploration is required to investigate the localized log-likelihood scores. We have also examined how the counts of the +g and −g LMs vary with the degradation of intelligibility; this is done only for the sentence-level stimulus S1. The counts of +g and −g LMs in normal and different ILs of CLP speech for sentence S1 are tabulated in Table III. Here, the same number of speakers is considered in each IL.

TABLE III. Counts of the +g and −g LMs in normal and different ILs of CLP speech in the case of sentence S1.

          Normal  CLP 0  CLP 1  CLP 2  CLP 3
# +g LM       29     28     30     31     28
# −g LM       29     28     30     31     28

Apart from this, we have also studied the difference in the overall counts of +g and −g LMs between normal and CLP speech. The counts of (+g, −g) LMs in normal and CLP speech for sentence S1 are (167, 167) and (174, 174), respectively; thus, the number of g LMs is increased in the case of CLP speech. In this experiment, the S1 sentence uttered by 40 healthy and 40 CLP speakers is considered.

B. Performance evaluation

The log-likelihood scores of the different intelligibility groups are shown in Fig. 2 for all the explored features.
As the intelligibility degrades, the mean log-likelihood scores also decrease for all the features used herein. It can be seen from Fig. 2 that the discrimination is greatest for the Mel-2DDCT based +g model [Fig. 2(c)], while the least discrimination is observed for the MFCC based −g model [Fig. 2(b)].

FIG. 2. (Color online) Box plots of the log-likelihood scores for different levels of intelligibility in the case of sentence S1. (a) MFCCs (+g model), (b) MFCCs (−g model), (c) Mel-2DDCT (+g model), and (d) Mel-2DDCT (−g model).

For quantitative evaluation, Spearman's rank correlation coefficient (ρ) between the estimated scores and the perceptual ratings of intelligibility is computed. Since both variables, i.e., the perceptual ratings and the log-likelihood scores, are non-normally distributed, Spearman's rank correlation coefficient is appropriate.17 Since we have a limited amount of speech data, leave-one-speaker-out cross-validation (LOSO-CV) is carried out for the performance evaluation of each sentence-level stimulus. The acoustic-phonetic composition of each sentence stimulus is different; thus, the number of Gaussians that can properly model the acoustic space of each sentence stimulus will also differ. Hence, we experimented with different numbers of component Gaussians to build the GMM, and the number of Gaussians giving the best result is used for the evaluation. In each fold of LOSO-CV, the utterances of all the CLP children except one speaker are used to build a linear regression model. To build the regression model, the log-likelihood scores and the respective perceptual ratings are considered as the independent and dependent variables, respectively. The derived regression model is then used to estimate the intelligibility scores of the held-out CLP child's speech utterances.
This step is repeated for 41 folds, as speech data of 41 CLP individuals are considered, and the intelligibility scores of the test utterances are computed at each fold. In the end, the Spearman rank correlation coefficient between the predicted and perceptual intelligibility ratings is calculated to determine the prediction accuracy. This cross-validation process is applied separately for each sentence stimulus used herein. Initially, the correlation between the estimated objective intelligibility scores and the subjective intelligibility ratings is considered for the S1 sentence, and the results for both features are listed in Table IV. The best result for the S1 sentence is obtained for the GMM with 64 Gaussians. It can be clearly observed that the correlation values are relatively higher for the +g model than for the −g model for both MFCCs (ρ = 0.74 vs 0.63) and Mel-2DDCT features (ρ = 0.78 vs 0.67). This higher correlation in the case of the +g model is justified, as it captures the characteristics of the transition regions and, most of the time, the preceding obstruent regions. The lowest correlation value is observed in the case of the MFCC based −g model (ρ = 0.63). For both +g and −g, Mel-2DDCT gives a comparatively higher correlation than MFCCs. The higher correlation in the case of Mel-2DDCT features shows the significance of JSTFs in better representing the acoustic characteristics near the LMs. Later, for the remaining nine sentence-level stimuli, the correlations between the objective intelligibility scores and the perceptual ratings are calculated. The average of the ten individual sentence-level correlations is then considered for the evaluation of the overall performance of the system. Table V shows the average correlations for all the sentence-level stimuli, and observations similar to those for the S1 sentence can be made from it.
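Under the simplifying assumption of one score/rating pair per speaker, the LOSO-CV regression and the Spearman evaluation described above might look like the following sketch (illustrative, not the authors' code; the function names are assumptions).

```python
import numpy as np
from scipy.stats import spearmanr

def loso_predict(scores, ratings):
    """Leave-one-speaker-out: for each held-out speaker, fit a linear
    regression (rating ~ log-likelihood score) on the remaining speakers
    and predict the held-out speaker's rating."""
    scores = np.asarray(scores, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    preds = np.empty_like(ratings)
    for i in range(len(scores)):
        mask = np.arange(len(scores)) != i
        slope, intercept = np.polyfit(scores[mask], ratings[mask], 1)
        preds[i] = slope * scores[i] + intercept
    return preds

def evaluate(scores, ratings):
    """Spearman's rank correlation between the LOSO predictions and the
    perceptual ratings; returns (rho, p-value)."""
    return spearmanr(loso_predict(scores, ratings), ratings)
```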
To study the statistical significance of the differences in correlation values, the Williams pair-wise significance test18 is performed. In this case, we perform the significance test for each pair of acoustic correlates of CLP speech intelligibility. Table VI lists the outcomes of these tests. In the table, the p-value in cell (i, j) indicates whether measure i (the row) is correlated significantly more highly with the perceptual ratings than measure j (the column). From Table VI, we can clearly see that the increase in correlation for the 2DDCT based +g model (2DDCT+g) relative to the 2DDCT based −g model (2DDCT−g) and the MFCC based −g model (MFCC−g) is statistically significant at p < 0.05. Also, the increase in correlation for the MFCC based +g model (MFCC+g) relative to MFCC−g is statistically significant. However, the increases in correlation for the other pairs of acoustic correlates are not statistically significant at p < 0.05. We have shown that it is indeed possible to derive acoustic correlates of CLP speech intelligibility by extracting acoustic features in the vicinity of the g LMs. The potential of both g LMs in estimating the intelligibility has been shown individually. Since the features are extracted in the vicinity of abrupt spectral changes, acoustic features which can capture these abrupt spectral discontinuities are required. The 2D-DCT based JSTFs provide a better explicit representation of the temporal dynamics present in the transition region between two sounds as compared to MFCCs. For both LMs, the Mel-2DDCT feature outperforms the MFCCs, which signifies the importance of JSTFs in deriving the acoustic correlates of intelligibility. The JSTFs may retain the critical discriminatory information in the time-frequency plane about the articulatory deviations in CLP speech, which helps to better discriminate among the groups.
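The paper cites Williams (1959) for this test; a commonly used formulation of Williams' t (as given by Steiger, 1980) for comparing two dependent correlations r13 and r23 that share the perceptual ratings as variable 3 can be sketched as follows. The function name and the one-sided p-value convention are assumptions; r12 is the correlation between the two acoustic measures, and the test has n − 3 degrees of freedom.

```python
import math
from scipy.stats import t as t_dist

def williams_test(r13, r23, r12, n):
    """Williams' t-test for the difference between two dependent
    correlations r13 and r23 sharing variable 3 (the perceptual
    ratings); r12 correlates the two measures. Returns (t, one-sided p)
    with df = n - 3."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r13 + r23) / 2.0
    t_stat = (r13 - r23) * math.sqrt(
        ((n - 1) * (1 + r12)) /
        (2 * ((n - 1) / (n - 3)) * det + rbar**2 * (1 - r12)**3))
    p = 1 - t_dist.cdf(t_stat, df=n - 3)
    return t_stat, p
```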
TABLE IV. Spearman's rank correlation (ρ) between the subjective intelligibility scores and the mean log-likelihood scores at the LMs (+g and −g) of the S1 sentence for different features.

                 +g location        −g location
Features         ρ       p          ρ       p
MFCCs            0.74    <0.001     0.63    <0.001
Mel-2DDCT        0.78    <0.001     0.67    <0.001

TABLE V. Average of the ten individual sentence-level correlations for overall performance evaluation.

                 +g location        −g location
Features         ρ       p          ρ       p
MFCCs            0.73    <0.001     0.61    <0.001
Mel-2DDCT        0.78    <0.001     0.65    <0.001

TABLE VI. p-values of the Williams significance test between pairs of acoustic correlates of intelligibility (rows: measure i; columns: measure j).

            MFCC+g     2DDCT−g    MFCC−g
2DDCT+g     0.05233    0.03590    0.00355
MFCC+g      —          0.14739    0.04028
2DDCT−g     —          —          0.20004

This work studies only the sentence-level intelligibility scores obtained by exploiting the acoustic features around the g LMs; a global intelligibility score for each speaker is not explored. Since a separate GMM is needed for each sentence-level stimulus, this adds to the complexity of the proposed algorithm. Moreover, if normal speakers produce heavily voiced obstruent consonants, the detection of the g LMs may not be proper in their speech, which may degrade the performance of the proposed algorithm. Further refinement of the g LM detection will be needed in the case of heavily voiced obstruent consonants.

IV. CONCLUSION AND FUTURE DIRECTIONS

In this work, the importance of g LMs in deriving the acoustic correlates of CLP speech intelligibility is studied. Two acoustic features, namely, MFCCs and Mel-2DDCT features, are extracted in the vicinity of the g LMs to characterize the acoustic deviation near those LMs. For each sentence stimulus, two separate sentence-specific GMMs are built for the +g and −g LMs using the extracted features.
While testing, utterance-wise mean log-likelihood scores are computed from the respective GMMs, and these are considered as the proposed acoustic correlates of CLP speech intelligibility. The results show that the Mel-2DDCT based +g GMM gives the highest correlation (ρ = 0.78), while the MFCC based −g model gives the lowest correlation (ρ = 0.61) with the perceptual ratings. Since the speech is analyzed at the abrupt transition regions, the Mel-2DDCT feature is found to outperform the MFCCs. The current study may help to define a set of acoustic measures correlated with intelligibility, which can be used as biomarkers of speech progression during therapy. Unlike the ASR based methods, the proposed method explores only the acoustic information around the LMs; no linguistic information is used. LMs are language invariant; therefore, it may be easy to configure the proposed system for other languages as well. However, proper validation is needed to justify its language independence. Future work is planned to explore the usefulness of the derived log-likelihood scores around the g LMs to study the correlation of different articulation errors with the degradation of intelligibility. The global intelligibility score of each speaker will be examined using the sentence-level scores in future work. In this work, the +g and −g log-likelihood scores are considered independently as acoustic correlates of intelligibility. The combination of both log-likelihood scores will be analyzed in a future work.

ACKNOWLEDGMENTS

The authors would like to thank Professor M. Pushpavathi, Professor Ajish K. Abraham, and the expert SLPs of AIISH, Mysore, India for their valuable contributions to the perceptual evaluation of speech and for their suggestions.
This work is supported in part by project grants for the projects entitled "NASOSPEECH: Development of Diagnostic System for Severity Assessment of the Disordered Speech," funded by the Department of Biotechnology (DBT), Government of India, and "ARTICULATE+: A system for automated assessment and rehabilitation of persons with articulation disorders," funded by the Ministry of Human Resource Development (MHRD), Government of India.

1 A. Kummer, Cleft Palate & Craniofacial Anomalies: Effects on Speech and Resonance (Delmar, Clifton Park, NY, 2013).
2 G. Henningsson, D. P. Kuehn, D. Sell, T. Sweeney, J. E. Trost-Cardamone, and T. L. Whitehill, "Universal parameters for reporting speech outcomes in individuals with cleft palate," Cleft Palate-Craniofacial J. 45(1), 1–17 (2008).
3 A. Maier, C. Hacker, E. Nöth, E. Nkenke, T. Haderlein, F. Rosanowski, and M. Schuster, "Intelligibility of children with cleft lip and palate: Evaluation by speech recognition techniques," in 18th International Conference on Pattern Recognition (ICPR'06) (2006), Vol. 4, pp. 274–277.
4 K. Ishikawa, J. MacAuslan, and S. Boyce, "Toward clinical application of landmark-based speech analysis: Landmark expression in normal adult speech," J. Acoust. Soc. Am. 142(5), EL441–EL447 (2017).
5 M. Scipioni, M. Gerosa, D. Giuliani, E. Nöth, and A. Maier, "Intelligibility assessment in children with cleft lip and palate in Italian and German," in Interspeech 2009 (2009).
6 L. He, J. Zhang, Q. Liu, H. Yin, and M. Lech, "Automatic evaluation of hypernasality and speech intelligibility for children with cleft palate," in 8th IEEE Conference on Industrial Electronics and Applications (ICIEA) (2013), pp. 220–223.
7 T. M. DiCicco and R. Patel, "Automatic landmark analysis of dysarthric speech," J. Med. Speech-Lang. Pathology 16(4), 213–220 (2008).
8 K. Chenausky, J. MacAuslan, and R. Goldhor, "Acoustic analysis of PD speech," Parkinson's Dis. 2011, 1–13 (2011).
9 S. A. Liu, "Landmark detection for distinctive feature-based speech recognition," J. Acoust. Soc. Am. 100(5), 3417–3430 (1996).
10 K. N. Stevens, "Toward a model for lexical access based on acoustic landmarks and distinctive features," J. Acoust. Soc. Am. 111(4), 1872–1891 (2002).
11 C. Park, "Consonant landmark detection for speech recognition," Ph.D. thesis, Massachusetts Institute of Technology, 2008.
12 S. J. Peterson-Falzone, M. A. Hardin-Jones, and M. P. Karnell, Cleft Palate Speech (Mosby, St. Louis, 2001).
13 B. J. Philips and R. D. Kent, "Acoustic-phonetic descriptions of speech production in speakers with cleft palate and other velopharyngeal disorders," in Speech and Language (Elsevier, New York, 1984), Vol. 11, pp. 113–168.
14 See supplementary material at https://doi.org/10.1121/1.5062838 for details of feature extraction from the speech region around g LMs.
15 J. Bouvrie, T. Ezzat, and T. Poggio, "Localized spectro-temporal cepstral analysis of speech," in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (2008), pp. 4733–4736.
16 S. Strömbergsson, G. Salvi, and D. House, "Acoustic and perceptual evaluation of category goodness of /t/ and /k/ in typical and misarticulated children's speech," J. Acoust. Soc. Am. 137(6), 3422–3435 (2015).
17 M. M. Mukaka, "A guide to appropriate use of correlation coefficient in medical research," Malawi Med. J. 24(3), 69–71 (2012).
18 E. J. Williams, Regression Analysis (Wiley, New York, 1959), Vol. 14.