Recognition of Patterns of Infant Cry for the Early Identification of Hypoacoustics and Asphyxia with Neural Networks
Orion F. Reyes-Galaviz* & Carlos Alberto Reyes-García**
*Instituto Tecnológico de Apizaco, Av. Tecnológico S/N, Apizaco, Tlaxcala, 90400, MEXICO
**Instituto Nacional de Astrofísica Óptica y Electrónica, Luis E. Erro 1, Tonantzintla, Puebla, 72840, MEXICO

Abstract: - In this paper we present an experimental study on infant cry aimed at the automatic recognition of patterns of normal and pathological cries. The work is mainly oriented to the identification of three types of cry: normal, hypoacoustic, and asphyxiating. The overall goal is to help doctors and pediatricians make early diagnoses of some pathologies in newborn babies. We describe the whole development process, which consists of the acoustic processing and the design, implementation, training, and testing of the neural network. For the acoustic processing we use feature extraction techniques such as LPC (Linear Prediction Coefficients) and MFCC (Mel Frequency Cepstral Coefficients), and for classification a feed-forward input-delay neural network trained by gradient descent with adaptive learning rate back-propagation. We also show and compare the results of several experiments, which are very encouraging.

Key-Words: - Infant Cry, Classification, Pattern Recognition, Neural Networks, Acoustic Features.

1 Introduction
It has been found that an infant's cry carries much information in its sound wave. For small infants crying is a form of communication, a very limited one, but similar to the way an adult communicates. Through the infant's cry we can detect the infant's physical or psychological state [1]. An experienced parent or an expert in the field of infant care can learn to distinguish the different kinds of cry of an infant. Nonetheless, it is very difficult for any of them to detect some pathologies based only on the perception of crying.
Pathological diseases in infants are commonly detected several months, often years, after the infant is born. If any of these diseases were detected earlier, they could be attended to, and their effects perhaps avoided, by the opportune application of treatments and therapies. Two pathologies without any notorious external evidence, which pediatricians would like to detect as early as possible, are asphyxia and hypoacoustics (deafness). The opportune diagnosis of asphyxia might save the baby's life or avoid future sequelae caused by the lack of oxygen in the brain. Likewise, the early recognition of deafness in newborn babies is particularly important, since undetected it may impair learning and speech development. In this work we present the development and experimentation of a system that classifies three kinds of cries. The sample cries used are recordings obtained from normal, deaf, and asphyxiating infants, aged from one day up to one year. For the acoustic feature extraction we used Praat, and for the implementation of the neural network that classifies the infant's crying we used Matlab. In the model presented here, we classify the original input vectors, without reduction, into three corresponding classes: normal, hypoacoustic, and asphyxiating cries. We have obtained classification scores that go from 96.67% up to 99.33%. In the following sections we describe the design and implementation of the whole automatic recognition process: the methods used to extract the acoustic characteristics, the definitions of the proposed system, and the algorithms used. Next we present the developed system as well as the results obtained. Finally, we give the conclusions and main references.
2 State of the Art
Considering the need to unveil the information hidden in the infant cry's waveform, little research has so far been done on the automatic processing and interpretation of infant cry, mainly for diagnosis purposes. Although not as many as expected, several recent studies on automatic infant cry recognition have been carried out, showing interesting results and emphasizing the importance of doing research in this field. In a pioneering work by Wasz-Hockert, spectral analysis was used to identify several types of crying [11]. Using classification methodologies based on Self-Organizing Maps, Cano [12] attempted to classify cry units from normal and pathological infants. Petroni used neural networks [7] to differentiate between pain and no-pain cries. Taco Ekkel [8] tried to classify newborn cry sounds into categories called normal and abnormal (hypoxia), and reports a correct classification rate of around 85% based on a radial basis function neural network. In [5], Reyes and Orozco classify cry samples from deaf and normal infants, obtaining recognition results that go from 79.05% up to 97.43%.

3 The Infant Cry Recognition Process
The infant cry automatic classification process (Fig. 1) is basically a pattern recognition problem, similar to Automatic Speech Recognition (ASR). The goal is to take the wave of the infant's cry as the input pattern and, at the end, obtain the kind of cry or pathology detected in the baby. Generally, the process of automatic cry recognition is done in two steps. The first step is known as signal processing, or feature extraction, whereas the second is known as pattern classification. In the acoustical analysis phase, the cry signal is first normalized and cleaned, and then analyzed to extract its most important characteristics as a function of time. Some of the most used signal processing techniques extract: linear prediction coefficients, cepstral coefficients, pitch, intensity, spectral analysis, and Holmes filter banks, among others. The set of obtained characteristics can be represented as a vector, and each vector can be taken as a pattern. The feature vector is compared with the knowledge the computer has in order to obtain the classified output. Four main pattern recognition approaches have traditionally been used: pattern comparison, stochastic models, knowledge-based systems, and connectionist models. In the present work we focus on the last approach.

Figure 1 - Infant Cry Automatic Recognition Process (Acoustical Analysis followed by Pattern Classification)

4 Acoustic Processing
The acoustic analysis implies the selection and application of normalization and filtering techniques, segmentation of the signal, feature extraction, and data compression. With the application of the selected techniques we try to describe the signal in terms of some of its fundamental components. A cry signal is complex and encodes more information than is needed for analysis and processing in real-time applications. For that reason, in our cry recognition system we use an extraction function as a front-end processor. Its input is a cry signal, and its output is a vector of features that characterizes key elements of the cry's sound wave. All the vectors obtained this way are later fed to a recognition model, first to train it and later to classify the type of cry. We have been experimenting with diverse types of acoustic characteristics, among which the Mel Frequency Cepstral Coefficients and the Linear Prediction Coefficients stand out for their utility.

3.1 MFCC (Mel Frequency Cepstral Coefficients)
The low-order cepstral coefficients are sensitive to the overall spectral slope, whereas the high-order cepstral coefficients are susceptible to noise. This property of the speech spectrum is captured by the Mel spectrum.
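This cepstral representation can be sketched in code. The following is a minimal NumPy illustration that builds a mel filter bank and computes cepstral coefficients from one frame's power spectrum; it is not the Praat implementation used in this work, and the O'Shaughnessy mel formula, the 8 kHz sampling rate, and the 20-filter bank are assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    """Hz to mel (O'Shaughnessy's formula)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(num_filters=20, fft_size=512, sample_rate=8000):
    """L triangular band-pass filters with centers evenly spaced on the mel
    scale: roughly linear below 1 kHz and logarithmic above it.
    Returns an array of shape (num_filters, fft_size // 2 + 1)."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                       num_filters + 2)
    bins = np.floor((fft_size + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    bank = np.zeros((num_filters, fft_size // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):            # rising edge of the triangle
            bank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge of the triangle
            bank[i - 1, k] = (right - k) / max(right - center, 1)
    return bank

def mfcc_frame(frame, bank, num_ceps=16):
    """MFCC of one windowed frame: power spectrum -> mel filter bank ->
    log energies -> DCT-II, keeping the first num_ceps coefficients."""
    power = np.abs(np.fft.rfft(frame, n=2 * (bank.shape[1] - 1))) ** 2
    log_e = np.log(np.maximum(bank @ power, 1e-10))   # floor avoids log(0)
    n = np.arange(len(log_e))
    return np.array([np.sum(log_e * np.cos(np.pi * q * (2 * n + 1)
                                           / (2 * len(log_e))))
                     for q in range(num_ceps)])
```

Applying `mfcc_frame` to every windowed frame of a sample, with 16 coefficients per frame, would yield a per-sample feature vector of the kind used in this work.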
The Mel spectrum operates on the basis of selective weighting of the frequencies in the power spectrum. Higher frequencies are weighted on a logarithmic scale, whereas lower frequencies are weighted on a linear scale. The Mel-scale filter bank is a series of L triangular band-pass filters designed to simulate the band-pass filtering believed to occur in the auditory system. This corresponds to a series of band-pass filters with constant bandwidth and spacing on the Mel frequency scale. On a linear frequency scale, this spacing is approximately linear up to 1 kHz and logarithmic at higher frequencies. Most recognition systems are based on the MFCC technique and its first- and second-order derivatives. The derivatives are normally approximated through a linear regression fit over an adjustable-size segment of consecutive information frames. The time resolution and the smoothness of the estimated derivative depend on the size of the segment [3].

Fig. 2 The MFCC filter bank

3.2 LPC (Linear Prediction Coefficients)
Linear Predictive Coding (LPC) is one of the most powerful speech analysis techniques, and one of the most useful methods for encoding good quality speech at a low bit rate. It provides extremely accurate estimates of speech parameters and is relatively efficient to compute. For these reasons, we are using LPC to represent the crying signals. Linear prediction is a mathematical operation where future values of a digital signal are estimated as a linear function of previous samples. In digital signal processing, linear prediction is often called linear predictive coding (LPC) and can thus be viewed as a subset of filter theory. In system analysis (a subfield of mathematics), linear prediction can be viewed as a part of mathematical modeling or optimization [4].

The particular way in which the data are segmented determines whether the covariance method, the autocorrelation method, or any of the so-called lattice methods of LP analysis is used. The first method we are using is the autocorrelation LP technique, where the autocorrelation function for the finite-length signal s(n) is defined by [6]:

R(l) = Σ_{n=0}^{N-1-l} s(n) s(n+l)

where l is the delay between the signal and its delayed version, and N is the total number of samples. The prediction error function between the data segment s(n) and the prediction ŝ(n) is defined as [6]:

E = Σ_{n} [s(n) - ŝ(n)]²

As the order of the LP model increases, more details of the power spectrum of the signal can be approximated.

4 Classification of Cry Patterns
The set of acoustic characteristics obtained in the extraction stage is generally represented as a vector, and each vector can be taken as a pattern. These vectors are later used in the classification process. There are four basic schools for the solution of the pattern classification problem: a) pattern comparison (dynamic programming), b) stochastic models (Hidden Markov Models, HMM), c) knowledge-based systems (expert systems), and d) connectionist models (neural networks). For the development of the present work we focus on the connectionist models, known as neural networks. We have selected this kind of model, in principle, because of its adaptation and learning capacity. Besides, one of its main functions is pattern recognition; these kinds of models are still under constant experimentation, but their results have been very satisfactory.

4.1 Neural Networks
In a DARPA study [9], neural networks were defined as systems composed of many simple processing elements that operate in parallel and whose function is determined by the network's structure, the strength of its connections, and the processing carried out by the processing elements or nodes.
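As a brief aside before continuing with neural networks, the autocorrelation LP analysis of Section 3.2 can be made concrete with a short sketch. The recursion below is the standard Levinson-Durbin solution of the autocorrelation equations; the NumPy formulation and the synthetic test signal are illustrative assumptions, not the Praat analysis used in this work.

```python
import numpy as np

def autocorr(s, max_lag):
    """R(l) = sum_{n=0}^{N-1-l} s(n) s(n+l), for l = 0 .. max_lag."""
    N = len(s)
    return np.array([np.dot(s[:N - l], s[l:]) for l in range(max_lag + 1)])

def lpc(s, order):
    """LP coefficients by the autocorrelation method (Levinson-Durbin).
    Returns (a, err) with a[0] = 1; the predictor is
    s_hat(n) = -sum_{k=1}^{order} a[k] * s(n - k)."""
    R = autocorr(s, order)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = R[0]
    for i in range(1, order + 1):
        # Reflection coefficient for the order-i model.
        k = -(R[i] + np.dot(a[1:i], R[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)     # prediction error shrinks as the order grows
    return a, err

# Example: recover the coefficients of a known 2nd-order autoregressive signal
# s(n) = 0.6 s(n-1) - 0.2 s(n-2) + e(n).
rng = np.random.default_rng(1)
e = rng.standard_normal(5000)
s = np.zeros(5000)
for n in range(2, 5000):
    s[n] = 0.6 * s[n - 1] - 0.2 * s[n - 2] + e[n]
a, err = lpc(s, order=2)         # a approx. [1, -0.6, 0.2]
```

Raising `order` adds coefficients and lowers `err`, matching the observation above that a higher-order LP model approximates more details of the power spectrum.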
We can train a neural network to perform a particular function by adjusting the values of the connections (weights) between the elements. Generally, neural networks are adjusted, or trained, so that a particular input leads to a specified, desired output. This situation is shown in Fig. 3. Here, the network adjusts itself based on a comparison between the output and the target, until the output of the network equals the target. Generally many such input/target pairs are used in this supervised learning to train a network. The training of a network is done through changes of the weights based on a suitable set of input vectors: after each input vector is presented and an output obtained, the output is compared with the desired output and the connection weights of the nodes are adjusted. Neural networks have been trained to perform complex functions in many application areas, including pattern recognition, identification, classification, speech, vision, and control systems. Nowadays neural networks can be trained to solve problems that are hard to solve with conventional methods. There are many kinds of learning and design techniques that multiply the options a user can take [2]. In general, training can be supervised or unsupervised. Supervised training methods are the most commonly used when labeled samples are available. Among the most popular models are the feed-forward neural networks, trained under supervision with the back-propagation algorithm. For the present work we have used variations of this basic model, briefly described below.

Fig 3. Training of a Neural Network

4.2 Feed Forward Input Delay Neural Network
The feed-forward input-delay neural network consists of N1 layers that use the dot-product weight function, which applies weights to an input to obtain weighted inputs.
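In outline, the forward pass of such a network can be sketched as follows. This is a minimal two-layer NumPy illustration with the input delays [0 1] used in this work; the tanh transfer function and the random weights are assumptions for the example, not the toolbox internals.

```python
import numpy as np

def forward(x_t, x_tm1, W0, W1, b1, W2, b2):
    """Forward pass of a two-layer feed-forward input-delay network.

    With input delays [0 1], the first layer sees the current input x_t and
    the previous input x_tm1; each delayed copy has its own weight matrix,
    and the dot products are summed before the transfer function."""
    hidden = np.tanh(W0 @ x_t + W1 @ x_tm1 + b1)  # first (delayed-input) layer
    return W2 @ hidden + b2                        # output layer

# Illustrative sizes matching the network used in this work (304-120-3).
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 304, 120, 3
W0 = 0.01 * rng.standard_normal((n_hid, n_in))
W1 = 0.01 * rng.standard_normal((n_hid, n_in))
W2 = 0.01 * rng.standard_normal((n_out, n_hid))
y = forward(rng.standard_normal(n_in), rng.standard_normal(n_in),
            W0, W1, np.zeros(n_hid), W2, np.zeros(n_out))  # 3 outputs
```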
The first layer has weights that come from the input, with the input delay specified by the user; in this case the delay is [0 1]. Each subsequent layer has weights that come from the previous layer. The last layer is the output of the network. Adaptation is done by means of any suitable training algorithm, and performance is measured according to a specified performance function [2].

4.3 Training by Gradient Descent with Adaptive Learning Rate Back-Propagation
The training by gradient descent with adaptive learning rate back-propagation, proposed for this project, can train any network as long as its weight, net input, and transfer functions have derivatives. Back-propagation is used to calculate the derivatives of the performance with respect to the weight and bias variables, and each variable is adjusted according to gradient descent. At each training epoch, if the performance decreases toward the goal, the learning rate is increased. If the performance increases, the learning rate is reduced by a decrement factor and the change that increased the performance measure is not made [2].

5 System Implementation for the Crying Classification
In the first place, the infant cries are collected as recordings obtained directly from doctors of the Instituto Nacional de la Comunicación Humana (National Institute of Human Communication) and IMSS Puebla, using SONY ICD-67 digital recorders. The cries are captured and labeled in the computer with the kind of cry that the collector mentions orally at the end of each recording. Later, each signal wave is divided into segments of one second; these segments are labeled with a pre-established code, and each one constitutes a sample. From these, for the present experiments we have a corpus made up of 157 samples of normal infant cry, 879 of hypoacoustic cry, and 340 of cry with asphyxia.
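Stepping back to Section 4.3 for a moment, the adaptive learning rate rule can be sketched as follows. This is a minimal NumPy illustration on a toy problem, not the Matlab Neural Network Toolbox routine used in this work; the two-layer tanh network, the XOR task, and the increment/decrement factors (1.05 and 0.7) are assumptions for the example.

```python
import numpy as np

def train_gda(X, T, n_hidden=8, lr=0.01, epochs=2000, goal=1e-6,
              lr_inc=1.05, lr_dec=0.7, seed=0):
    """Batch gradient descent with adaptive learning rate, following the rule
    of Section 4.3: if an epoch lowers the error, keep the step and increase
    the learning rate; if it raises the error, undo the step and decrease the
    learning rate. Two-layer tanh network, mean squared error."""
    rng = np.random.default_rng(seed)
    W1 = 0.5 * rng.standard_normal((n_hidden, X.shape[1]))
    b1 = np.zeros(n_hidden)
    W2 = 0.5 * rng.standard_normal((T.shape[1], n_hidden))
    b2 = np.zeros(T.shape[1])

    def loss_and_grads(W1, b1, W2, b2):
        H = np.tanh(X @ W1.T + b1)           # hidden activations
        E = (H @ W2.T + b2) - T              # output error
        dY = 2.0 * E / E.size                # d(mse)/d(output)
        dH = (dY @ W2) * (1.0 - H ** 2)      # back-propagated to hidden layer
        grads = (dH.T @ X, dH.sum(0), dY.T @ H, dY.sum(0))
        return np.mean(E ** 2), grads

    mse, grads = loss_and_grads(W1, b1, W2, b2)
    for _ in range(epochs):
        if mse <= goal:
            break
        gW1, gb1, gW2, gb2 = grads
        cand = (W1 - lr * gW1, b1 - lr * gb1, W2 - lr * gW2, b2 - lr * gb2)
        new_mse, new_grads = loss_and_grads(*cand)
        if new_mse < mse:                    # error went down: keep the step
            (W1, b1, W2, b2), mse, grads = cand, new_mse, new_grads
            lr *= lr_inc
        else:                                # error went up: undo the step
            lr *= lr_dec
    return (W1, b1, W2, b2), mse

# Toy usage: learn XOR, a small nonlinear mapping.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
params, final_mse = train_gda(X, T)
```

Because a step is undone whenever it raises the error, the error is non-increasing over epochs while the learning rate probes for the largest usable step size.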
In the next step, the samples are processed one by one, extracting their acoustic characteristics, LPC and MFCC, with the freeware program Praat 4.0 [10]. The acoustic characteristics are extracted as follows: for every one-second sample we extract 16 coefficients from each 50-millisecond frame, generating vectors of 304 coefficients per sample. The neural network and the training algorithm are implemented with Matlab's Neural Network Toolbox. The neural network's architecture consists of 304 neurons in the input layer, a hidden layer with 120 neurons, and an output layer with 3 neurons. In order to perform the training and recognition tests, we randomly select 157 samples from each class; this number is determined by the number of available samples of normal cry. From them, 107 samples of each class are randomly selected for training, and the network is trained with these vectors until 500 epochs have been completed or an error of 1x10^-6 has been reached. After the network is trained, we test it with the 50 samples of each class left out of the original 157. The recognition accuracy is then shown in a confusion matrix.

6 Experimental Results
The classification accuracy was calculated as the number of samples correctly classified divided by the total number of samples. The detailed results of three tests with each kind of acoustic characteristic, LPC and MFCC, using one-second samples with 16 coefficients per 50 ms frame, are shown in the following confusion matrices (rows are the actual class, columns the predicted class).

Results using LPC:

           Normal  Deaf  Asphyxia
Normal       48      2      0
Deaf          2     48      0
Asphyxia      0      0     50
Table 1. Confusion matrix showing a 97.33% efficiency after 253 training epochs, using LPC acoustic characteristics.

           Normal  Deaf  Asphyxia
Normal       45      5      0
Deaf          0     50      0
Asphyxia      0      0     50
Table 2. Confusion matrix showing a 96.67% efficiency after 262 training epochs, using LPC acoustic characteristics.

           Normal  Deaf  Asphyxia
Normal       48      2      0
Deaf          0     50      0
Asphyxia      0      0     50
Table 3. Confusion matrix showing a 98.67% efficiency after 252 training epochs, using LPC acoustic characteristics.

Fig 4. Training with LPC feature vectors

Results using MFCC:

           Normal  Deaf  Asphyxia
Normal       46      4      0
Deaf          0     50      0
Asphyxia      0      0     50
Table 4. Confusion matrix showing a 97.33% efficiency after 296 training epochs, using MFCC acoustic characteristics.

           Normal  Deaf  Asphyxia
Normal       48      2      0
Deaf          0     50      0
Asphyxia      0      0     50
Table 5. Confusion matrix showing a 98.67% efficiency after 342 training epochs, using MFCC acoustic characteristics.

           Normal  Deaf  Asphyxia
Normal       50      0      0
Deaf          1     49      0
Asphyxia      0      0     50
Table 6. Confusion matrix showing a 99.33% efficiency after 252 training epochs, using MFCC acoustic characteristics.

Fig 5. Training with MFCC feature vectors

As can be noticed, there are some differences in the results obtained within each set of three experiments using the same kind of acoustic features. Given that the experiments were done under the same parameters, all other circumstances being equal, the variations in the results may be caused by the nature of the samples randomly selected for training the recognizer. That is, among the samples in the corpus some should contain more useful information than others; when selected at random, the training set of an experiment with a lower score must contain a greater number of less significant samples than the set selected for an experiment with a higher score. We are working to make the classification performance of our system more robust in this sense. It can also be observed in Figures 4 and 5 that the network converges faster when training with MFCC than when training with LPC. At the same time, the results obtained are in general better when using MFCC.

7 Conclusions and Future Work
This work demonstrates the efficiency of the feed-forward input-delay neural network, especially when using the Mel Frequency Cepstral Coefficients.
It is also shown that the results obtained, of up to 99.33%, are somewhat better than those obtained in previous works. These results may also be related to the fact that we use the original full-size vectors, with the objective of preserving all useful information. In order to compare the performance obtained, and to reduce the computational cost, we plan to try the system with an input vector reduction algorithm based on evolutionary computation, with the purpose of training the network in a shorter time without decreasing accuracy. We are also still collecting well-identified samples of the three kinds of cries, in order to ensure robust training. Among the works in progress in this project, we are testing new neural networks, as well as new kinds of hybrid models that combine neural networks with genetic algorithms and fuzzy logic, or other complementary models.

Acknowledgments
This work is part of a project financed by CONACYT-Mexico (37914-A). We want to thank Dr. Edgar M. Garcia-Tamayo and Dr. Emilio Arch-Tirado for their invaluable collaboration in helping us collect the crying samples, as well as for their wise advice.

References:
[1] Bosma, J. F., Truby, H. M., & Antolop, W., Cry Motions of the Newborn Infant, Acta Paediatrica Scandinavica (Suppl.), 163, 61-92, 1965.
[2] Limin Fu, Neural Networks in Computer Intelligence, McGraw-Hill International Editions, Computer Science Series, 1994.
[3] Gold, B., Morgan, N., Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, Inc., 2000.
[4] Tebelskis, J., Speech Recognition Using Neural Networks, PhD Dissertation, Carnegie Mellon University, 1995.
[5] Orozco Garcia, J., Reyes Garcia, C. A., Mel-Frequency Cepstrum Coefficients Extraction from Infant Cry for Classification of Normal and Pathological Cry with Feed-forward Neural Networks, ESANN 2003, Bruges, Belgium, 2003.
[6] Markel, J. D., Gray, A. H., Linear Prediction of Speech, New York: Springer-Verlag, 1976.
[7] Petroni, M., Malowany, A. S., Johnston, C. C., Stevens, B. J., Identification of pain from infant cry vocalizations using artificial neural networks (ANNs), The International Society for Optical Engineering, Volume 2492, Part two of two, Paper 2492-79, 1995.
[8] Ekkel, T., Neural Network-Based Classification of Cries from Infants Suffering from Hypoxia-Related CNS Damage, Master's Thesis, University of Twente, The Netherlands, 2002.
[9] DARPA Neural Network Study, AFCEA International Press, 1988.
[10] Boersma, P., Weenink, D., Praat v. 4.0.8, a system for doing phonetics by computer, Institute of Phonetic Sciences, University of Amsterdam, February 2002.
[11] Wasz-Hockert, O., Lind, J., Vuorenkoski, V., Partanen, T., Valanne, E., El Llanto en el Lactante y su Significación Diagnóstica (The Cry of the Infant and its Diagnostic Significance), Cientifico-Medica, Barcelona, 1970.
[12] Cano, S. D., Escobedo, D. I., Coello, E., El Uso de los Mapas Auto-Organizados de Kohonen en la Clasificación de Unidades de Llanto Infantil (The Use of Kohonen Self-Organizing Maps in the Classification of Infant Cry Units), Grupo de Procesamiento de Voz, 1er Taller AIRENE, Universidad Catolica del Norte, Chile, 1999, pp. 24-29.