Biomedical Signal Processing and Control 56 (2020) 101734 Contents lists available at ScienceDirect Biomedical Signal Processing and Control journal homepage: www.elsevier.com/locate/bspc Convolutional neural network approach for automatic tympanic membrane detection and classification Erdal Başaran a , Zafer Cömert b,∗ , Yüksel Çelik a a b Karabük University, Department of Computer Engineering, Karabük, Turkey Samsun University, Department of Software Engineering, Samsun, Turkey a r t i c l e i n f o Article history: Received 5 February 2019 Received in revised form 26 August 2019 Accepted 13 October 2019 Available online 30 October 2019 Keywords: Biomedical signal processing Clinical decision support system Otitis media Tympanic membrane detection Convolutional neural network Classification a b s t r a c t Otitis media (OM) is a term used to describe the inflammation of the middle ear. The clinical inspection of the tympanic membrane is conducted visually by experts. Visual inspection leads to limited variability among the observers and includes human-induced errors. In this study, we sought to solve these problems using a novel diagnostic model based on a faster regional convolutional neural network (Faster R-CNN) for tympanic membrane detection, and pre-trained CNNs for tympanic membrane classification. The experimental study was conducted on a new eardrum dataset. The Faster R-CNN was initially applied to the original images. The number of images in the dataset was subsequently increased using basic image augmentation techniques such as flip and rotation. We also evaluated the success of the model in the presence of various noise effects. The original and automatically extracted tympanic membrane patches were finally input separately to the CNNs. The AlexNet, VGGNets, GoogLeNet, and ResNets models were employed. This resulted in an average precision of 75.85% in the tympanic membrane detection. All CNNs in the classification produced satisfactory results, with the proposed approach achieving an accuracy of 90.48% with the VGG-16 model. This approach can potentially be used in future otological clinical decision support systems to increase the diagnostic accuracy of the physicians and reduce the overall rate of misdiagnosis. Future studies will focus on increasing the number of samples in the eardrum dataset to cover a full range of ontological conditions. This would enable us to realize a multi-class classification in OM diagnosis. © 2019 Elsevier Ltd. All rights reserved. 1. Introduction Otitis media (OM), which is a type of middle ear infection, is one of the most common pediatric diseases [1]. OM usually develops as a complication of the upper respiratory tract starting from the nasal cavity. It is a global health problem, which can either heal spontaneously or cause serious undesirable conditions such as speech defects, hearing loss, and cognitive disorders [2]. Almost two-thirds of all children under the age of seven experience this disease [3]. Additionally, it is also one of the leading causes of hearing loss in childhood [4]. Various technological advances in medicine such as otoscopy [5], pneumootoscopy, acoustic reflectometry, ultrasound evaluation, digital imaging, combined tympanometry/visual evaluation, and acoustic tympanometry have begun to play an ∗ Corresponding author at: Canik Yerleşkesi Gürgenyatak Mahallesi Merkez Sokak No: 40-2/1 55080, Canik/SAMSUN, Turkey. E-mail addresses: ebasaran@beu.edu.tr (E. Başaran), zcomert@samsun.edu.tr (Z. 
Cömert), yukselcelik@karabuk.edu.t (Y. Çelik). https://doi.org/10.1016/j.bspc.2019.101734 1746-8094/© 2019 Elsevier Ltd. All rights reserved. increasingly important role in the detection of otitis illnesses. In clinical practice, otoscopy devices are frequently used to diagnose OM by examining the status of the eardrum. Otoscope devices in the current standard of care consist of a camera, halogen light source, low-power magnifying lens, and port that connects to a computer for the storage of images and videos [1,6]. OM is the most common clinical manifestation of perforation in the eardrum, with inflammation and fluid accumulation in the middle ear [7]. There are different types of OM corresponding to various deformations in the eardrum. OM is generally categorized into three different classes. The first is acute otitis media (AOM), which is usually caused by the bacteria in the middle ear cavity and mainly develops after a cold or flu [8,9]. AOM is also the most common type of OM. 90% of children below two years tend to undergo acute otitis [10]. The basic symptoms for the diagnosis of AOM include the presence of liquid in the middle ear, bulging of the tympanic membrane or a reduction in its movements, redness on or liquid behind the membrane, and the absence of the tympanic membrane [11]. AOM can also be classified as mild, moderate, or severe based on the tympanic membrane findings and clinical 2 E. Başaran, Z. Cömert and Y. Çelik / Biomedical Signal Processing and Control 56 (2020) 101734 symptoms [12], and is one of the common reasons for the prescription of antibiotics [13]. The second common type of OM is the effusion otitis media (EOM), which exhibits fluid accumulation in the middle ear minus the symptoms of the AOM [14]. The otoscopic examination of the membrane has revealed the basic symptoms of EOM to include the accumulation of the mucoid, presence of an air-fluid level or loss of gloss, and bubbles behind the membrane [15,16]. The early diagnosis of EOM is important as its symptoms are insidious [2]. The last common type of OM is the chronic suppurative otitis media (CSOM) [17], which is common among children and leads to a decrease in the quality of life. CSOM causes longterm damage via infection and inflammation of the middle ear. The symptoms of CSOM include the perforation of the eardrum or its adherence to the middle ear wall, continuous or chronic discharge from the middle ear, and erosion of the middle ear ossicles. Vaccinations and antibiotics are routinely used clinically for treatment and prevention. Several guidelines have also been published for the identification and description of detailed tympanic membrane findings, which suggest appropriate diagnosis and initial treatment plans for patients [9,12,18]. The tympanic membrane and ear canal are currently examined visually in the clinic. Visual inspection leads to limited variability among the observers during diagnosis, includes human-induced errors, and is not objective [13]. Moreover, there is limited usage of computer-aided diagnosis or expert systems in this field [19]. In this study, we adopted convolutional neural networks (CNNs) for automatic tympanic membrane detection and classification tasks to overcome the above-mentioned disadvantages and enable increasingly objective examinations. A set of image processing techniques is used in the computational approach to focus on the region of interest (ROI). 
Troublesome feature extraction processes are carried out to describe the eardrum images before the classification task [20]. Complete and coherent diagnosis tools are useful for addressing the above-mentioned disadvantages. The object detection and classification tasks can be incorporated with high accuracy using CNNs [21]. We have thus devised a faster regional convolutional neural network approach (Faster R-CNN) for automatic tympanic membrane detection, as an assessment of its physical status is critical for evaluating the degree of disease, selecting a treatment method, and for automated OM diagnosis. We evaluated the performance of the Faster R-CNN method for the automatic detection of the tympanic membrane. Furthermore, the original and automatically detected tympanic membrane patches were separately input to the pre-trained deep CNN models. We employed the AlexNet [22], VGGNets [23], GoogLeNet [24], and ResNets [25] models. The classification task was treated as a binary classification due to the limited number of samples, even though there are three common types of OM as discussed previously. This enabled us to provide a consistent diagnosis model for the detection and classification of the tympanic membrane. 2. Related studies Various computational approaches have been proposed in the literature to evaluate eardrum images. A multi-class classification task covering AOM, EOM, and no-effusion has been introduced based on a vocabulary and grammar approach, wherein otoscopists and engineers have described an extensive feature set. Otitis media vocabulary matches the visual cues of the disorders, while the otitis media grammar corresponds to the use of the vocabulary set in the decision process. The researchers achieved a classification accuracy of 89.9% in the decision process [26]. A preliminary study has been presented based on image processing techniques and color distribution for Otorhinolaryngology. Researchers have previously combined the probability density function of the color compo- nent, Bayesian decision rule, and two regression models. It was proposed that color by itself could not present adequate discriminative features for the identification of otitis [27]. An interactive decision support tool known as the Cyclops Auris Wizard has been introduced in daily clinical routine for the quantitative analysis of medium ear pathology extensions. Software incorporating digital image processing and geometry techniques for otology pathologies has been designed, while taking the visual and subjective assessments of a specialist into consideration. This software can measure the perforation proportion of the eardrum [28]. A DepthFirst Search algorithm has been developed to identify OM at home and shorten the diagnosis timeline by transferring the real-time OM images via smartphones [29]. A complete hybrid feature-based system composed of the segmentation, feature extraction, feature selection, and classification steps has been developed to categorize different types of OM. Eardrum images in the above study were segmented using active contours, while the histogram of oriented gradient (HOG) and local binary pattern (LPB) descriptors were used to extract the features. Finally, the AdaBoost algorithm was employed for feature selection and classification, which resulted in a 88.06% classification accuracy [30]. 
The OM classification task has been realized using global image features and six machine learning algorithms namely, the k-nearest neighbor (kNN), decision tree (DT), linear discriminant analysis (LDA), Naive Bayes, multi-layer neural networks (MLN), and support vector machine (SVM) methods. The experimental results of the study indicated that the SVM produced the best mean classification of 72.04% [31]. A portable video otoscopy platform has also been presented to enhance the quality of the eardrum images. The proposed model uses digital image processing techniques, along with a continuity-based segmentation and Laplacian kernel [32]. Another model employing image processing techniques and the decision tree classification algorithm has been introduced for the automated diagnosis of OM. This model could diagnose the AOM, EOM, earwax or foreign body obstruction, and identify the normal tympanic membrane; it achieved an 80.6% classification accuracy. A model incorporating a low-cost custom-made video-otoscope has previously achieved a classification accuracy of 78.7% [33]. A smartphone and cloudbased system for the automated diagnosis of otitis media has also been proposed. Both image processing techniques and neural networks have been employed for diagnosis. The proposed system can detect five different types of OM with an accuracy of 86.84% using a neural network [34]. The rest of this paper is organized as follows: Section 3 describes the materials and methods used in this study. The results and discussion have been presented in Sections 4 and 5, respectively. Lastly, Section 6 details the concluding remarks. 3. Material and methods Tympanic membrane images were collected from patients who volunteered for the study. The location of the tympanic membrane was determined by three experienced otolaryngologists. A set of preprocessing procedures, described in Section 3.1.2, were applied to the images before training the models. The pre-trained deep CNN models were used for the classification task. The original and automatically determined tympanic membrane patches were separately input to the pre-trained deep CNN models. The flowchart of the proposed model is illustrated in Fig. 1. 3.1. The eardrum data set 3.1.1. Image acquisition In this study, we generated a new eardrum data set by collecting images from eligible patients examined at the Özel Van Akdamar E. Başaran, Z. Cömert and Y. Çelik / Biomedical Signal Processing and Control 56 (2020) 101734 3 Fig. 1. The flowchart of the proposed model for automatic tympanic membrane detection and classification. Hospital in Turkey between 10/2018 and 1/2019. An experienced otolaryngologist initially examined the patients using a standard otoscopy device. The eardrum images from this device were saved on a personal computer via a USB connection. The otolaryngologist’s diagnosis was subsequently saved by locating the images in the predefined folders. The folder names were used as labels for storing the images. The location of the tympanic membrane was labeled using the Image Labeler App in MATLAB (R2018b). This procedure enabled us to obtain ground truth-values. In the final stage, two other otolaryngologists validated the location of the tympanic membrane in each otoscope image and determined its class. On other words, each image in the data set was assigned an output label based on the majority vote of three experts. The details of this voting are presented as supplementary material. 
In addition, only high-quality images suitable for image processing were retained. This left us with 282 eardrum images covering various types of OM out of more than 950 collected samples. Sequential images from the same patients and images corrupted by hand tremor, insufficient light, or low quality were identified during the elimination process. Images in which the tympanic membrane could not be clearly identified for other miscellaneous reasons were also removed from the data set.

3.1.2. Description of the eardrum data set
A set of preprocessing steps was applied to the images before training the models. The original images had a resolution of 768 × 576 pixels. First, the image contrast was enhanced using the histogram equalization technique. The images were then resized to 64 × 64 pixels to ensure a suitable training duration for the Faster R-CNN model. For the classification stage, the image sizes were set to 227 × 227 pixels for AlexNet and 224 × 224 pixels for the other deep CNN models. The patients comprised 178 males and 104 females, aged between 2 and 71 years. The demographic information of the eardrum data set is given in Table 1.

Table 1
Demographic information on the eardrum data set.
Patients
  Number        Total     282
  Age (year)    Range     2–71
                Mean      8
                Median    5
  Gender        Male      178
                Female    104

We employed image augmentation techniques, namely flip and rotation, which do not disturb the morphological structure of the digital otoscope images, to increase the success of the model and to provide adequate data to the deep networks. This increased the number of samples in the data set from 282 to 1692. Details of the sample distribution in the data set are given in Table 2.

Table 2
The distribution of the samples in the eardrum data set.
Description                                  Class                  # of samples   # of augmented samples
The healthy tympanic membranes               Normal                 154            924
Abnormal or suspicious tympanic membranes    Abnormal               128            768
                                               AOM                   69            414
                                               Earwax                21            126
                                               Myringosclerosis       4             24
                                               Tympanostomy tubes     2             12
                                               CSOM                  14             84
                                               Otitis externa        18            108

Table 2 shows that there are six abnormal classes in addition to the normal one. However, binary classification was used in the experiments, as there were insufficient samples per class; all abnormal types of OM were therefore collected into a single abnormal class. We have made this data set publicly available for free download at http://www.ctganalysis.com/Category/otitis-media.
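For illustration, a minimal MATLAB sketch of this preprocessing stage is given below; the folder name and the channel-wise application of histogram equalization are assumptions, since the paper does not state how the color channels were handled.

```matlab
% Sketch of the preprocessing described in Section 3.1.2 (assumed folder layout:
% one subfolder per diagnostic class, used as the image label).
imds = imageDatastore('eardrum_dataset', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');

I = readimage(imds, 1);             % original 768 x 576 RGB otoscope image
for c = 1:size(I, 3)                % contrast enhancement, channel by channel
    I(:,:,c) = histeq(I(:,:,c));
end
Idet = imresize(I, [64 64]);        % input size used for the Faster R-CNN model
Icls = imresize(I, [227 227]);      % AlexNet input (224 x 224 for the other CNNs)
```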
3.2. Image augmentation techniques
We also used an augmented data set in addition to the original data set; it was derived from the original samples using image augmentation techniques. Image augmentation is a useful technique for increasing the success of a model when a data set contains a limited number of samples. Popular basic image augmentation techniques include flipping, rotation, scaling, cropping, translation, and noise injection. Furthermore, generative adversarial networks (GANs), interpolation, and machine learning algorithms can be used for the same purpose.

We only utilized the flip and rotate techniques in the experiments, to avoid disturbing the morphological structure of the samples in the data set. A flip is a simple static transformation that generates a new image as a mirror reversal of the original along the horizontal or vertical axis. In the rotate method, a new image is generated by rotating the original by a given angle (in degrees) counterclockwise around its center. Two additional samples were generated from each original image by applying the flip operation along the horizontal and vertical axes. Furthermore, three additional samples were generated by employing the rotate method with angles of 90, 180, and 270 degrees. The number of samples in the original data set was thus increased from 282 to 1692. Fig. 2 shows the result of the image augmentation process for a sample in the eardrum data set.

3.3. Principles of convolutional neural networks
CNN models comprise several deep layers, each of which has a different task in the architecture. Convolution, pooling, and fully connected layers are some of the most commonly used layers [35].

Convolutional layers (CONV): The convolution layers identify distinctive local features in the input images. The feature maps of the previous layer are denoted by X_i^{l-1}. These maps are convolved with the learnable kernels k_{ij}^l, and a trainable bias parameter b_j^l is added to prevent overfitting. The result is then passed through an activation function f(.) to generate the output feature map. This process can be expressed as in Eq. (1):

X_j^l = f( \sum_{i \in M_j} X_i^{l-1} * k_{ij}^l + b_j^l )    (1)

where M_j denotes the input map selection. Recently, rectified linear unit (ReLU) activation functions have increasingly been used owing to their success over the logistic sigmoid and hyperbolic tangent functions [36].

Pooling layers (POOL): Pooling layers realize a downsampling operation using techniques such as average or maximum pooling. This reduces the number of computational nodes and prevents overfitting [37]. The operation can be expressed as in Eq. (2):

X_j^l = down( X_j^{l-1} )    (2)

where the down(.) function represents the downsampling operation. This approach provides a summary of the local distinctive features.

Fully connected layers (FC): After the data have passed through several convolution and subsampling layers, fully connected layers realize a full connection between their neurons and all activations in the previous layer. The aim of an FC layer is to use the discriminative features to classify the input image into the classes defined by the training data set [38].

Optimization: The training process is executed using optimizers such as RMSProp, stochastic gradient descent with momentum (SGDM), and adaptive moment estimation (ADAM). In the SGDM method, the weights are updated regularly for each training iteration to reach the goal as quickly as possible [39]:

V_t = \beta V_{t-1} + \alpha \nabla_W L(W, X, y)    (3)

where L denotes the loss function, \alpha is the learning rate, \beta is the momentum coefficient, and W represents the weights, which are updated according to Eq. (4):

W = W - \alpha V_t    (4)

The RMSProp optimizer keeps a separate learning rate for each parameter and adapts it using a running average of the recent gradients. This approach operates well in both online and non-stationary settings and performs the parameter update using momentum on the rescaled gradient [40]. The ADAM optimizer adapts the parameter learning rates in each iteration using the average of the first moments of the gradients and, like RMSProp, also uses the average of their second moments. The method was designed to retain the advantages of the RMSProp method [41].

In this study, we used well-known pre-trained deep models for the classification task, namely the AlexNet [22], VGGNets [23], GoogLeNet [42], and ResNets [25] models.

Fig. 2.
An illustration of the image augmentation process for a sample in the eardrum data set. (a) Original image. (b) Horizontally flipped. (c) Vertically flipped. (d) 90 degrees rotated. (e) 180 degrees rotated. (f) 270 degrees rotated. Fig. 3. A block diagram of Region Proposal Network (RPN) [46]. • AlexNet is a basic, pioneer deep model that was introduced in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC2012) [22]. This network has a depth of eight, with 61 million computational parameters. Its architecture contains five CONV layers followed by three FC layers, with a few max-POOL layers in the center. A ReLU is employed in each CONV and FC layer to enable faster training. The input layer of the AlexNet model takes images of size 227 × 227 pixels. • The VGG network is another important deep CNN model. This model improves the general performance by raising the network depth to 16 and 19 [23]. This architecture includes 16 or more CONV/FC layers. Each CONV layer uses small 3 × 3 convolution filters, with a POOL layer inserted between each group of two or three CONV layers. The input layers of the VGG-16 and VGG-19 networks accept images of size 224 × 224 pixels. • GoogLeNet has introduced a new technical term known as the inception module, which offers a shortcut and few deeper branches. This increases the depth of the network, while maintaining a constant computational complexity. The GoogLeNet network consists of 22 layers and has solely 7 million computation parameters [42]. This indicates that the GoogLeNet network uses 12× fewer parameters as compared to AlexNet and exhibits a higher performance. • The ResNet network offers a residual learning framework. In this architecture, residual blocks are employed to provide a more suitable training time [25]. This model focuses on the degradation problem, and is novel due to the residual blocks and depth in its architecture. The stacked layers in a conventional deep learning model fit a desired underlying mapping, whereas the ResNet model permits these layers to fit a residual mapping. In this study, we adopted pre-trained deep CNN models to classify the tympanic membrane images into the normal and abnormal categories. Further details on these models can be found in the related papers. 3.4. Basic principles of faster R-CNN Object detection and classification are dominant tasks in the image processing field [43]. As the most primitive approach, sliding-window detectors are utilized by sliding windows from left and right, and from up to down to determine objects base on the classifiers. In this process, various window sizes and aspect ratios are considered. The patches obtained based on the windows are used to feed classifier and are wrapped since many classifiers allow input with the fixed size [44]. Instead of a brute force approach, a region proposal method to create ROIs is suggested for object detection. In selective search [44], first, each individual pixel is placed its own small group and then the groups are merged considering the texture for ensuring the possible ROIs. R-CNN are built based on a region proposal method utilizing 2000 ROIs with the fixed size. The regions are applied as the input to a CNN. In this manner, the distinctive features of the regions are obtained, and classification is realized using the deep network. In the architecture, fully connected layers are employed to classify the objects as well as to refine the boundary box. 
R-CNN has several disadvantages. It requires a very large number of proposals to produce an accurate result and, in addition to the fact that many of the proposed regions overlap one another, the feature extraction process is carried out individually for each of the 2000 ROIs. This is a time-consuming procedure [44].

Fast R-CNN introduces a shared feature extractor to avoid the feature extraction process being repeated 2000 times, as in R-CNN. It still relies on an external region proposal method with properties similar to those of selective search. To generate the ROI proposals, the feature map produced by the feature extractor and the output of the external region proposal method are combined using an ROI pooling layer. In this way, the fully connected layers are fed with the patches without the repeated, expensive feature extraction process. Consequently, the training time is reduced significantly compared with R-CNN [45].

The architecture of Faster R-CNN is similar to that of Fast R-CNN; the region proposal stage is simply replaced by a convolutional network called the region proposal network (RPN). The RPN takes the feature map produced by the first CNN in the design as its input and makes k guesses for each location in the feature map. As a result, it produces 4 × k coordinates and 2 × k scores per location, as shown in Fig. 3. Faster R-CNN also uses several anchors, which are carefully pre-selected to match real-life objects at different scales and with reasonable aspect ratios.

In this study, we propose a Faster R-CNN for the automatic tympanic membrane detection task. The details of the network architecture, covering the description of the layers, activations, and learnable weights, are given in Table 3.

Table 3
The analysis of Faster R-CNN.
#   Name                   Type                    Activations    Learnables                                   Total learnables
1   Image input            Image Input             64 × 64 × 3    –                                            0
2   conv 1                 Convolution             64 × 64 × 32   Weights 3 × 3 × 3 × 32, Bias 1 × 1 × 32      896
3   relu 1                 ReLU                    64 × 64 × 32   –                                            0
4   conv 2                 Convolution             64 × 64 × 32   Weights 3 × 3 × 32 × 32, Bias 1 × 1 × 32     9248
5   relu 2                 ReLU                    64 × 64 × 32   –                                            0
6   rpnConv3×3             Convolution             64 × 64 × 32   Weights 3 × 3 × 32 × 32, Bias 1 × 1 × 32     9248
7   rpnRelu                ReLU                    64 × 64 × 32   –                                            0
8   rpnConv1×1BoxDeltas    Convolution             64 × 64 × 32   Weights 1 × 1 × 32 × 32, Bias 1 × 1 × 32     1056
9   rpnBoxDeltas           Box Reg. Output         –              –                                            0
10  rpnConv1×1ClsScores    Convolution             64 × 64 × 16   Weights 1 × 1 × 32 × 16, Bias 1 × 1 × 16     528
11  rpnSoftmax             RPN Softmax             4096 × 8 × 2   –                                            0
12  rpnClassification      RPN Cls. Output         –              –                                            0
13  regionProposal         Region Proposal         1 × 4          –                                            0
14  roiPooling             ROI Max Pooling         31 × 31 × 32   –                                            0
15  fc 1                   Fully Connected         1 × 1 × 64     Weights 64 × 30752, Bias 64 × 1              1968192
16  relu 3                 ReLU                    1 × 1 × 64     –                                            0
17  fc 2                   Fully Connected         1 × 1 × 2      Weights 2 × 64, Bias 2 × 1                   130
18  softmax                Softmax                 1 × 1 × 2      –                                            0
19  Classoutput            Classification Output   –              –                                            0
20  fcBoxDeltas            Fully Connected         1 × 1 × 4      Weights 4 × 64, Bias 4 × 1                   260
21  boxDeltas              Box Reg. Output         –              –                                            0

The training of Faster R-CNN is performed in four key steps:

1. Step 1 of 4: Training a region proposal network (RPN).
2. Step 2 of 4: Training a Fast R-CNN network using the RPN from Step 1.
3. Step 3 of 4: Re-training the RPN using weight sharing with Fast R-CNN.
4. Step 4 of 4: Re-training Fast R-CNN using the updated RPN.

As mentioned above, the training of Faster R-CNN is realized in four key steps. For this reason, there are four different sets of training options, one for each network, as presented in Table 4.

Table 4
The training options of Faster R-CNN.
                      Options 1   Options 2   Options 3   Options 4
Max epoch             [10–100]    [10–100]    [10–100]    [10–100]
Mini-batch size       1           1           1           1
Initial learn rate    1 × 10−4    1 × 10−4    1 × 10−5    1 × 10−6
Optimizer             SGDM        SGDM        SGDM        SGDM

In the experimental study, we also investigated the effects of the maximum number of epochs and the learning rate on the automatic membrane detection task. For this purpose, the maximum number of epochs was varied between 10 and 100, and learning rates from 1 × 10−3 to 1 × 10−6 were investigated. The learning rates in the first two steps of the Faster R-CNN training were set higher than those in the last two steps, because the fine-tuning performed in the last two steps only updates the weights slightly.

3.5. Performance metrics
In the model evaluation phase, the primary objective metrics for the object detection task are average precision and recall, since these performance metrics are the ones most commonly adopted for object detection [47,48]. The efficiency of the Faster R-CNN was assessed on the basis of the precision and recall rates obtained for the different schemes mentioned above. For the classification task, we derived several common metrics from a confusion matrix consisting of the true positive (TP), false positive (FP), false negative (FN), and true negative (TN) counts [49]. The formulations of the metrics are given below:

Accuracy (Acc) = (TP + TN) / (TP + FP + FN + TN)    (5)

Sensitivity (Se) / Recall = TP / (TP + FN)    (6)

Specificity (Sp) = TN / (TN + FP)    (7)

F-score = 2·TP / (2·TP + FP + FN)    (8)

Average Precision (AP) = TP / (TP + TN)    (9)

In the object detection task, TP indicates a positive overlap between the real and the detected tympanic membrane areas. FN corresponds to a missed area that the model could not detect; such an area covers either a part of the tympanic membrane or the entire membrane. TN corresponds to an undetected area that does not include any part of the tympanic membrane. The metrics given above are computed over all test set images to produce the precision-recall (PR) curve. Ideally, the PR curve should remain close to one along the recall axis. In the classification task, TP and TN correspond to the numbers of correctly identified abnormal and normal tympanic membrane samples, whereas FP and FN correspond to the numbers of incorrectly identified abnormal and normal tympanic membrane samples, respectively.

In addition to the above performance metrics, we employed the receiver operating characteristic (ROC) curve, since it is a useful technique for measuring model success in a binary classification task; in this scope, the area under the curve (AUC) was calculated. The ROC is a probability curve, and the AUC represents the degree of separability, i.e., how well the model is able to distinguish between the classes. The ROC curve plots the TP rate (y-axis) against the FP rate (x-axis) [50].

Table 5
Division of the samples for training and testing.
                           Train   Test
The original data set      196     86
The augmented data set     1185    507

4. Results
All training and testing processes were run on a workstation equipped with an NVIDIA Quadro P6000 GPU and an Intel(R) Xeon(R) Gold 6132 CPU @ 2.60 GHz, using the MATLAB (R2018b) software. The data used in the training and testing processes of the Faster R-CNN approach were divided into two parts at rates of 70% and 30%.
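For illustration, a minimal MATLAB (R2018b, Computer Vision Toolbox) sketch of this detection stage is given below. The ground-truth table gTruth, the simplified backbone, and the single set of training options are assumptions; the actual procedure alternates the four training steps above with the options listed in Table 4.

```matlab
% Sketch of the Faster R-CNN detection stage (Sections 3.4 and 4).
% gTruth is assumed to be a table with the columns imageFilename and
% tympanicMembrane (bounding boxes), exported from the Image Labeler app.
rng(0);
idx    = randperm(height(gTruth));
nTrain = round(0.7 * height(gTruth));                % 70% / 30% division
trainData = gTruth(idx(1:nTrain), :);
testData  = gTruth(idx(nTrain+1:end), :);

% Simplified backbone following Table 3 (two 3 x 3 convolutional layers, 32 filters).
layers = [
    imageInputLayer([64 64 3])
    convolution2dLayer(3, 32, 'Padding', 'same')
    reluLayer
    convolution2dLayer(3, 32, 'Padding', 'same')
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(2)                           % tympanic membrane + background
    softmaxLayer
    classificationLayer];

options = trainingOptions('sgdm', ...
    'MiniBatchSize', 1, ...                          % fixed to one in this architecture
    'InitialLearnRate', 1e-4, ...                    % 1e-4 down to 1e-6 over the four steps
    'MaxEpochs', 25);                                % 10-100 epochs were tested

detector = trainFasterRCNNObjectDetector(trainData, layers, options);

% Detection and average precision (AP) on the test images.
results = cell(height(testData), 2);
for k = 1:height(testData)
    [bboxes, scores] = detect(detector, imread(testData.imageFilename{k}));
    results(k, :) = {bboxes, scores};
end
results = cell2table(results, 'VariableNames', {'Boxes', 'Scores'});
[ap, recall, precision] = evaluateDetectionPrecision(results, testData(:, 2));
```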
This implies that 196 and 86 eardrum images were used to train and validate the model for the original data set, while 1185 and 507 eardrum images were used to train and validate the model for the augmented data set as described in Table 5. As previously mentioned, four different experiments were run to evaluate the success of the Faster R-CNN model in the tympanic membrane detection task. First, the model was applied to the original data set and had 282 samples. The most efficient learning rates were determined by trial and error as 1 × 10−4 , 1 × 10−4 , 1 × 10−5 , and 1 × 10−6 for the networks in the Faster R-CNN architecture. The learning rate in the experiments should be neither too large nor small to achieve ideal convergence within an appropriate training time. Additionally, the experiments considering the maximum number of epochs was determined to be between ten and 100. The best AP value of 0.6772 was obtained when the maximum number of epochs was adjusted to 25 for the original eardrum data set. The tympanic membrane was not detected in only eight of the 86 samples. The time required for the training and testing processes was 19.37 and 0.0753 min, respectively. 7 In the second experiment, the number of samples in the eardrum data set was increased from 282 to 1692 using image augmentation techniques. The experiment was repeated with the same hyperparameters. The best AP obtained was 0.7585 when the maximum number of epochs was set to 100. The tympanic membrane was not detected in only 34 of the 509 samples. The time required for training and testing the Faster R-CNN was 529.56 and 0.5220 min, respectively. In this experiment, we observed an increase in the Faster R-CNN performance in the tympanic membrane detection task as compared to the original data set. Increasing the maximum number of epochs had a positive effect on the model performance. The effects of noise on the success of the model were evaluated in the third and fourth experiments. The Gaussian and salt and pepper noises were used in the third and fourth experiments, respectively. Experiments were run on the augmented data set similar to the second experiment. This resulted in an AP value of 0.7694 for the augmented data set containing Gaussian noise. The most effective results were obtained when the maximum number of epochs was adjusted to equal 25. The time required for training and testing the model was 135.089 and 0.4346 min, respectively. The tympanic membrane was not detected in only four of the 509 samples. The model achieved the best AP value of 0.7952 for the augmented data set with salt and pepper noise. The tympanic membrane was not detected in only one sample in the test set. The training and testing processes required 111.467 min and 0.4215 min, respectively. The model was able to detect the tympanic membrane with high precision in the noisy data set. However, there was also an increase in the number of overlap regions, which is not ideal for the proposed approach as it is desirable to extract only one region identifying the tympanic membrane. Table 6 shows the detailed results of the automatic tympanic membrane detection task, with the most effective results highlighted in bold. The results of the automatic tympanic membrane detection task indicate that the model could successfully detect the tympanic membrane and was resistant to noise. Increasing the number of samples in the data set enhanced the model performance. 
However, there was also a substantial increase in the model training time from 19.370 min to 529.56 min. This was due to the mini-batch size value and number of samples in the data set. The size of the mini-batch is constant in the Faster R-CNN architecture and must Table 6 Performance results of the Faster R-CNN model in the tympanic membrane detection task. Data set Epoch AP # of missing samples Train Time [min] Test Time [min] Original eardrum data set 10 15 20 25 50 100 0.4015 0.5007 0.6079 0.6772 0.6026 0.6325 31 25 17 8 16 9 7.793 11.380 14.533 19.370 36.173 70.142 0.0667 0.0701 0.0668 0.0753 0.0646 0.0815 Augmented eardrum data set 10 15 20 25 50 100 0.5827 0.6330 0.5627 0.6879 0.6998 0.7585 98 72 83 30 30 34 60.412 84.078 113.986 137.284 272.18 529.56 0.4657 0.5243 0.4478 0.4484 0.4552 0.5220 Augmented eardrum data set with Gauss noise 10 15 20 25 50 100 0.7675 0.7095 0.7545 0.7694 0.7400 0.6235 8 12 1 4 0 2 87.101 85.651 110.171 135.089 268.187 555.808 0.4519 0.4324 0.4439 0.4346 0.4394 0.5141 Augmented eardrum data set with salt & pepper noise 10 15 20 25 50 100 0.7162 0.6720 0.7952 0.7342 0.7408 0.6476 0 1 1 1 0 6 52.4371 75.2132 111.467 142.931 266.220 552.20 0.4661 0.4767 0.4215 0.4264 0.4198 0.4772 8 E. Başaran, Z. Cömert and Y. Çelik / Biomedical Signal Processing and Control 56 (2020) 101734 Fig. 4. Precision and recall curves considering the maximum number of epochs for the (a) original data set, (b) augmented data set, (c) augmented data set with Gaussian noise, and (d) augmented data set with salt and pepper noise. equal one. PR curves of the model incorporating the experimental setups are illustrated in Fig. 4. The experimental results show that the Faster R-CNN model is a useful approach to detect the tympanic membrane from the digital otoscope images. Some of the representative results of this approach are illustrated in Fig. 5. Well-known, pre-trained deep CNN models were employed in the classification task. First, the augmented data set composed of the original otoscope images was separately input to the deep CNN models. The models were then fed with the automatically detected tympanic membrane patches from Faster R-CNN. The samples where the tympanic membrane was not detected were used in their original forms. The initial learning rate, learn drop factor, and learn drop period were adjusted to 0.0001, 0.1 and 8, respectively, after numerous experiments to achieve an efficient configuration for the deep models. Thus, the values of the above-mentioned parameters were determined by trial and error. The SGDM optimizer was employed for all the models. The maximum number of epochs and mini-batch sizes were set to 32 and 16, respectively in all the models. This resulted in 74 samples per epoch and 2368 maximum iterations were required to train the models. Furthermore, the base learning rate was updated four times during the training phase as the learn drop factor was adjusted to 0.1 and the learn drop period was set to eight. This resulted in the initial base learning having initial and final values of 1 × 10−4 and 1 × 10−8 , respectively. Another point to be considered is that a predictable sequence of numbers was used when the data was divided into the training and test sets. The basic aim of this process was to apply the same samples to the models. The training and validation processes of the deep CNN models fed with the original data set are illustrated in Fig. 6. 
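For illustration, a minimal MATLAB sketch of this transfer-learning configuration is given below, using AlexNet as an example; the datastore variables imdsTrain and imdsTest are assumptions, and the other pre-trained models are handled analogously with 224 × 224 pixel inputs.

```matlab
% Sketch of the classification stage: replace the final layers of a pre-trained
% CNN with a two-class head and retrain with the options reported above.
net = alexnet;                                   % pre-trained on ImageNet
layersTransfer = net.Layers(1:end-3);            % drop the original FC/softmax/output layers
layers = [
    layersTransfer
    fullyConnectedLayer(2)                       % normal vs. abnormal tympanic membrane
    softmaxLayer
    classificationLayer];

options = trainingOptions('sgdm', ...
    'InitialLearnRate', 1e-4, ...
    'LearnRateSchedule', 'piecewise', ...
    'LearnRateDropFactor', 0.1, ...
    'LearnRateDropPeriod', 8, ...
    'MaxEpochs', 32, ...
    'MiniBatchSize', 16);

% imdsTrain / imdsTest are assumed imageDatastore objects labeled by folder names.
augTrain = augmentedImageDatastore([227 227], imdsTrain);   % 224 x 224 for VGG/GoogLeNet/ResNet
trainedNet = trainNetwork(augTrain, layers, options);

augTest = augmentedImageDatastore([227 227], imdsTest);
[predLabels, scores] = classify(trainedNet, augTest);
```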
It can be clearly seen that the models required ∼1000 iterations to achieve convergence in 14 epochs, with no significant improvement beyond this point. This indicates that the training time of the models can be reduced significantly. All models yielded promising results. The confusion matrices and performance metrics of the models fed with the original samples are reported in Tables 7 and 8, respectively. AlexNet achieved the best classification accuracy of 87.97%. VGG-16, VGG19, GoogLeNet, ResNet-50, and ResNet-101 achieved accuracies of 86.98%, 86.00%, 86.19%, 84.62%, and 85.01%, respectively. The sensitivity measure in this classification task is rather significant as it shows the performance of the models in distinguishing the abnormal tympanic membrane samples. The ResNet-101 model was at the forefront with a sensitivity of 82.17%. However, AlexNet was superior to the other models in terms of the performance and time required for training. The AlexNet model was trained in only 5.32 min. The automatically detected membrane patches were input to the models in the second stage of the classification task. The train- E. Başaran, Z. Cömert and Y. Çelik / Biomedical Signal Processing and Control 56 (2020) 101734 9 Fig. 5. A subset of the automatically detected tympanic membranes from the digital otoscope images. Table 7 Confusion matrices of the CNN models fed with the original otoscope images. AlexNet 187a 43c (a) VGG-16 18b 259d 186a 44c VGG-19 22b 255d 186a 44c GoogLeNet 27b 250d 186a 44c ResNet-50 26b 251d 177a 53c ResNet-101 25b 252d 189a 41c 35b 242d TP, True positive, (b) FP, False Positive, (c) FN, False Negative, (d) TN, True Negative. Table 8 Performance results of the CNN models fed with the original otoscope images. Model Acc (%) Se (%) Sp (%) F-score (%) AUC Train Time [min] AlexNet VGG-16 VGG-19 GoogLeNet ResNet-50 ResNet-101 87.97 86.98 86.00 86.19 84.62 85.01 81.30 80.87 80.87 80.87 76.96 82.17 93.50 92.06 90.25 90.61 90.98 87.37 85.98 84.93 83.97 84.16 81.94 83.26 0.9392 0.9478 0.9346 0.9393 0.9210 0.9323 5.32 17.36 19.11 10.12 20.84 26.32 Table 9 Confusion matrices of the CNN models fed with the detected tympanic membrane patches. AlexNet 182a 48c (a) VGG-16 14b 263d 199a 31c VGG-19 27b 250d 187a 43c GoogLeNet 28b 249d TP, True positive, (b) FP, False Positive, (c) FN, False Negative, and (d) TN, True Negative. 190a 40c ResNet-50 35b 242d 179a 51c ResNet-101 33b 244d 179a 51c 37b 240d 10 E. Başaran, Z. Cömert and Y. Çelik / Biomedical Signal Processing and Control 56 (2020) 101734 Fig. 6. The training and validation accuracy and loss of the pre-trained deep CNN models fed with the original data set. (a) Training accuracy, (b) Training loss, (c) Validation accuracy, and (d) Validation loss of the models. Table 10 Performance results of the CNN models fed with the detected tympanic membrane patches. Model Acc (%) Se (%) Sp (%) F-score (%) AUC Train Time [min] AlexNet VGG-16 VGG-19 GoogLeNet ResNet-50 ResNet-101 87.77 88.56 86.00 85.21 83.43 82.64 79.13 86.52 81.30 82.61 77.82 77.83 94.95 90.25 89.89 87.37 88.08 86.64 85.45 87.28 84.05 83.52 81.00 80.27 0.9507 0.9537 0.9392 0.9295 0.9057 0.9004 5.66 17.20 18.95 9.31 19.80 44.93 ing and validation processes of the models are illustrated in Fig. 7. The confusion matrices and performance results of the models are shown in Tables 9 and 10, respectively. The experimental results show that all the deep CNN models provided satisfactory results. 
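The reported values follow directly from these confusion matrices through Eqs. (5)-(8). As a short check, the sketch below recomputes them in MATLAB from the VGG-16 entries of Table 9; the ROC/AUC part assumes that the ground-truth test labels and the classification scores (e.g., the second output of classify) are available under the variable names shown.

```matlab
% Metrics of Eqs. (5)-(8) computed from a 2x2 confusion matrix
% (VGG-16 entries of Table 9: TP/FP/FN/TN as defined in Section 3.5).
TP = 199; FP = 27; FN = 31; TN = 250;

Acc    = (TP + TN) / (TP + FP + FN + TN);   % 0.8856 -> 88.56%
Se     = TP / (TP + FN);                    % sensitivity (recall), 86.52%
Sp     = TN / (TN + FP);                    % specificity, 90.25%
Fscore = 2*TP / (2*TP + FP + FN);           % 87.28%

% ROC curve and AUC: trueLabels holds the ground-truth test labels and scores the
% per-class posteriors returned by classify (assumed variable names; the second
% score column is assumed to correspond to the abnormal class).
[fpr, tpr, ~, AUC] = perfcurve(trueLabels, scores(:, 2), 'abnormal');
```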
The VGG-16 model had the best classification performance with an accuracy, sensitivity, and specificity of 88.56%, 86.52%, and 90.25%, respectively. The AlexNet, VGG-19, GoogLeNet, ResNet-50, and ResNet-101 models had an accuracy of 87.77%, 86.00%, 85.21%, 83.43%, and 82.64%, respectively. A time duration of 17.20 min was required to train the VGG-16 model. The experimental results indicate that feeding the deep CNN models with the automatically detected tympanic membrane patches elevated the model performances. Only the performance results of the ResNet model decreased. Fig. 8 shows the AUCs of the models for the two experimental setups of the classification tasks. Fig. 8 indicates that the VGG-16 model achieved the best AUC value of 0.9537, with the other deep CNN models having AUC values > 0.90. Lastly, additional experiments were conducted to validate the model success. In these experiments, we considered the original and automatically detected patches separately by splitting the data with a different dividing rate, to obtain 50% training and 50% test sets and a 10-fold cross-validation technique. Table 11 shows the results from the experiments wherein the deep pre-trained models were fed with the original OM samples, while considering the 50% training and 50% test data division rates and 10-fold cross-validation technique. All models achieved satisfactory results, with the VGG-16 model having a superior classification accuracy of 90.24%. In the final experiment, the data set was equally divided into the training and test sets, and the 10-fold cross-validation was also examined. The VGG-16 model had the most efficient results with E. Başaran, Z. Cömert and Y. Çelik / Biomedical Signal Processing and Control 56 (2020) 101734 11 Fig. 7. Training and validation accuracy, and loss of the pre-trained deep CNN models fed with the automatically detected tympanic membrane patches. (a) Training accuracy, (b) Training loss, (c) Validation accuracy, and (d) Validation loss of the models. Fig. 8. The ROC curves of the models. (a) The models fed with the original samples. (b) The models fed with the automatically detected tympanic membrane patches. 12 E. Başaran, Z. Cömert and Y. Çelik / Biomedical Signal Processing and Control 56 (2020) 101734 Table 11 Classification results of the models fed with the original images. Models AlexNet VGG-16 VGG-19 GoogLeNet ResNet-50 ResNet-101 The results of the 50% training and 50% test data division rates The results of the 10-fold cross-validation Acc (%) 82.62 83.33 82.27 82.86 81.79 83.33 Acc (%) 86.64 90.24 90.24 88.00 87.47 89.06 Se (%) 71.09 77.86 72.65 76.56 72.91 79.16 Sp (%) 92.20 87.87 90.26 88.09 89.17 86.79 Se (%) 79.03 86.24 85.80 85.54 85.28 87.89 Sp (%) 92.96 93.07 93.93 90.04 89.28 90.04 Table 12 Classification results of the models. Models AlexNet VGG-16 VGG-19 GoogLeNet ResNet-50 ResNet-101 The results of the 50% training and 50% test data division rates The results of the 10-fold cross-validation Acc (%) 85.22 85.93 85.81 82.97 80.96 80.37 Acc (%) 87.29 90.48 90.01 86.76 84.39 84.22 Se (%) 77.08 80.99 78.90 77.86 74.47 77.86 Sp (%) 91.99 90.04 91.55 87.22 86.36 82.46 Se (%) 80.85 86.84 86.97 83.98 81.64 80.46 Sp (%) 92.64 93.50 92.53 89.06 86.68 87.33 Table 13 A comparison between the state-of-the-art methods used for the computational diagnosis of otitis media. Authors Methods # of samples # of class Acc (%) 2011, Mironica et al. [31] 2011, Vetan et al. [27] 2013, Kuruvilla et al. [26] 2014, Chuen-Kai Shie et al. [30] 2016, Hermanus et al. 
[33] 2017, Huang and Huang [29] 2018, Hermanus et al. [34] 2019, This paper Global image features, kNN, DT, LDA, Naïve Bayes, MLN, SVM The color data distribution, Bayesian decision rule Vocabulary and grammar, DT Active contour segmentation, LBP and HOG features, and AdaBoost Visual features, DT Image processing, visual features, a depth-First search algorithm Visual features, DT, neural networks Faster R-CNN, pretrained CNN models, VGG-16, 10-fold cross-validation 186 100 181 865 486 20 389 1692 2 3 3 4 5 3 5 2 73.11 59.90 89.9 88.06 80.61 70.00 81.58 90.48 a classification accuracy of 90.48%. The 10-fold cross-validation method and models with the automatically detected tympanic membrane input patches increased the model performance, with the results shown in Table 12. 5. Discussion All related studies were compared by taking the various data sets, methods, and accuracy performance metrics shown in Table 13 into consideration. However, an exact comparison was not possible due to the use of different data sets and methods. Table 13 clearly shows that the researchers mostly classified the OM diseases based on a combination of the visual features from the eardrum images and using conventional machine learning techniques. In this study, we developed an end-to-end deep learning model that automatically focuses on tympanic membrane and eliminates the feature extraction and selection processes. The model directly makes a decision based on the input otoscope images. The tympanic membrane contains most of the visual cues of the underlying disorder in the otoscope images. Furthermore, the black background around the circular field of view of the endoscopic camera is the most visible feature in the images [27]. This has led some researchers to employ segmentation techniques such as the active contour to separate the tympanic membrane from the input otoscope image [30]. Some studies have applied this process manually [27]. In our proposed approach, Faster R-CNN was devised to enable selection of the tympanic membrane. Furthermore, pre-trained CNN models were adopted to classify the digital otoscope images without the cumbersome feature extraction and selection processes. We have, thus achieved a fully heuristic diagnostic model for otology in this study. The successful implementation of the deep CNN models in clinical decision support systems is difficult as the models require reasonably large-scale data sets to ensure robust diagnostic outcomes [19]. The collection of large-scale data from the clinics is quite an expensive and difficult task, and at times may even be impossible [21]. Decision support systems have not been actively adopted by practitioners in the field of otology, as the importance of diagnostic systems has not yet be fully comprehended. However, the use of a consistent and heuristic diagnostic system in daily clinical applications offers several advantages such as increasing the physician diagnostic accuracy, reducing the misdiagnosis rate, supporting the decision-making process, and conducting standard and objective examinations. 6. Conclusion OM is the technical term for the inflammation of the middle ear. The eardrum is examined visually in clinical practice. This leads to limited variability among the observers during diagnosis, includes human-induced errors, and is subjective. It is evident that computerized methods for the diagnosis of clinical OM are not yet sufficiently widespread. 
We addressed the above-mentioned disadvantages as follows: (1) First, a new publicly accessible eardrum data set comprising 282 tympanic membrane images was introduced. (2) A Faster R-CNN model was proposed to select automatically the tympanic membrane from the digital otoscope images, as this membrane contains most of the visual cues of the underlying disorder. This led us to achieve an AP of 75.85% in the tympanic membrane detection task. (3) Pre-trained deep CNN models were adopted to distinguish between the abnormal and normal samples, without the E. Başaran, Z. Cömert and Y. Çelik / Biomedical Signal Processing and Control 56 (2020) 101734 cumbersome feature extraction and selection processes. The AlexNet, VGGNets, GoogLeNet, and ResNets models were used for this specific task. All models yielded satisfactory results, with the best accuracy of 90.48% achieved by the VGG-16 model. (4) A consistent diagnosis model for tympanic membrane detection and classification was thus developed. This approach can potentially be used in future otological clinical decision support systems to enhance the diagnostic accuracy and reduce the overall rate of misdiagnosis. Future studies would focus on increasing the number of samples in the eardrum data set to cover a full range of ontological conditions. This would enable us to realize a multi-class classification task in OM diagnosis. Focus would also be on the activation maps in the deep CNN models to describe the samples and feed the shallow networks. [9] [10] [11] [12] [13] Ethical approval This article does not contain any data, or other information from studies or experimentation, with the involvement of human or animal subjects. Data availability The otoscope images used in this study can be freely downloaded from http://www.ctganalysis.com/Category/otitis-media. The annotations of the experts can be found on the same web site. Funding There is no funding source for this article. Appendix A. Supplementary data Supplementary material related to this article can be found, in the online version, at doi:https://doi.org/10.1016/j.bspc.2019. 101734. [14] [15] [16] [17] [18] [19] [20] [21] [22] Declaration of Competing Interest The authors declare that there is no conflict to interest related to this paper. [23] [24] References [1] H.C. Myburgh, S. Jose, D.W. Swanepoel, C. Laurent, Towards low cost automated smartphone- and cloud-based otitis media diagnosis, Biomed. Signal Process. Control 39 (2018) 34–52, http://dx.doi.org/10.1016/j.bspc. 2017.07.015. [2] E.B. Edetanlen, B.D. Saheeb, Otitis media with effusion in Nigerian children with cleft palate: incidence and risk factors, Br. J. Oral Maxillofac. Surg. (2018), http://dx.doi.org/10.1016/j.bjoms.2018.11.015. [3] A. Kørvel-Hanquist, A. Koch, J. Lous, S.F. Olsen, P. Homøe, Risk of childhood otitis media with focus on potentially modifiable factors: a Danish follow-up cohort study, Int. J. Pediatr. Otorhinolaryngol. 106 (2018) 1–9, http://dx.doi. org/10.1016/j.ijporl.2017.12.027. [4] R.C. Di Francesco, V.B. Barros, R. Ramos, Otitis media with effusion in children younger than 1 year, Rev. Paul. Pediatr. English Ed. 34 (2016) 148–153, http:// dx.doi.org/10.1016/j.rppede.2016.01.003. [5] S. Mousseau, A. Lapointe, J. Gravel, Diagnosing acute otitis media using a smartphone otoscope; a randomized controlled trial, Am. J. Emerg. Med. 36 (2018) 1796–1801, http://dx.doi.org/10.1016/j.ajem.2018.01.093. [6] R.H. Eikelboom, M.N. Mbao, H.L. Coates, M.D. Atlas, M.A. 
Gallop, Validation of tele-otology to diagnose ear disease in children, Int. J. Pediatr. Otorhinolaryngol. 69 (2005) 739–744, http://dx.doi.org/10.1016/j.ijporl.2004. 12.008. [7] A. Coleman, A. Cervin, Probiotics in the treatment of otitis media. The past, the present and the future, Int. J. Pediatr. Otorhinolaryngol. 116 (2019) 135–140, http://dx.doi.org/10.1016/j.ijporl.2018.10.023. [8] N.H. Davidoss, Y.K. Varsak, P.L. Santa Maria, Animal models of acute otitis media – a review with practical implications for laboratory research, Eur. [25] [26] [27] [28] [29] [30] [31] 13 Ann. Otorhinolaryngol. Head Neck Dis. 135 (2018) 183–190, http://dx.doi.org/ 10.1016/j.anorl.2017.06.013. J. Pitaro, S. Waissbluth, M.-C. Quintal, A. Abela, A. Lapointe, Characteristics of children with refractory acute otitis media treated at the pediatric emergency department, Int. J. Pediatr. Otorhinolaryngol. 116 (2019) 173–176, http://dx. doi.org/10.1016/j.ijporl.2018.10.045. E. Roy, K.Z. Hasan, F. Haque, A.K.M. Siddique, R.B. Sack, Acute otitis media during the first two years of life in a rural community in Bangladesh: a prospective cohort study, J. Heal. Popul. Nutr. 25 (2007) 414–421. A. Büyükcam, A. Kara, T. Bedir, B. Gülhan, H. Özdemir, M. Sütçü, M. Düzgöl, A. Arslan, T. Tekin, S. Çelebi, M.G. Kukul, G.İ. Bayhan, M. Köşker, A. Karbuz, M. Çelik, Z.K. Sütçü, Ö. Metin, S. Karakaşlılar, A. Dağlı, S.S. Kara, E. Albayrak, S. Kanık, H. Tezer, A. Parlakay, E. Çiftci, A. Somer, İ. Devrim, Z. Kurugöl, E.Ç. Dinleyici, P. Atla, Pediatricians’ attitudes in management of acute otitis media and ear pain in Turkey, Int. J. Pediatr. Otorhinolaryngol. 107 (2018) 14–20, http://dx.doi.org/10.1016/j.ijporl.2018.01.011. K. Kitamura, Y. Iino, Y. Kamide, F. Kudo, T. Nakayama, K. Suzuki, H. Taiji, H. Takahashi, N. Yamanaka, Y. Uno, Clinical Practice Guidelines for the diagnosis and management of acute otitis media (AOM) in children in Japan – 2013 update, Auris Nasus Larynx 42 (2015) 99–106, http://dx.doi.org/10.1016/j.anl. 2014.09.006. M.E. Pichichero, Diagnostic accuracy of otitis media and tympanocentesis skills assessment among pediatricians, Eur. J. Clin. Microbiol. Infect. Dis. 22 (2003) 519–524, http://dx.doi.org/10.1007/s10096-003-0981-8. S. Shah-Becker, M.M. Carr, Current management and referral patterns of pediatricians for acute otitis media, Int. J. Pediatr. Otorhinolaryngol. 113 (2018) 19–21, http://dx.doi.org/10.1016/j.ijporl.2018.06.036. A. Aksoy, E. Ayhan, Orta kulak efüzyonlarında timpanogram ile otoskopik bulguların karşılaştırılması, Dicle Med. J. 40 (2013) 54–56, http://dx.doi.org/ 10.5798/diclemedj.0921.2013.01.0224. I.P.O. Timoty Els, The Prevalence and Impact of Otitis Media With Effusion in Children Admittedfor Adeno-tonsillectomy at Dr George Mukhari Academic Hospital, Pretoria,South Africa, 2018, pp. 76–80. N.S. Tsilis, P.V. Vlastarakos, V.F. Chalkiadakis, D.S. Kotzampasakis, T.P. Nikolopoulos, Chronic otitis media in children: an evidence-based guide for diagnosis and management, Clin. Pediatr. (Phila) 52 (2013) 795–802, http:// dx.doi.org/10.1177/0009922813482041. A.S. Lieberthal, A.E. Carroll, T. Chonmaitree, T.G. Ganiats, A. Hoberman, M.A. Jackson, M.D. Joffe, D.T. Miller, R.M. Rosenfeld, X.D. Sevilla, et al., The diagnosis and management of acute otitis media, Pediatrics (2013), peds–2012. L.S. Goggin, R.H. Eikelboom, M.D. Atlas, Clinical decision support systems and computer-aided diagnosis in otology, Otolaryngol. Neck Surg. 136 (2007) s21–s26, http://dx.doi.org/10.1016/j.otohns.2007.01.028. A.H, J.K. 
Anupama Kuruvilla, Jian Li, Pablo Hennings Yeomans, Pedro Quelhas, Nader Shaikh, Otitis media vocabulary and grammar, Media (2012) 2845–2848. Z. Cömert, A.F. Kocamaz, Fetal hypoxia detection based on deep convolutional neural network with transfer learning approach, in: R. Silhavy (Ed.), Softw. Eng. Algorithms Intell. Syst., Springer International Publishing, Cham, 2019, pp. 239–248, http://dx.doi.org/10.1007/978-3-319-91186-1 25. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Eds.), Proc. 25th Int. Conf. Neural Inf. Process. Syst. - Vol. 1, Curran Associates, Inc., USA, 2012, pp. 1097–1105 http://dl.acm.org/citation. cfm?id=2999134.2999257. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, ArXiv Prepr (2014), ArXiv1409.1556. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (2015) 211–252, http:// dx.doi.org/10.1007/s11263-015-0816-y. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA (2016) 770–778, http://dx.doi.org/10.1109/CVPR.2016.90. A. Kuruvilla, N. Shaikh, A. Hoberman, J. Kovačević, Automated diagnosis of otitis media: vocabulary and grammar, J. Biomed. Imaging 2013 (2013) 27. C. Vertan, D.C. Gheorghe, B. Ionescu, Eardrum color content analysis in video-otoscopy images for the diagnosis support of pediatric otitis, ISSCS 2011 - Int. Symp. Signals, Circuits Syst. Proc. (2011) 129–132, http://dx.doi. org/10.1109/ISSCS.2011.5978676. H. Junior, E. Comunello, S. Costa, C.C. Dornelles, Computational techniques for accompaniment and measuring of otology pathologies, in: Twent. IEEE Int. Symp. Comput. Med. Syst., IEEE, Maribor, Slovenia, 2007. Y.K. Huang, C.P. Huang, A depth-first search algorithm based otoscope application for real-time otitis media image interpretation, Parallel Distrib. Comput. Appl. Technol. PDCAT Proc. 2017-Decem (2018) 170–175, http://dx. doi.org/10.1109/PDCAT.2017.00036. C.K. Shie, H.T. Chang, F.C. Fan, C.J. Chen, T.Y. Fang, P.C. Wang, A hybrid feature-based segmentation and classification system for the computer aided self-diagnosis of otitis media, 2014 36th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. EMBC 2014 (2014) 4655–4658, http://dx.doi.org/10.1109/EMBC.2014. 6944662. I. Mironica, C. Vertan, D.C. Gheorghe, Automatic pediatric otitis detection by classification of global image features, 2011 E-Health Bioeng. Conf. (2011) 1–4. 14 E. Başaran, Z. Cömert and Y. Çelik / Biomedical Signal Processing and Control 56 (2020) 101734 [32] L. Cheng, J. Liu, C.E. Roehm, T.A. Valdez, Enhanced video images for tympanic membrane characterization, Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. EMBS (2011) 4002–4005, http://dx.doi.org/10.1109/IEMBS.2011.6090994. [33] H.C. Myburgh, W.H. van Zijl, D. Swanepoel, S. Hellström, C. Laurent, Otitis media diagnosis for developing countries using tympanic membrane image-analysis, EBioMedicine 5 (2016) 156–160, http://dx.doi.org/10.1016/J. EBIOM.2016.02.017. [34] H.C. Myburgh, S. Jose, D.W. Swanepoel, C. Laurent, Towards low cost automated smartphone- and cloud-based otitis media diagnosis, Biomed. Signal Process. Control 39 (2018) 34–52, http://dx.doi.org/10.1016/j.bspc. 2017.07.015. [35] Y. Guo, Ü. Budak, L.J. Vespa, E. 
Khorasani, A. Şengür, A retinal vessel detection approach using convolution neural network with reinforcement sample learning strategy, Measurement 125 (2018) 586–591, http://dx.doi.org/10. 1016/j.measurement.2018.05.003. [36] D. Macêdo, C. Zanchettin, A.L.I. Oliveira, T. Ludermir, Enhancing batch normalized convolutional networks using displaced rectifier linear units: a systematic comparative study, Expert Syst. Appl. 124 (2019) 271–281, http:// dx.doi.org/10.1016/j.eswa.2019.01.066. [37] C. Xu, J. Yang, H. Lai, J. Gao, L. Shen, S. Yan, UP-CNN: un-pooling augmented convolutional neural network, Pattern Recognit. Lett. 119 (2019) 34–40, http://dx.doi.org/10.1016/j.patrec.2017.08.007. [38] A. Lumini, L. Nanni, Deep learning and transfer learning features for plankton classification, Ecol. Inform. 51 (2019) 33–43, http://dx.doi.org/10.1016/j. ecoinf.2019.02.007. [39] L. Wang, Y. Yang, R. Min, S. Chakradhar, Accelerating deep neural network training with inconsistent stochastic gradient descent, Neural Netw. 93 (2017) 219–229, http://dx.doi.org/10.1016/j.neunet.2017.06.003. [40] M.D. Zeiler, ADADELTA: An Adaptive Learning Rate Method, 2012, http://dx. doi.org/10.1145/1830483.1830503. [41] R. Shindjalova, K. Prodanova, V. Svechtarov, Modeling data for tilted implants in grafted with bio-oss maxillary sinuses using logistic regression, AIP Conf. Proc. (2014) 58–62, http://dx.doi.org/10.1063/1.4902458, 1631. [42] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: 2015 IEEE Conf. Comput. Vis. Pattern Recognit., IEEE, 2015, pp. 1–9, http://dx.doi.org/10. 1109/CVPR.2015.7298594. [43] R.B. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, 2014 IEEE Conf. Comput. Vis. Pattern Recognit. (2014) 580–587. [44] C.L. Zitnick, P. Dollár, Edge Boxes, Locating object proposals from edges, in: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.), Comput. Vis. — ECCV 2014, Springer International Publishing, Cham, 2014, pp. 391–405. [45] R. Girshick, Fast R-CNN, Proc. IEEE Int. Conf. Comput. Vis. (2015) 1440–1448. [46] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2017) 1137–1149, http://dx.doi.org/10.1109/TPAMI.2016.2577031. [47] P. Ding, Y. Zhang, W.-J. Deng, P. Jia, A. Kuijper, A light and faster regional convolutional neural network for object detection in optical remote sensing images, ISPRS J. Photogramm, Remote Sens. 141 (2018) 208–218, http://dx. doi.org/10.1016/j.isprsjprs.2018.05.005. [48] Ü. Budak, Ö.F. Alçin, M. Aslan, A. Şengür, Optic disc detection in retinal images via faster regional convolutional neural networks, 1st Int. Eng. Technol. Symp. (2018). [49] Z. Cömert, A.F. Kocamaz, V. Subha, Prognostic model based on image-based time-frequency features and genetic algorithm for fetal hypoxia assessment, Comput. Biol. Med. (2018), http://dx.doi.org/10.1016/j.compbiomed.2018.06. 003. [50] T.C.W. Landgrebe, R.P.W. Duin, Efficient multiclass ROC approximation by decomposition via confusion matrix perturbation analysis, IEEE Trans. Pattern Anal. Mach. Intell. 30 (2008) 810–822, http://dx.doi.org/10.1109/TPAMI.2007. 70740.