Steganalysis of Audio Based on Audio Quality Metrics

Hamza Özer (a,d), İsmail Avcıbaş (b), Bülent Sankur (a), Nasir Memon (c)

(a) Department of Electrical and Electronics Engineering, Boğaziçi University, Bebek, İstanbul, Turkey
(b) Department of Electronics Engineering, Uludağ University, Görükle, Bursa, Turkey
(c) Department of Computer and Information Science, Polytechnic University, Brooklyn, NY, USA
(d) National Research Institute of Electronics and Cryptology, Marmara Research Center, Gebze, Turkey

ABSTRACT

Classification of audio documents as bearing hidden information or not is a security issue addressed in the context of steganalysis. A cover audio object can be converted into a stego-audio object via steganographic methods. In this study we present a statistical method to detect the presence of hidden messages in audio signals. The basic idea is that the distributions of various statistical distance measures, calculated on cover audio signals and on stego-audio signals vis-à-vis their denoised versions, are statistically different. The design of the audio steganalyzer relies on the choice of these audio quality measures and on the construction of a two-class classifier. Experimental results show that the proposed technique can be used to detect the presence of hidden messages in digital audio data.

Keywords: Steganalysis, watermarking, audio quality measures, feature selection, support vector machine

1. INTRODUCTION

Given the proliferation of digital multimedia data and the redundancy inherent in digital documents despite compression, there has been growing interest in using multimedia for the purpose of steganography. Steganography exploits the covert channel that multimedia documents make possible and aims at message communication whose effect and presence go totally undetected except by the intended recipient. In other words, steganographic techniques strive to hide the very occurrence of a message communication [20]. To achieve secure and undetectable communication, stego-objects, documents containing a secret message, should be indistinguishable from cover-objects, documents not containing any secret message. In this respect, steganalysis is the set of techniques that aim to distinguish between cover-objects and stego-objects.

Steganalysis can be implemented in either a passive warden or an active warden style. A passive warden simply examines a document and tries to determine whether it potentially contains a hidden message. If it appears that it does, the document is stopped; otherwise it is let through. An active warden, on the other hand, can alter documents deliberately, even though it may not see any trace of a hidden message, in order to foil any secret communication that may nevertheless be occurring. The amount of change the warden introduces should not exceed the point where the subjective quality of the suspected stego audio track is altered significantly. In this paper we are mainly concerned with passive warden steganalysis. Specifically, we intend to develop steganalysis tools for audio documents. It should be noted that, although there has been considerable effort in the steganalysis of digital images [1, 8, 11, 27], the steganalysis of digital audio is relatively unexplored.

This work was partially supported by TÜBİTAK Project 102E018 and Boğaziçi Research Fund Project 01A201. Nasir Memon was also supported by AFOSR Award Number F49620-01-1-0243.
The idea underlying our audio steganalysis is the fact that any steganographic technique will invariably perturb the statistics of the cover signal to some extent. In fact, for the additive class of message embedding techniques, the presence of steganographic communication in a signal can be modeled as additive noise in the time or frequency domain. Even for non-additive techniques, such as substitutive embedding, the difference between the stego-signal y and the cover signal x can be considered to be additively combined with the cover signal, y = x + (y - x), albeit in a signal-dependent manner. The presence of the steganographic artifact can then be put into evidence by recovering the original cover signal or, alternatively, by de-noising the suspected stego-signal. The steganalyzer can directly apply a statistical test on the denoising residual, $x - \hat{x}$, where x is the signal under test and $\hat{x}$ is the estimated original signal. This residual must then correspond to the artifact due to the embedding of a hidden message. Notice that, even if the test signal does not contain any hidden message, the de-noising step will still yield an output, whose statistics can however be expected to differ from those of a true embedding. Alternately, the steganalyzer can be constructed as a "distortion meter" between the test signal and the estimated original signal $\hat{x}$, again obtained by denoising. For this purpose, one can use various "audio signal quality measures" to monitor the extent of the steganographic distortion. Here we implicitly assume that the distance between a smooth signal and its denoised version is less than the distance between a noisy signal and its denoised version; in other words, any embedding effort will render the signal less predictable and less smooth. The perturbations due to the presence of an embedding translate to the feature space, where the "audio quality features" of marked and non-marked signals plot in different regions. An alternate way to sense the presence of "marking" would be to monitor the change in the predictability of a signal, temporally and/or across scales.

In the proposed steganalysis method, as presented in Fig. 1, the underlying idea is to first isolate the stego-signal by subtracting the estimated original signal from the given test signal. The original signal itself is estimated by some denoising scheme or by a model-based algorithm. In the case of additive message embedding, de-noising algorithms become effective tools for the estimation of the original, non-marked signal [24]. We have used the wavelet shrinkage method [6] to reduce the noise, in the expectation of removing the hidden message. In this scheme, a soft-thresholding nonlinearity is applied to the wavelet coefficients such that the inverse wavelet transform yields the denoised signal. An alternate scheme would be to use sparse code shrinkage based on independent component analysis [9]. For non-additive message embedding, the original signal can be estimated via maximum likelihood (ML) or maximum a posteriori (MAP) approaches.

Figure 1. Block diagram of the steganalysis method: incoming document → extract hidden message → feature selection → decision and classification → decision (marked or not marked).
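To make the denoising step concrete, the sketch below applies wavelet soft-thresholding (shrinkage) to a test signal and returns both the estimated original and the residual that the steganalyzer examines. This is a minimal illustration of the idea, assuming the PyWavelets library; the db4 wavelet, the decomposition depth, and the universal threshold with a median-based noise estimate are our assumptions, not necessarily the exact settings used with [6].

```python
import numpy as np
import pywt  # PyWavelets

def denoising_residual(y, wavelet="db4", level=4):
    """Estimate the cover signal x_hat by wavelet shrinkage and return
    (x_hat, y - x_hat); the residual should carry the embedding artifact."""
    coeffs = pywt.wavedec(y, wavelet, level=level)
    # Robust noise-scale estimate from the finest detail band.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    # Universal threshold; soft-threshold the detail bands only.
    t = sigma * np.sqrt(2.0 * np.log(len(y)))
    shrunk = [coeffs[0]] + [pywt.threshold(c, t, mode="soft") for c in coeffs[1:]]
    x_hat = pywt.waverec(shrunk, wavelet)[: len(y)]
    return x_hat, y - x_hat
```

The quality measures of Section 2 would then be computed between the test signal y and its denoised version x_hat.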
The rest of this paper is organized as follows. In Section 2 we discuss audio quality measures. In Section 3 we address the problem of feature selection for the two-class problem; Section 3 also discusses the classifiers used on a comparative basis. The test database, the experiments conducted, and the main results of classification and steganalysis are given in Section 4. Finally, conclusions are drawn in Section 5.

2. SELECTION OF AUDIO FEATURES FOR STEGANALYSIS

In this section, we investigate several audio quality measures for the purpose of audio steganalysis. We consider an audio quality metric to be, in effect, a functional that converts its input signal into a measure that is purportedly sensitive to the presence of a steganographic message embedding. We search for measures that reflect the quality of a distorted or degraded audio signal vis-à-vis its original in an accurate, consistent and monotonic way. Such a measure, in the context of steganalysis, should respond to the presence of a hidden message with minimum error, should work for a large variety of embedding methods, and should react in proportion to the embedding strength.

We consider two categories of measures that quantify signal distortion, namely perceptual and non-perceptual. In the perceptual category, the distortion measures are specific to speech/audio and take into consideration the properties of the human auditory system. These perceptual measures share a two-component structure. The first component is a perceptual transformation module, which transforms the input into a perceptually relevant domain such as the temporal, spectral, or loudness domain; the domain filters typically take psychoacoustic models into account. The second component is the cognition/judgment module, which compares the two perceptually transformed signals in order to generate an estimate of the distortion. In the non-perceptual category, the measures penalize the distance between a test signal and its reference signal; the basis of comparison varies from the Euclidean distance to artificial neural networks or fuzzy logic. Although most of these measures were developed for speech quality, they can easily be extended to the audio band. The audio quality measures tested for the design of the steganalyzer are listed in Table 1. Their detailed descriptions can be found in the Appendix.

Table 1: Audio quality measures tested for the design of the steganalyzer

Non-perceptual-domain measures
- Time-domain measures: Signal-to-noise ratio (SNR); Segmental signal-to-noise ratio (SNRseg); Czenakowski distance (CZD)
- Frequency-domain measures: Log-likelihood ratio (LLR); Log-area ratio (LAR); Itakura-Saito distance (ISD); COSH distance (COSH); Cepstral distance (CD); Short-time Fourier-Radon transform distance (STFRT); Spectral phase distortion (SP); Spectral phase-magnitude distortion (SPM); Weighted slope spectral distance (WSS)

Perceptual-domain measures
- Bark spectral distortion (BSD); Modified Bark spectral distortion (MBSD); Enhanced modified Bark spectral distortion (EMBSD); Perceptual speech quality measure (PSQM); Perceptual audio quality measure (PAQM); Measuring normalizing block 1 (MNB1); Measuring normalizing block 2 (MNB2)

3. FEATURE SELECTION AND CLASSIFIER DESIGN

It has been observed that filtering an audio signal that bears no hidden message changes the quality metrics differently than filtering an audio signal with an embedded message. For feature selection we have used two approaches, namely, analysis of variance (ANOVA) [19] and the Sequential Floating Search method [17].
Analysis of Variance (ANOVA)

The analysis of variance, known as ANOVA, is a general technique for statistical hypothesis testing, often used when an experiment contains a number of groups or conditions and one wants to see whether there are any statistically significant differences between them. The basic hypothesis pair is:

$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
$H_1: \mu_i \neq \mu_j$ for at least one pair $i \neq j$.

For N pieces of data, an F test with k-1 and N-k degrees of freedom is applied; in our case k is equal to 2. A high F value indicates that at least one pair of means is not equal. The confidence level determines the threshold for F; in this study we choose it as 95 percent. We performed the ANOVA test for each embedding method separately; the results are given in Table 2.

Sequential Floating Search Method (SFS)

While ANOVA tests each feature separately, the SFS algorithm takes their intercorrelation into consideration and tests features in ensembles [17]. The algorithm can be described as follows (a minimal code sketch of both selection procedures is given at the end of this section):

1. Choose the best two features from the K features, i.e., the pair yielding the best classification result;
2. Add the most significant feature from the remaining feature set, the selection being made on the basis of the feature that contributes most to the classification result when all selected features are considered together;
3. Determine the least significant feature in the selected set by conditionally removing features one by one; if the removal of the least significant feature improves the classification result, remove it and repeat step 3; otherwise, keep it and go to step 2;
4. Stop when the number of selected features equals the number of features required.

We applied the SFS method to all data hiding methods, first under the limitation of an equal number of features as in ANOVA, and then without any such constraint. The results are depicted in Table 3.

Table 2: The discriminatory features determined by ANOVA per embedding method (rows: DSSS, FHSS, ECHO, DCTwHAS, STEGA, STOOLS; columns: SNR, SNRseg, LLR, LAR, ISD, COSH, CD, BSD, MBSD, EMBSD, WSSD, PAQM, PSQM, MNB1, MNB2, CZD, SP, SPM, STFRT).

Table 3: The discriminatory features selected, per embedding method, by the SFS method: (a) linear regression used for classification, (b) SVM used for classification (same rows and columns as Table 2).

As can be expected, there is substantial overlap between these tables, especially in the case of DSSS and FHSS watermarking, and yet crucial differences exist. The spread spectrum technique of Cox et al. (DCTwHAS), which was the most difficult to detect, necessitates more features than all the other algorithms. In general, passive warden techniques make use of fewer features as compared to active warden techniques. One can notice that the PAQM and LLR features are in demand by a larger number of techniques. Another interesting observation is that the best individual features, as determined by ANOVA, can be quite different from the feature set selected when their correlation is taken into account, as determined by the SFS method. For example, the EMBSD feature is in the spotlight for ANOVA, while it is completely eliminated by the SFS selection technique.
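As a concrete illustration of the two selection procedures described above, the sketch below screens features individually with a two-group F test and then runs the floating search against an arbitrary classification criterion. The `score` callable (e.g., cross-validated classifier accuracy on a candidate feature subset) is a hypothetical plug-in, and excluding the just-added feature from the conditional removal step is a simplification to avoid cycling; neither detail is specified by the paper.

```python
import numpy as np
from itertools import combinations
from scipy.stats import f_oneway

def anova_screen(F_cover, F_stego, alpha=0.05):
    """Keep the features whose two-group F test (k = 2) rejects equality
    of the means at the 95% confidence level used in the paper."""
    keep = []
    for j in range(F_cover.shape[1]):
        _, p = f_oneway(F_cover[:, j], F_stego[:, j])
        if p < alpha:
            keep.append(j)
    return keep

def sfs(score, n_features, n_required):
    """Sequential floating forward search in the spirit of [17];
    score(subset) is any classification figure of merit, higher is better."""
    # Step 1: the best pair of features.
    selected = list(max(combinations(range(n_features), 2),
                        key=lambda s: score(list(s))))
    while len(selected) < n_required:
        # Step 2: add the most significant remaining feature.
        rest = [f for f in range(n_features) if f not in selected]
        best_add = max(rest, key=lambda f: score(selected + [f]))
        selected.append(best_add)
        # Step 3: conditionally remove the least significant feature
        # (never the one just added) while removal improves the criterion.
        while len(selected) > 2:
            drop = max((f for f in selected if f != best_add),
                       key=lambda f: score([g for g in selected if g != f]))
            if score([g for g in selected if g != drop]) > score(selected):
                selected.remove(drop)
            else:
                break
    return selected
```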
In order to classify the signals as "marked" or "not marked" on the basis of the selected audio quality features, we tested and compared two types of classifiers, namely, linear regression analysis and support vector machines.

Regression Analysis Classifier

In the design of the regression classifier, we regressed the distance measure scores to -1 and 1, depending upon whether the audio did not or did contain a hidden message. In the regression model [19], we expressed each decision label $g_i \in \{-1, 1\}$, $i = 1, \ldots, N$, as a linear combination of the quality measures,

$$g_i = \beta_1 f_{1i} + \beta_2 f_{2i} + \cdots + \beta_q f_{qi}$$

where $F_i = (f_{1i}, f_{2i}, \ldots, f_{qi})$ is the vector of the q selected features and $\beta_1, \beta_2, \ldots, \beta_q$ are the regression coefficients. The regression coefficients are estimated in the training phase and then used in the testing phase. In the test phase, the incoming audio signal is denoised, the selected quality metrics are calculated, and the distance score is obtained using the estimated regression coefficients. If the output exceeds the threshold 0, the decision is that the audio contains a message; otherwise, the decision is that the audio does not contain any message.

Support Vector Machine Classifier

The support vector method is based on an efficient multidimensional function optimization [23], which minimizes the empirical risk, that is, the training set error. For the training feature data $(F_i, g_i)$, $i = 1, \ldots, N$, $g_i \in \{-1, 1\}$, a separating hyperplane is given by $w^T F + b = 0$, where w is the normal to the hyperplane. The set of vectors is said to be optimally separated if it is separated without error and the distance between the closest vectors and the hyperplane is maximal. A separating hyperplane in canonical form must satisfy the following constraints for the i-th feature vector and its label:

$$g_i \left( w^T F_i + b \right) \geq 1, \quad i = 1, 2, \ldots, N$$

The distance d(w, b; F) of a feature vector F from the hyperplane (w, b) is

$$d(w, b; F) = \frac{\left| w^T F + b \right|}{\| w \|}$$

The optimal hyperplane is obtained by maximizing this margin. In our study we use a polynomial kernel function to separate the data. The tests and results are given in the next section.
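For concreteness, the following sketch trains both classifiers on a matrix of selected quality-measure features. It assumes scikit-learn; the randomly generated feature/label arrays stand in for real data, and the polynomial degree and regularization constant are illustrative choices, since the paper does not report its kernel parameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

# Stand-in data: rows are audio records, columns the q selected quality
# measures; labels are -1 (cover) and +1 (stego), as in the regression model.
rng = np.random.default_rng(0)
F_train, g_train = rng.normal(size=(100, 5)), rng.choice([-1, 1], size=100)
F_test = rng.normal(size=(100, 5))

# Regression classifier: regress the labels onto the features and threshold
# the predicted score at 0.
reg = LinearRegression().fit(F_train, g_train)
marked_reg = reg.predict(F_test) > 0

# SVM classifier with a polynomial kernel, as in the paper (degree and C
# are assumptions).
svm = SVC(kernel="poly", degree=3, C=1.0).fit(F_train, g_train)
marked_svm = svm.predict(F_test) > 0
```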
4. EXPERIMENTAL RESULTS

In our experiments we worked with six different data hiding algorithms overall: four watermarking and two LSB steganographic techniques. The watermarking techniques are direct-sequence spread spectrum (DSSS) [5], frequency-hopping spread spectrum (FHSS) [7], a frequency-masking technique with DCT (DCTwHAS) [7], and echo watermarking [5]. The steganographic methods are Steganos [21] and S-Tools [22]. These tools were selected on the basis of being the most popular methods with readily available software.

We use an audio data set of 100 records, each containing a three- or four-second sentence. For testing and training, we used these 100 audio records; by embedding we obtained 200 records, of which 100 are embedded and 100 are kept as the original records. A mixture containing 50 embedded records and 50 original records was used for training, while an independent similar mixture was kept for testing purposes. The test runs were repeated for the feature set selected by the ANOVA method and for the one selected by the SFS method. Furthermore, the two classification methods, linear regression and SVM, were run separately with each feature selection method.

4.1 Detection results for individual algorithms

First we present the detection results per individual embedding method. In other words, in each case the classifier has the task of classifying an audio track as to whether it bears a marking or not. The results of the four combinations (two feature selection methods, two classification methods) are tabulated in Tables 4 and 5. One can notice that the proposed scheme, especially the feature selection based on the Sequential Floating Search method coupled with the Support Vector Machine classifier (the rightmost columns in Table 5), achieves very satisfactory results. The DSSS, ECHO, STOOLS and, in a sense, FHSS methods of embedding can be detected with no error. The DCTwHAS method proves to be the most difficult to track, where we can only achieve a detection rate of 80%.

Table 4: Test results using the linear regression classifier for individual methods.

Method   | ANOVA features         | SFS features (yielding max. detection)
         | Miss Det. | False Det. | Miss Det. | False Det.
DSSS     |   0/50    |   0/50     |   0/50    |   0/50
FHSS     |   2/50    |   1/50     |   1/50    |   0/50
ECHO     |   3/50    |   5/50     |   0/50    |   1/50
DCTwHAS  |  13/50    |  12/50     |   5/50    |   8/50
STEGA    |   4/50    |   4/50     |   0/50    |   6/50
STOOLS   |   -/50    |   -/50     |   0/50    |   9/50

Table 5: Test results using the SVM classifier for individual methods.

Method   | ANOVA features         | SFS features (yielding max. detection)
         | Miss Det. | False Det. | Miss Det. | False Det.
DSSS     |   2/50    |   3/50     |   0/50    |   0/50
FHSS     |   4/50    |   4/50     |   2/50    |   0/50
ECHO     |   0/50    |   6/50     |   0/50    |   0/50
DCTwHAS  |  18/50    |  10/50     |  10/50    |   7/50
STEGA    |   3/50    |   6/50     |   1/50    |   6/50
STOOLS   |   -/50    |   -/50     |   0/50    |   0/50

4.2 Detection for ensembles of algorithms

We have carried out similar tests for the ensemble of four active warden schemes (DSSS, FHSS, ECHO, DCTwHAS) and, separately, for the ensemble of two passive warden schemes (STEGA, STOOLS). In other words, when a document was presented, the detector had to classify it as marked or non-marked without knowing which of the four active-warden methods or which of the two passive-warden methods was used for the embedding. Different sets of 100 speech records were marked separately with each watermarking method, resulting overall in 400 marked and 400 non-marked records. Similarly, different sets of 100 speech records were marked separately with each steganographic method, resulting overall in 200 marked and 200 non-marked records. These were all shared equally between training and testing patterns. Here, too, the SFS-SVM combination gives the best outcome; therefore only those results are presented in Tables 6 and 7.

Table 6: The discriminatory features determined by SFS for the ensembles of methods, with SVM as the classifier (rows: watermarking ensemble, steganography ensemble; columns: SNR, SNRseg, LLR, LAR, ISD, COSH, CD, BSD, MBSD, EMBSD, WSSD, PAQM, PSQM, MNB1, MNB2, CZD, SP, SPM, STFRT).

Table 7: Test results using the SVM classifier for the ensembles of methods.

Method        | ANOVA features          | SFS features (yielding max. detection)
              | Miss Det. | False Det.  | Miss Det. | False Det.
Watermarking  |  30/200   |  44/200     |  28/200   |  34/200
Steganography |  20/100   |  35/100     |   9/100   |  18/100

5. CONCLUSION

In this study, an audio steganalysis technique is proposed and tested. Objective audio quality measures that give clues to the presence of hidden messages were searched thoroughly. With the analysis of variance (ANOVA) method we determined the best individual features, and with the sequential floating search (SFS) technique we selected features taking their intercorrelation into account. We comparatively evaluated two classifiers, namely, linear regression and support vector machines. The SFS feature selection coupled with the SVM classifier gave the best results. The output of the classifier is the decision whether the tested document carries any hidden information or not.
The proposed method has been tested on four different watermarking and two steganographic data hiding techniques, first individually and then in combination. Experimental results show that the proposed scheme achieves satisfactory results: in the individual tests we achieve almost perfect detection, except for the DCTwHAS watermarking method. In the combination tests the results are promising, but they need to be improved. In this respect, we are investigating new ways of obtaining segment averages of the quality measures, for example, using the power mean or the maximum. We will also consider a combination of classifiers, each specialized on a subset of data hiding methods.

APPENDIX: AUDIO QUALITY MEASURES

In this Appendix we give brief descriptions of the quality measures used. We categorize the measures into perceptual and non-perceptual groups, and furthermore divide the non-perceptual group into time-domain and frequency-domain measures. The original signal (the cover document) is denoted $x(i), i = 1, \ldots, N$, and the distorted signal (the stego-document) is denoted $y(i), i = 1, \ldots, N$. In some cases the distortion is calculated from the overall data (SNR, CZD, SP, SPM). In most cases, however, the distortion is calculated over small segments and the overall measure is obtained by averaging these (SNRseg, BSD, MBSD, EMBSD, PAQM, PSQM, LLR, LAR, ISD, COSH, CD, WSSD). The segment size is taken to be 20 ms (320 samples for a 16 kHz signal); the same size is used as the window size for the MNB and STFRT techniques.

Time-Domain Measures: These measures (SNR, SNRseg, CZD) compare the two waveforms in the time domain.

Segmental Signal-to-Noise Ratio (SNRseg): SNRseg is defined as the average of the SNR values over short segments:

$$\text{SNRseg} = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10} \frac{\sum_{i=Nm}^{Nm+N-1} x^2(i)}{\sum_{i=Nm}^{Nm+N-1} \left( x(i) - y(i) \right)^2}$$

where x(i) is the original audio signal, y(i) is the distorted audio signal, N is the segment length, and M is the number of segments. The segment length is typically 15 to 20 ms for speech. The SNRseg is applied only to frames whose energy is above a specified threshold, in order to avoid silence regions.

Signal-to-Noise Ratio (SNR): The SNR is a special case of SNRseg, obtained when M = 1 and the single segment encompasses the whole record [18]. The SNR is very sensitive to the time alignment of the original and distorted audio signals. It is measured as

$$\text{SNR} = 10 \log_{10} \frac{\sum_{i=1}^{N} x^2(i)}{\sum_{i=1}^{N} \left( x(i) - y(i) \right)^2}$$

This measure has been criticized for being a poor estimator of subjective audio quality [16].

Czenakowski Distance (CZD): This is a correlation-based metric [2], which compares the time-domain sample vectors directly:

$$C = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \frac{2 \min\left( x(i), y(i) \right)}{x(i) + y(i)} \right)$$
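The three time-domain measures translate directly into code; below is a minimal sketch using the paper's 320-sample (20 ms at 16 kHz) segments. The small epsilon guarding the divisions, the omission of the silence-energy threshold, and the note on sample signs are our additions: the Czenakowski formula presumes non-negative sample values, so signed audio would first need a constant offset.

```python
import numpy as np

EPS = 1e-12  # guards divisions and logarithms; our addition

def snr(x, y):
    return 10.0 * np.log10(np.sum(x**2) / (np.sum((x - y) ** 2) + EPS))

def snr_seg(x, y, n=320):
    """Average the per-segment SNR over the M full segments of length n
    (the paper additionally skips low-energy, i.e. silent, segments)."""
    m = len(x) // n
    return float(np.mean([snr(x[i*n:(i+1)*n], y[i*n:(i+1)*n])
                          for i in range(m)]))

def czd(x, y):
    """Czenakowski distance; assumes non-negative sample vectors."""
    return float(np.mean(1.0 - 2.0 * np.minimum(x, y) / (x + y + EPS)))
```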
Frequency-Domain Measures: These measures (LLR, LAR, ISD, COSH, CD, WSSD, SP, SPM, STFRT) compare the two signals on the basis of their spectra or in terms of a linear model based on second-order statistics.

Log-Likelihood Ratio (LLR): The LLR, also called the Itakura distance [10, 12], considers a p-th order all-pole linear predictive coding (LPC) model of the speech segment,

$$x_n = \sum_{m=1}^{p} a(m) x_{n-m} + G_x u_n$$

where {a(m), m = 1, ..., p} are the prediction coefficients and $u_n$ is an appropriate excitation source. The LLR measure is then defined as

$$\text{LLR} = \log \frac{a_x^T R_x a_x}{a_y^T R_y a_y}$$

where $a_x$ is the LPC coefficient vector for the original signal x[n], $a_y$ is the corresponding vector for the distorted signal y[n], and $R_x$ and $R_y$ are the respective covariance matrices.

Log Area Ratio (LAR): The log-area ratio measure is another LPC-based technique, which uses the PARCOR (partial correlation) coefficients [18]. The PARCOR coefficients form a parameter set derived from the short-time LPC representation of the speech signal under test. The area ratio functions of these coefficients give the LAR.

Itakura-Saito Distance Measure (ISD): This is the discrepancy between the power spectrum Y(w) of the distorted signal and that of the original audio signal, X(w):

$$\text{ISD} = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \frac{Y(w)}{X(w)} - \log \frac{Y(w)}{X(w)} - 1 \right] dw$$

COSH Distance Measure (COSH): The COSH distance is the symmetric version of the Itakura-Saito distance; here, too, the overall measure is calculated by averaging the COSH values over the segments:

$$\text{COSH} = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \frac{1}{2} \left( \frac{Y(w)}{X(w)} + \frac{X(w)}{Y(w)} \right) - 1 \right] dw$$

Cepstral Distance (CD): The cepstral distance is defined between the cepstral coefficients of the original and distorted signals; the cepstral coefficients can also be computed from the LPC parameters [13]. An audio quality measure based on the first L cepstral coefficients $c_x(k)$ and $c_y(k)$ of the original and distorted signals, respectively, can be computed for the m-th frame as

$$d(c_x, c_y, m) = \left\{ \left[ c_x(0) - c_y(0) \right]^2 + 2 \sum_{k=1}^{L} \left[ c_x(k) - c_y(k) \right]^2 \right\}^{1/2}$$

The distortion over all frames is calculated as

$$\text{CD} = \frac{\sum_{m=1}^{M} w(m) \, d(c_x, c_y, m)}{\sum_{m=1}^{M} w(m)}$$

where M is the total number of frames and w(m) is a weight associated with the m-th frame. The weight could, for example, be the energy in the reference frame. In this study we use a 20 ms frame length and the frame energies as the weights.

Spectral Phase and Spectral Phase-Magnitude Distortions: The phase and/or magnitude spectrum differences [2] have been observed to be sensitive to image and data hiding artifacts. They are defined as

$$\text{SP} = \frac{1}{N} \sum_{w=1}^{N} \left| \varphi_x(w) - \varphi_y(w) \right|^2$$

$$\text{SPM} = \frac{1}{N} \left[ \lambda \sum_{w=1}^{N} \left| \varphi_x(w) - \varphi_y(w) \right|^2 + (1 - \lambda) \sum_{w=1}^{N} \left( X(w) - Y(w) \right)^2 \right]$$

where SP is the spectral phase distortion and SPM is the spectral phase-magnitude distortion, $\varphi_x(w)$ is the phase spectrum of the original signal, $\varphi_y(w)$ is the phase spectrum of the distorted signal, X(w) and Y(w) are the magnitude spectra of the original and distorted signals, respectively, and $\lambda$ is chosen to attach commensurate weights to the phase and magnitude terms.

Short-Time Fourier-Radon Transform Measure (STFRT): Given the short-time Fourier transform (STFT) of a signal, its time projection gives the magnitude spectrum, while its frequency projection yields the magnitude of the signal itself. More generally, rather than taking only the vertical and horizontal projections, if we consider all the other angles we obtain the Radon transform of the STFT mass. We define the mean-square distance between the Radon transforms of the STFTs of two signals as a new objective audio quality measure.
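As an illustration of the spectral distances, the sketch below evaluates discrete approximations of the ISD and COSH integrals from Welch power spectra computed over 20 ms windows. Using the periodogram rather than an LPC model spectrum, and the epsilon regularizer, are our simplifications.

```python
import numpy as np
from scipy.signal import welch

def isd_cosh(x, y, fs=16000, eps=1e-12):
    """Discrete ISD and COSH distances from the power spectra X(w), Y(w)
    of the original x and the distorted y."""
    _, X = welch(x, fs=fs, nperseg=320)  # 20 ms windows at 16 kHz
    _, Y = welch(y, fs=fs, nperseg=320)
    r = (Y + eps) / (X + eps)
    isd = float(np.mean(r - np.log(r) - 1.0))
    cosh_d = float(np.mean(0.5 * (r + 1.0 / r) - 1.0))
    return isd, cosh_d
```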
Perceptual Measures: These measures (WSSD, BSD, MBSD, EMBSD, PAQM, PSQM, MNB) explicitly take into account the properties of the human auditory system.

Bark Spectral Distortion (BSD): The BSD measure is based on the assumption that speech quality is directly related to speech loudness [26]. The signals are subjected to critical-band analysis, equal-loudness pre-emphasis, and the intensity-loudness power law. The BSD estimates the overall distortion by the average Euclidean distance between the loudness vectors of the reference and of the distorted audio. The Bark spectral distortion is calculated as

$$\text{BSD} = \sum_{i=1}^{K} \left[ S_x(i) - S_y(i) \right]^2$$

where K is the number of critical bands, and $S_x(i)$ and $S_y(i)$ are the Bark spectra in the i-th critical band of the original and the distorted speech, respectively. In this study the BSD is extended to the audio band: for speech, 18 critical bands (up to 3.7 kHz) are used, whereas in our study, in order to measure distortions over the audio band, we calculated and used 25 critical bands (up to 15.5 kHz). The overall distortion is calculated by averaging the BSD values of the speech segments.

Modified Bark Spectral Distortion (MBSD): The MBSD is a modification of the BSD which incorporates a noise-masking threshold in order to differentiate between audible and inaudible distortions [28]. Any inaudible loudness difference $|S_x(i) - S_y(i)|$ below the noise-masking threshold is excluded from the calculation of the perceptual distortion. The perceptual distortion of a frame is defined as the sum of the loudness differences that exceed the noise-masking threshold, formulated as

$$\text{MBSD} = \sum_{i=1}^{K} M(i) \, D_{xy}(i)$$

where M(i) and $D_{xy}(i)$ denote the indicator of perceptible distortion and the loudness difference in the i-th critical band, respectively, and K is the number of critical bands. The global MBSD value is calculated by averaging the MBSD scores over the non-silence frames.

Enhanced Modified Bark Spectral Distortion (EMBSD): The EMBSD is a variation of the MBSD in which only the first 15 loudness components (instead of the 24 Bark bands) are used to calculate the loudness differences, the loudness vectors are normalized, and a new cognition model is assumed, based on post-masking as well as temporal masking effects as in [29].

Perceptual Audio Quality Measure (PAQM): In the PAQM, a model of the human auditory system is emulated [3]. The transformation from the physical domain to the psychophysical (internal) domain is performed by time-frequency spreading and level compression, such that the masking behavior of the human auditory system is taken into account. The signal is first transformed into the short-time Fourier domain, then the frequency scale is converted into the pitch scale z (in Bark) and the signal is filtered to model the transfer from the outer ear to the inner ear. This results in a power-time-pitch representation. Subsequently the resulting signal is smeared and convolved with the frequency-spreading function, and finally transformed into a compressed loudness-time-pitch representation. The quality of an audio system can then be measured using this compressed loudness-time-pitch representation.

Perceptual Speech Quality Measure (PSQM): The PSQM is a modified version of the PAQM [4], in fact a version optimized for speech. For example, for the loudness computation PSQM does not include temporal or spectral masking, and it applies a nonlinear scaling factor to the loudness vector of the distorted speech. PSQM has been adopted as ITU-T Recommendation P.861.

Weighted Slope Spectral Distance Measure (WSSD): A smooth short-time audio spectrum can be obtained using a filter bank consisting of thirty-six overlapping filters of progressively larger bandwidth [14]. The filter bandwidths approximate critical bands in order to give equal perceptual weight to each band. Klatt uses weighted differences between the spectral slopes in each band [15], since spectral variation plays an important role in the human perception of audio quality. The spectral slope is computed in each critical band as $V_x(k) = X(k+1) - X(k)$ and $V_y(k) = Y(k+1) - Y(k)$, where {X(k), Y(k)} are the spectra in decibels, {$V_x(k)$, $V_y(k)$} are the first-order slopes of these spectra, and k is the critical band index.
Next, a weight for each band is calculated based on the magnitude of the spectrum in that band, and the distance is computed as

$$\text{WSSD} = \sum_{k=1}^{36} w(k) \left( V_x(k) - V_y(k) \right)^2$$

where the weight w(k) is chosen according to the spectral maximum. The WSSD is computed separately for each 12 ms audio segment, and the overall distance is obtained by averaging over the segments.

Measuring Normalizing Blocks (MNB): The MNB emphasizes the important role of the cognition module in estimating speech quality [25]. The technique is based on a transformation of the speech signals into an approximate loudness domain through frequency warping and logarithmic scaling, which are the two important factors in the human auditory response. The MNB considers the human listener's sensitivity to the distribution of distortion, so it uses hierarchical structures that work from larger time and frequency scales down to smaller ones. It integrates over frequency scales and measures differences over time intervals, and likewise it integrates over time intervals and measures differences over frequency scales. These measuring normalizing blocks are linearly combined to estimate the overall speech distortion.
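The per-frame WSSD computation can be sketched as follows, given the filter-bank outputs in decibels. The 36-band critical-band filter bank itself is assumed to be available, and the specific weighting rule used here (emphasizing bands near the spectral maximum) is only one of Klatt's proposals, not necessarily the paper's exact choice.

```python
import numpy as np

def wssd_frame(X_db, Y_db):
    """Weighted slope spectral distance for one 12 ms frame, where X_db and
    Y_db hold the 36 critical-band energies (in dB) of the original and the
    distorted signal; the overall WSSD averages this over all frames."""
    Vx = np.diff(X_db)  # first-order spectral slopes V_x(k)
    Vy = np.diff(Y_db)
    # Weight each band by its closeness to the spectral maximum
    # (an assumed, simplified form of Klatt's weighting).
    w = 1.0 / (1.0 + np.max(X_db) - X_db[:-1])
    return float(np.sum(w * (Vx - Vy) ** 2))
```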
REFERENCES

[1] Avcıbaş, İ., N. Memon, and B. Sankur, "Steganalysis using image quality metrics", IEEE Trans. on Image Processing, January 2003.
[2] Avcıbaş, İ., B. Sankur, and K. Sayood, "Statistical evaluation of image quality metrics", Journal of Electronic Imaging, vol. 11, no. 2, pp. 206-223, April 2002.
[3] Beerends, J. G., and J. A. Stemerdink, "A perceptual audio quality measure based on a psychoacoustic sound representation", J. Audio Eng. Soc., vol. 40, pp. 963-978, Dec. 1992.
[4] Beerends, J. G., and J. A. Stemerdink, "A perceptual speech quality measure based on a psychoacoustic sound representation", J. Audio Eng. Soc., vol. 42, pp. 115-123, Mar. 1994.
[5] Bender, W., D. Gruhl, N. Morimoto, and A. Lu, "Techniques for data hiding", IBM Systems Journal, vol. 35, no. 3&4, pp. 313-336, 1996.
[6] Coifman, R. R., and D. L. Donoho, "Translation-invariant de-noising", in Wavelets and Statistics, A. Antoniadis and G. Oppenheim, Eds., Springer-Verlag Lecture Notes, 1995.
[7] Cox, I. J., J. Kilian, F. T. Leighton, and T. Shamoon, "Secure spread spectrum watermarking for multimedia", IEEE Trans. on Image Processing, vol. 6, no. 12, pp. 1673-1687, December 1997.
[8] Fridrich, J., M. Goljan, and R. Du, "Reliable detection of LSB steganography in color and grayscale images", Proc. of the ACM Workshop on Multimedia and Security, Ottawa, Canada, pp. 27-30, October 5, 2001.
[9] Hyvärinen, A., P. Hoyer, and E. Oja, "Sparse code shrinkage for image denoising", in Proc. IEEE Int. Joint Conf. on Neural Networks, pp. 859-864, Anchorage, Alaska, 1998.
[10] Itakura, F., "Minimum prediction residual principle applied to speech recognition", IEEE Trans. Acoust., Speech and Signal Processing, vol. ASSP-23, no. 1, pp. 67-72, Feb. 1975.
[11] Johnson, N. F., and S. Jajodia, "Steganalysis of images created using current steganography software", in D. Aucsmith (Ed.), Information Hiding, LNCS 1525, pp. 32-47, Springer-Verlag, Berlin Heidelberg, 1998.
[12] Juang, B. H., "On using the Itakura-Saito measure for speech coder performance evaluation", AT&T Bell Laboratories Tech. Journal, vol. 63, no. 8, pp. 1477-1498, Oct. 1984.
[13] Kitawaki, N., H. Nagabuchi, and K. Itoh, "Objective quality evaluation for low-bit-rate speech coding systems", IEEE J. Select. Areas Commun., vol. 6, pp. 242-248, Feb. 1988.
[14] Klatt, D. H., "A digital filter bank for spectral matching", Proc. 1976 IEEE ICASSP, pp. 573-576, Apr. 1976.
[15] Klatt, D. H., "Prediction of perceived phonetic distance from critical-band spectra: a first step", Proc. 1982 IEEE ICASSP, Paris, pp. 1278-1281, May 1982.
[16] McDermott, B. J., C. Scaglia, and D. J. Goodman, "Perceptual and objective evaluation of speech processed by adaptive differential PCM", Proc. IEEE ICASSP, Tulsa, pp. 581-585, Apr. 1978.
[17] Pudil, P., J. Novovicova, and J. Kittler, "Floating search methods in feature selection", Pattern Recognition Letters, vol. 15, pp. 1119-1125, 1994.
[18] Quackenbush, S. R., T. P. Barnwell III, and M. A. Clements, Objective Measures of Speech Quality, Prentice Hall, Englewood Cliffs, 1988.
[19] Rencher, A. C., Methods of Multivariate Analysis, John Wiley, New York, 1995.
[20] Simmons, G. J., "The prisoners' problem and the subliminal channel", in Advances in Cryptology: Proceedings of CRYPTO '83, pp. 51-67, 1984.
[21] Steganos, www.steganos.com.
[22] Brown, A., S-Tools version 4.0, 1996, http://members.tripod.com/steganography/stego/stools4.html.
[23] Vapnik, V., The Nature of Statistical Learning Theory, Springer, New York, 1995.
[24] Voloshynovsky, S., S. Pereira, V. Iquise, and T. Pun, "Attack modeling: towards a second generation watermarking benchmark", Signal Processing, vol. 81, pp. 1177-1214, 2001.
[25] Voran, S., "Objective estimation of perceived speech quality, part I: development of the measuring normalizing block technique", IEEE Transactions on Speech and Audio Processing, 1999.
[26] Wang, S., A. Sekey, and A. Gersho, "An objective measure for predicting subjective quality of speech coders", IEEE J. Select. Areas Commun., vol. 10, pp. 819-829, June 1992.
[27] Westfeld, A., and A. Pfitzmann, "Attacks on steganographic systems", in Information Hiding, LNCS 1768, pp. 61-66, Springer-Verlag, Heidelberg, 1999.
[28] Yang, W., M. Dixon, and R. Yantorno, "A modified bark spectral distortion measure which uses noise masking threshold", IEEE Speech Coding Workshop, pp. 55-56, Pocono Manor, 1997.
[29] Zwicker, E., and H. Fastl, Psychoacoustics: Facts and Models, Springer-Verlag, 1990.