AUDIO STEGANOGRAPHY FOR COVERT DATA TRANSMISSION BY IMPERCEPTIBLE TONE INSERTION Kaliappan Gopalan1 and Stanley Wenndt2 1 Department of Engineering, Purdue University Calumet, Hammond, IN 46323 gopalan@calumet.purdue.edu 2 Multi-Sensor Exploitation Branch, AFRL/IFEC, Rome, NY 13441 Stanley.Wenndt @rl.af.mil ABSTRACT This paper presents the technique of embedding data in an audio signal by inserting low power tones and its robustness to noise and cropping of embedded speech samples. Experiments on the embedding procedure applied to cover audio utterances from noise-free TIMIT database and a noisy database demonstrate the feasibility of the technique in terms of imperceptible embedding, high data rate and accurate data recovery. The low power levels ensure that the tones are inaudible in the message-embedded stego signal. Besides imperceptibility in hearing, the spectrogram of the stego signal also conceals the existence of embedded information. Both of these features render the detection of embedding in the stego signal difficult to accomplish. Oblivious detection of the stego signal, instead of escrow detection, yields the embedded information accurately. In addition, results of two cases of attacks on the data-embedded stego audio, namely, additive noise and random cropping, show the technique is robust for covert communication and steganography. Keywords: Audio Steganography, Imperceptible tone insertion 1. INTRODUCTION Covert communication by embedding a message or data file in a cover medium has been increasingly gaining importance in the all-encompassing field of information technology. Audio steganography is concerned with embedding information in an innocuous cover speech in a secure and robust manner. Communication and transmission security and robustness are essential for transmitting vital information to intended sources while denying access to unauthorized persons. By hiding the information using a cover or host audio as a wrapper, the existence of the information is concealed during transmission. This is critical in applications such as battlefield communications and bank transactions, for example. Steganography, in general, relies on the imperfection of the human auditory and visual systems. Audio steganography takes advantage of the psychoacoustical masking phenomenon of the human auditory system [HAS]. Psychoacoustical, or auditory masking property renders a weak tone imperceptible in the presence of a strong tone in its temporal or spectral neighborhood. This property arises because of the low differential range of the HAS even though the dynamic range covers 80 dB below ambient level [1, 2]. Frequency masking occurs when human ear cannot perceive frequencies at lower power level if these frequencies are present in the vicinity of tone- or noise-like frequencies at higher level. Additionally, a weak pure tone is masked by wide-band noise if the tone occurs within a critical band. This property of inaudibility of weaker sounds is used in different ways for embedding information. Embedding of data by inserting inaudible tones in cover audio signal has been presented recently [3,4]. The following sections describe the tone insertion technique and its robustness in retrieving the embedded information in the presence of noise and in cropped frames of received speech. 2. EMBEDDING BY TONE INSERTION The tone insertion method relies on the inaudibility of low power tones in the presence of significantly higher spectral components – an indirect exploitation of the psychoacoustic masking phenomenon in the spectral domain. Experiments were conducted using utterances from (a) the TIMIT database, and (b) the Greenflag database consisting of noisy recordings of air traffic controllers, as host or cover audio samples. In the first experiment, two tones at frequencies f0 and f1 are generated for embedding bit 0 and bit 1 respectively. As seen in Fig. 1, the host audio is divided into nonoverlapping segments of 16 ms in duration. For the host utterances used, f0 is set at 1875 Hz and f1 at 2625 Hz arbitrarily. For every frame of host audio, the frame power fe, is computed and only one bit of data is embedded into the host audio frame. If the bit to be embedded is a 0, then the power of f0 is set at 0.25 percent of fe and the power of f1 is a bit of 1, the power of f1 is set at 0.25 percent of fe and the power of f0 is set at 0.001 of the power of f1. The simultaneous setting of significant and extremely low powers to the tones facilitates concealed embedding and Host Audio Segment to 16 ms frames Compute frame power, fe Information to embed Embed 0 N o Power of f1 → 0.25% of fe Power of f0 → 0.001 of f1 set at 0.001 of that of f0. To embed Embedding Two Bits Per Audio Frame The second experiment extended the technique to double the payload capacity by using four tones. For this experiment, one tone out of a selection of four tones, namely, 750 Hz, 1250 Hz, 1875 Hz, and 2625 Hz, was set to 0.25 percent of the average power of each frame while the other tones were set to negligible values. For detection, the ratio of frame power to power at each tone was used. This ratio, clearly, is a minimum for the tone that was set at 0.25 percent of the frame power. To add further security in transmission, a 4-bit key for each frame was used at the transmitter to determine which one of the four tones would be set at high power relative to the other three tones. Embedded two-bit combination from each frame was recovered by using the same key at the receiver and detecting the relative power ratio of each tone and the frame. The results of these experiments with noise and cropping are presented in the next section. 3. EXPERIMENTAL RESULTS Y Power of f0 → 0.25% of fe Power of f1 → 0.001 of f0 Quantize to 16 bits Transmit Frame correct detection of data. The low and relatively high power Fig. 1 Tone insertion algorithm ratios avoid one or both of the tones being detected in hearing or spectrogram – if only one of the tones is set to a fixed power ratio relative to the frame power, the other tone may be noticeable in cases where the host frame inherently has a substantial component at the tone frequency. The second advantage is that a known high/low ratio of power between the tones facilitates the detection of the embedded bit even when the embedded amplitudes are scaled or quantized. The frames with their spectral components at the tone frequencies set in accordance with the data bits constituted the stego signal. For transmission, the embedded-frame is quantized to 16 bits, which is the same as the original host audio sample size. For recovering the covert information from every received frame of audio, the frame power fe is computed along with the power p0 and p1 at f0 and f1. If the ratio, (fe/ p0) > (fe/ p1), then the covert bit is declared a 0. Otherwise, the covert bit embedded in the frame is considered a 1. In the first experiment using the TIMIT utterance, “She had your dark suit in greasy wash water all year,” which is available as 16 bit samples at the rate of 16,000 per second, nonoverlapped frames of 256 samples were used to embed one bit in each. With 208 frames, a random data of 208 bits were embedded by inserting tones at 1875 Hz and 2625 Hz with appropriate powers, and the tone-inserted stego signal was quantized to 16 bits for transmission. From the stego, all 208 bits were successfully recovered from the ratios of frame power to power at the tone frequencies. From informal listening tests and from the spectrograms, the stego signal was found indistinguishable from the unembedded host audio signal. In the second experiment, four tones were used to embed two bits in each frame. In addition, successive frames for embedding were overlapped with 50 percent to further increase the payload capacity. After verifying the imperceptibility of and the data recovery from the stego signal, the technique was extended for use in covert battlefield communication in which the hidden information can be another utterance. For initial studies, the utterance, “seven one” spoken by a male speaker, was used as the covert message. This utterance was represented in the Global System for Mobile communication half-rate (GSM 06.20) coding scheme resulting in a compact form of 2800 bits. Two TIMIT utterances (each with 16 bit samples and 16,000 samples/s) were concatenated to accommodate the large covert information size. With two bits inserted in each host frame of 256 samples, only the first 1400 overlapped frames out of a total of 1542 were used for embedding all Figs. 2 and 3 show the host and the stego signals and their spectrograms using the frequency-hopped four-tone insertion for embedding the covert message. No perceptual or otherwise detectible difference was noticed between the host and the stego signals and all the embedded data were correctly recovered from the stego signal. 4 4 Host - TIMIT x 10 2 0 -2 0 0.2 0.4 0.6 10 0.8 1 1.2 1.4 1.6 1.8 2 5 4 x 10 Stego - 2 bits/frame x 10 5 0 -5 0 0.2 0.4 0.6 0.8 1 1.2 Sample index 1.4 1.6 1.8 2 errors results in sufficient quality to convey the message albeit with noise. Hence, noise power levels as high as 5 percent of frame power, which resulted in only 70 to 80 bits in error out of a total of 2800 bits, can be used to transmit covert messages in encoded form with the tone embedding technique. Host - TIMIT Frequency 8000 6000 4000 2000 0 0 2 4 6 8 10 12 Time Stego - 2 bits/frame 8000 Frequency the covert message bits. This gives an embedding capacity of 2800 bits in 11.208 s , or 249.82 bits/s. Tones for insertion were selected at frequencies of 687.5 Hz, 1187.5 Hz, 1812.5 Hz, and 2562.5 Hz. These frequencies were either absent or weak in the host frames. With four tones, however, an additional step was necessitated to prevent the detection of embedding. Presence of a continuous stream of 0’s or 1’s in the covert data, for instance, results in the same tone being set at 0.25 percent of the corresponding frame power. Although a listener may not be able to perceive the tone because of its low power, the spectrogram is likely to show continuous spectral nulls or ‘holes’ at the remaining three tone frequencies. To a malicious attacker, these artifacts are indicative of host manipulation even without the knowledge of host spectrogram. Use of a 4-bit key that sets the order of the tones for each frame by frequency hopping avoids such an obvious detection of embedding [3]. 6000 4000 2000 0 0 2 4 6 8 10 Time 12 Fig. 3 Spectrograms of host (top) and stego with 2800 bits embedded (bottom) A more serious attack on the stego during transmission than additive noise is the random deletion of a few samples. Since removal of up to one in 50 samples has been shown to cause no perceptible difference [5], we studied its effect on data recovery. With the removal of five samples from each stego frame of 256 samples and replacing them with zeros, bit errors of 4 to 10 out of 2800 were observed. When 10 samples in each stego frame were replaced zeros, the bit error increased to 52 to 62. When the removed samples were replaced by their neighbors, the error increased to about 30 for five sample replacement and to about 80 for 10 sample replacement. Still, as noted previously, a malicious attack on the stego may render the host message noisy while still carrying the coded covert audio message in a perceivable form. 5 x 10 Fig. 2 Host (top) and stego with 2800 bits embedded (bottom) To study the effect of noise on the recovery of embedded information, Gaussian noise with zero mean and average power proportional to the frame power of embedded stego was added to each frame. Random noise at low power is unlikely to increase the power of any of the three tones to a level that exceeds the power level of the significant tone. Hence, the pair of the embedded bits from each of the noiseadded stego frames was successfully recovered up to noise power set to 25 percent of frame power. Bit errors started showing up at higher noise power levels. However, speech decoded from GSM-coded speech with up to 10 percent bit In addition to the clean host from the TIMIT database, the frequency-hopped tone insertion was also used in an experiment with a noisy host from the Greenflag database. Obtained as 16-bit PCM data at a rate of 8000 samples per second, the Greenflag database consists of utterances from the cockpit of fighter aircraft. Because of the high level of noise inherent in the host, externally introduced tone or noise arising from embedding is generally not noticeable. Figs. 4 and 5 show the result of embedding the GSM-coded covert speech, ‘seven one’ on a host consisting of two utterances from the Greenflag database. Using 128 samples per frame the host of 80128 samples has 1251 frames which can embed only 2502 bits out of the 2800 bits of the coded covert speech. This gives an embedding capacity of 249.8 bits/s. Because of the high level of intrinsic noise in the host, the dominant tone power was raised to more than 10 percent. Although the stego signal did not show any perceptual difference from the host, the higher tone power started showing up in the spectrogram. To mask the dominant tone in the spectrogram, the tones were set to frequencies in the range where the host has significant energy. In the 400 Hz to 1000 Hz range, for example, the host has relatively high spectral energy over almost the entire duration. Hence, inserting tones at frequencies of 625 Hz, 750 Hz, 875 Hz and 1062.5 Hz, with as much as 25 percent of frame tones, randomized because of the hopping key, were clearly masked by the already significant spectral components in the host. Fig. 5 shows the spectrograms of the host and stego for comparison. With additive noise raised up to each stego frame power, all 2502 bits were correctly recovered. At higher noise levels of up to 2.5 times the frame power, the bit error was below 80. Cropping by zeroing or replacing from 3 to 50 samples in each stego frame caused no bit error due to relatively higher power of the inserted tones. At higher number of destroyed samples, stego became highly noisy; still, the bit error was negligible. 4. DISCUSSION AND CONCLUSION 4 4 Host - GF x 10 2 0 -2 0 1 2 3 10 4 5 6 7 8 9 4 4 x 10 Stego - 2 bits/frame x 10 5 0 -5 0 1 2 3 4 5 Sample index 6 7 8 9 4 x 10 Fig. 4 Greenflag utterance host (top) and GSM-coded bits embedded stego 2000 Malicious attacks involving replacement of embedded samples with zeros or neighboring values appear to cause less loss of data at smaller number of samples. Cropping of a large number of samples destroys the cover audio, however. 1000 REFERENCES Host - GF Frequency 4000 3000 0 0 1 2 3 4 5 6 7 8 9 Time Stego - 2 bits/frame 4000 Frequency The results of the present experiments clearly demonstrate the feasibility of the proposed technique for audio steganography with imperceptibility, payload and data recovery. At the embedding rate of approximately 250 bits/s, the technique has a high payload capacity. Any attempt to increase the capacity further must use more than four tones. However, use of eight tones for embedding 3 bits/frame, for example, may lead to audible and/or visible artifacts unless the selected tones are absent in the host audio. Also, noise – intentional or unintentional – may cause high bit errors at high capacity. Noisy host signals, on the other hand, can have larger payload and use higher power without significant loss of data. In general, tones selected from high energy regions of the host can be masked in hearing and spectrogram by their low power levels. 3000 2000 1000 0 0 1 2 3 4 5 6 7 8 9 Time Fig. 5 Spectrograms of Greenflag host (top) and stego with 2502 bits (bottom) power in the dominant tone did not result in any noticeable difference in speech quality or spectrogram; the inserted [1] W. Bender, D. Gruhl, N. Morimoto and A.Lu, “Techniques for data hiding,” IBM Systems Journal, Vol. 35, Nos. 3 & 4, pp. 313-336, 1996. [2] M.D. Swanson, M. Kobayashi, and A.H. Tewfik, “Multimedia data-embedding and watermarking technologies,” Proc. IEEE, Vol. 86, pp. 1064-1087, June 1998. [3] K. Gopalan, S. Wenndt, A. Noga, D. Haddad, and S. Adams, “Covert Speech Communication Via Cover Speech By Tone Insertion,” Proc. of the 2003 IEEE Aerospace Conference, Big Sky, MT, Mar. 2003 (on CD). [4] K. Gopalan, et al, “Covert Speech Communication Via Cover Speech By Tone Insertion,” U.S. Patent applied for, Oct. 2003. [5] R.J. Anderson and F.A.P. Petitcolas, “On the limits of steganography,” IEEE J. Selected Areas in Communications, Vol. 16, No. 4, pp.474-481, May 1998.