File Formats, the Internet, and More

Introduction

Ultimately, once a sound has been digitized and (optionally) processed, it will need to be either stored or transmitted elsewhere. Just as there are tradeoffs between sound quality and the specifics of quantization, there are tradeoffs between sonic quality and storage size or transmission bandwidth.

File Formats

There are many file formats available for audio files including WAVE (WAV), AIFF, AU, SND, VOC, MP3, and others. Perhaps the most widely used file formats (in terms of application support anyway) are the venerable WAV and AIFF formats found on Windows and Macs, respectively. These cover numerous variants, although the most widely used are mono or stereo linear PCM at 8 or 16 bit resolution recorded at either 22.05 kHz or 44.1 kHz. There are many other possibilities including floating point data, multiple channels beyond stereo, a variety of sample rates, and possible data compression schemes, although these tend not to have nearly as wide support.

The file formats generally consist of a short header which includes important items such as the sample rate, number of channels, and so on, followed by the data samples themselves (interleaved if there are multiple channels [1]). Consequently, the size of the file in bytes is only modestly larger than length_in_seconds * sample_rate * bytes_per_sample * number_of_channels, perhaps by a hundred or so bytes.

Recalling that CD quality audio requires 16 bit resolution at 44.1 kHz sampling, each mono minute requires over 5 megabytes of storage along with a streaming bandwidth of over 700 kilobits per second per channel. It is clear that large audio libraries take up vast space in this form and require T-1 capacity for real time streaming of stereo material. In order to reduce disk usage and lower bandwidth requirements, some users opt to lower sound quality by reducing the sampling rate or bit resolution.
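The storage and bandwidth figures quoted above, and the savings from a lower sampling rate and resolution, can be checked with a short calculation. The sketch below is only an illustration: the function names are made up for this example, the header is ignored, and the T-1 figure is the nominal line rate of 1.544 megabits per second.

```python
def pcm_bytes(seconds, sample_rate, bits, channels):
    """Size of the raw sample data in bytes (the file header is not included)."""
    return int(seconds * sample_rate * (bits // 8) * channels)


def pcm_bit_rate(sample_rate, bits, channels):
    """Bits per second needed to stream the samples in real time."""
    return sample_rate * bits * channels


T1_BITS_PER_SECOND = 1_544_000  # nominal T-1 line rate

cd_mono_minute = pcm_bytes(60, 44100, 16, 1)   # 5,292,000 bytes
cd_mono_rate = pcm_bit_rate(44100, 16, 1)      # 705,600 bits per second
cd_stereo_rate = pcm_bit_rate(44100, 16, 2)    # 1,411,200 bits per second
low_mono_minute = pcm_bytes(60, 22050, 8, 1)   # 1,323,000 bytes

print("CD quality, one mono minute:", cd_mono_minute, "bytes")
print("CD quality, per channel:", cd_mono_rate, "bits per second")
print("CD quality stereo fits in a T-1:", cd_stereo_rate <= T1_BITS_PER_SECOND)
print("8 bit, 22.05 kHz, one mono minute:", low_mono_minute, "bytes")
```

Halving both the sampling rate and the resolution cuts the data to one quarter of the CD quality figure.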
This is acceptable for some applications, but few people would give up CD quality audio in favor of an 8 bit 22.05 kHz variant.

One possibility is to use a form of compression on the audio files, such as that used for “zipping” email attachments. This form of compression is known as lossless because the compression and decompression processes are symmetrical: you get back exactly what you started with. Lossless compressors work by finding mathematical redundancies in the file and replacing them with something smaller. A good example can be made using a text file. Suppose that you compress a letter to a friend. In this letter the character combination “ the ” occurs numerous times. That is a five byte sequence. Suppose you replace every occurrence with a special number that lies outside the normal range of ASCII character codes. This one byte takes the place of five, for a four byte savings per occurrence. On decompression, the special character is expanded back into “ the ”. There will, of course, need to be some sort of table to indicate what the replacement is, but for a large file this overhead can be ignored.

This technique works quite well with certain types of data. Text files can usually be compressed to less than half of their normal size, and in some cases considerably less than that. Unfortunately, the more variation there is in the data, the less effective the compression will be. Musical waveforms generally do not contain much redundancy of the sort exhibited by text files.

[1] Interleaving makes for more efficient access from a hard drive if the sound is to be sent in real time to a playback or transmission device.

ET163 Audio Technology Lecture Notes: File Formats, the Internet, and More

To illustrate this, the popular WinZip program was applied to a one second long CD quality generated sine wave file. The original file was approximately 88 kilobytes. The zipped file was merely 85 kilobytes, for a savings of just over 3%.
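Both ideas can be tried in a few lines. The sketch below first applies the “ the ” substitution described earlier, then repeats the spirit of the WinZip experiment using Python's zlib module as a stand-in compressor (an assumption, as WinZip itself is not used here). The names, the sample letter, the 997 Hz sine frequency, and the pulse shape are all chosen for illustration; the pulse is the case discussed next.

```python
import math
import struct
import zlib

TOKEN = b"\x80"     # one byte outside the 7 bit ASCII range
PHRASE = b" the "   # the five byte sequence it stands for


def shrink(text):
    """Replace every occurrence of the phrase with the one byte token."""
    return text.replace(PHRASE, TOKEN)


def expand(data):
    """Reverse the substitution, restoring the original bytes."""
    return data.replace(TOKEN, PHRASE)


letter = b"I took the bus to the shop near the old mill."
packed = shrink(letter)
assert expand(packed) == letter                  # lossless: nothing is lost
print(len(letter) - len(packed), "bytes saved")  # 3 occurrences x 4 bytes = 12

RATE = 44100  # one second of 16 bit mono audio


def to_pcm16(samples):
    """Pack integer samples as little-endian 16 bit PCM data."""
    return struct.pack("<%dh" % len(samples), *samples)


# 997 Hz is picked so the waveform never repeats exactly within the second.
sine = to_pcm16([round(32767 * math.sin(2 * math.pi * 997 * n / RATE))
                 for n in range(RATE)])
# A single rectangular pulse: silence, full scale, silence.
pulse = to_pcm16([32767 if RATE // 4 <= n < 3 * RATE // 4 else 0
                  for n in range(RATE)])

for name, raw in (("sine", sine), ("pulse", pulse)):
    zipped = zlib.compress(raw, 9)
    assert zlib.decompress(zipped) == raw        # round trip is exact
    print(name, len(raw), "->", len(zipped), "bytes")
```

The exact compressed sizes depend on the compressor and on the test signals chosen, so they will not match the WinZip figures precisely.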
In contrast, a similar length pulse (not bandwidth limited) shrank from 88 kilobytes to less than 1 kilobyte. Unfortunately, real world music signals are much closer to the former case than the latter. This form of compression is generally of minimal use for audio.

Lossy Compression, Perceptual Coding, and MP3

In contrast to lossless, or perfectly reversible, compression is something called lossy compression. With lossy compression, the decompressed version is not identical to the original. The idea is to first analyze the file and determine which parts of it are significant. The not-so-significant parts are thrown away, so there is less data to compress, which leads to better compression ratios. The trick here is to determine what is and what is not significant. This is tied directly to the information in the file. That is, the algorithm that determines significance will be entirely different for audio files and picture files, for example. The algorithm attempts to model the human perception system in use. In the case of audio files, that means a psychoacoustic model of human hearing is needed.

Simply put, the perceptual coder tosses away anything that you can’t hear. This includes frequency components below the absolute hearing threshold and components that are masked by louder, nearby frequency content. This simplified data is then used for the new file. Because items have been thrown away, the decompressed version is not identical to the original, although if done properly it should sound no different. If the perceptual coder is adjusted so that it throws away “less important” content instead of only inaudible content, even greater compression can be achieved, although with a reduction in quality. Perceptual coders are what make the high compression rates of JPG (JPEG) and MP3 (MPEG layer 3) files possible.
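The “toss away what you can’t hear” step can be pictured with a deliberately crude sketch. It is not how an MP3 encoder works: a single fixed floor, 60 dB below the strongest component, stands in for a real psychoacoustic model of the hearing threshold and masking, and the block size, test tones, and function names are all assumptions made for this example.

```python
import cmath
import math

N = 256           # samples in one analysis block (assumed)
FLOOR_DB = -60.0  # assumed audibility floor, relative to the strongest component


def dft(x):
    """Plain discrete Fourier transform (slow, but fine for a small block)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


# Test signal: a loud tone plus a far quieter one (both hypothetical).
signal = [math.sin(2 * math.pi * 16 * t / N)
          + 0.0005 * math.sin(2 * math.pi * 40 * t / N)
          for t in range(N)]

spectrum = dft(signal)
peak = max(abs(c) for c in spectrum)
floor = peak * 10 ** (FLOOR_DB / 20)

# Keep only the components at or above the floor; zero out the rest.
kept = [k for k, c in enumerate(spectrum) if abs(c) >= floor]
simplified = [c if abs(c) >= floor else 0 for c in spectrum]

decoded = idft(simplified)
worst_error = max(abs(a - b) for a, b in zip(signal, decoded))

print("components kept:", kept)
print("largest sample error after decoding:", worst_error)
```

Only the loud tone survives (bin 16 and its mirror image, bin 240), so far fewer numbers need to be stored. The decoded signal is no longer identical to the original, but the difference is limited to the quiet tone that was judged unimportant.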
In the case of JPG picture files, high compression may remove the subtle variations of skin tone or sky color, creating a somewhat blocky or “pixelated” result. High compression ratios for MP3s usually result in a loss of high frequency content and dynamics. As the compression ratio is under user control, the user can choose the compression best suited to the job and reap maximum benefits.

It must be remembered, though, that once a perceptual coder has done its work, there’s no going back from the resulting file. For this reason, file libraries often leave data in an uncompressed or modestly compressed form, using higher compression only as needed for specific applications. For example, a band might record a song using CD quality sampling for processing and production. The final work will be archived this way. The song will then be compressed, perhaps at various compression levels. The song can then be distributed in a more space efficient manner, yet the original remains intact.

The Internet

Many people see the Internet as a sort of vast interconnected library, complete with a messaging system. This offers up many possibilities. For example, once a band has recorded a song and then compressed it into MP3 format, it can be placed on a web server. People from all over the globe can download the song and play it using either their computer’s MP3 capabilities or those of a separate MP3 player. For users with a limited bandwidth Internet connection such as a dial-up modem, a lower fidelity version can be downloaded in similar times due to the higher compression ratio. Of course, the user may prefer the lower fidelity simply to get a larger number of songs onto their MP3 player. This technique also allows for live streaming audio to different users with differing bandwidths.
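Matching the stream to the listener's connection can be sketched as a simple selection rule: offer the same song at several bit rates and send the best one that fits. The bit rates and connection speeds below are hypothetical values chosen for illustration, as is the function name.

```python
# Hypothetical encodings of the same song, in kilobits per second.
STREAMS_KBPS = [32, 64, 128, 256]


def pick_stream(link_kbps, streams=STREAMS_KBPS):
    """Return the highest bit rate that fits the link, or None if none fits."""
    usable = [rate for rate in streams if rate <= link_kbps]
    return max(usable) if usable else None


print("56 kbps dial-up gets:", pick_stream(56))       # 32
print("1500 kbps broadband gets:", pick_stream(1500))  # 256
```

A real streaming server would also leave headroom for protocol overhead and for variations in the connection, which this sketch ignores.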
For example, users with cable modem or DSL may get near CD quality (modest compression) while the dial-up user gets the high compression, lower fidelity version (lower yes, but at least they get something).

Of course, there is also the possibility of users compressing copyrighted material and sharing it without consent of the owners of the material (normally the musicians). There has been considerable coverage of this topic in the media, most notably the Napster/RIAA fracas. This is not limited to the music industry, although this is the one area that does have the spotlight on it. It is worthwhile to remember that technology only dictates what is technically possible, not what is ethical. It does not follow that what can be done should be done [2]. Technology tends to be ethically inert. Ethics is a descriptor of human behavior. If something is taken from some party without that party’s consent (direct or implied), that legally constitutes theft. How easy it is to do, and the narrow likelihood of being caught, do not alter this fact.

[2] For example, it is technically possible to create and install a small subcutaneous implant for each citizen with characteristics that will uniquely identify individuals at various points in a city, but do we want to do this in a free and open society?