File Formats, the Internet, and More

Introduction
Ultimately, once a sound has been digitized and (optionally) processed, it will need to be either
stored or transmitted elsewhere. Just as there are tradeoffs between sound quality and the
specifics of quantization, there are tradeoffs between sonic quality and storage size or
transmission bandwidth.
File Formats
There are many file formats available for audio files including WAVE (WAV), AIFF, AU, SND,
VOC, MP3, and others. Perhaps the most widely used file formats (in terms of application
support anyway) are the venerable WAV and AIFF formats found on Windows and Macs,
respectively. These cover numerous variants, although the most widely used are mono or stereo
linear PCM at 8 or 16 bit resolution recorded at either 22.05 kHz or 44.1 kHz. There are many
other possibilities including floating point data, multiple channels beyond stereo, a variety of
sample rates, and possible data compression schemes, although these tend not to have nearly as
wide support. The file formats generally consist of a short header which includes important items
such as the sample rate, number of channels, and so on, followed by the data samples themselves
(interleaved if there are multiple channels; see footnote 1). Consequently, the size of the file in bytes is only modestly
larger than length_in_seconds * sample_rate * bytes_per_sample * number_of_channels, perhaps
by a hundred or so bytes. Recalling that CD quality audio requires 16 bit resolution at 44.1 kHz
sampling, each mono minute requires over 5 Megabytes of storage along with a streaming
bandwidth of over 700 k bits per second per channel. It is clear that large audio libraries take up
vast space in this form and require T-1 capacity for real time streaming of stereo material.
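The storage and bandwidth figures above follow directly from the size formula. The short sketch below reproduces the arithmetic; the function names are illustrative only, and the T-1 line rate of 1.544 Mbit/s is the standard figure rather than something stated in these notes.

```python
def pcm_size_bytes(seconds, sample_rate, bytes_per_sample, channels):
    """Size of the raw sample data in bytes, ignoring the short file header."""
    return seconds * sample_rate * bytes_per_sample * channels


def pcm_bit_rate(sample_rate, bits_per_sample, channels):
    """Streaming rate in bits per second for uncompressed linear PCM."""
    return sample_rate * bits_per_sample * channels


T1_BITS_PER_SECOND = 1_544_000  # standard T-1 line rate (assumed, not from the notes)

mono_minute = pcm_size_bytes(60, 44_100, 2, 1)   # one mono minute at CD quality
mono_rate = pcm_bit_rate(44_100, 16, 1)          # bits per second, per channel
stereo_rate = pcm_bit_rate(44_100, 16, 2)        # bits per second, stereo

print(f"One mono minute: {mono_minute:,} bytes")
print(f"Mono bit rate:   {mono_rate:,} bits/s")
print(f"Stereo bit rate: {stereo_rate:,} bits/s (T-1 carries {T1_BITS_PER_SECOND:,})")
```

The stereo rate comes to roughly 1.41 Mbit/s, which is why a T-1 line is about the minimum needed for real time streaming of uncompressed stereo material.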
In order to reduce disk usage and lower bandwidth requirements, some users opt to lower sound
quality by reducing the sampling rate or bit resolution. This is acceptable for some applications,
but few people would give up CD quality audio in favor of an 8 bit 22.05 kHz variant. One
possibility is to use a form of compression on the audio files, such as that used for “zipping”
email attachments. This form of compression is known as lossless because decompression
exactly reverses compression: You get exactly what you started with.
Lossless schemes work by finding redundancies in the file and replacing them with something
smaller. A good example can be made using a text file. Suppose that you compress a letter to a
friend. In this letter the character combination “ the ” occurs numerous times. That is a five byte
sequence. Suppose you replace every occurrence with a special number that lies outside the
normal range of ASCII character codes. This one byte takes the place of five, for a four byte
savings per occurrence. On decompression, the special character is expanded back into “ the ”.
There will, of course, need to be some sort of table to indicate what the replacement is, but for a
large file this overhead can be ignored.
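The letter example can be sketched in a few lines. This is only a toy illustration of the substitution idea, not how any real zip program works; the replacement byte 0x80 and the function names are assumptions chosen for the example.

```python
PATTERN = b" the "   # the five byte sequence to replace
TOKEN = b"\x80"      # one byte outside the 7-bit ASCII range (assumed choice)


def compress(text: bytes) -> bytes:
    """Replace every occurrence of PATTERN with the single-byte TOKEN."""
    if TOKEN in text:
        raise ValueError("input already contains the replacement byte")
    return text.replace(PATTERN, TOKEN)


def decompress(data: bytes) -> bytes:
    """Expand every TOKEN back into PATTERN."""
    return data.replace(TOKEN, PATTERN)


letter = b"I saw the dog chase the cat over the wall."
packed = compress(letter)

occurrences = letter.count(PATTERN)
saved = len(letter) - len(packed)
print(f"{occurrences} occurrences, {saved} bytes saved")
print("round trip exact:", decompress(packed) == letter)
```

Each replacement saves four bytes, and the round trip returns the original text exactly, which is what makes the scheme lossless.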
This technique works quite well with certain types of data. Text files can usually be compressed
to less than half of their normal size, and in some cases considerably less than that.
Unfortunately, the more variation there is in the data, the less effective the compression will be.
Musical waveforms generally do not contain much redundancy of the sort exhibited by text files.
Footnote 1: Interleaving makes for more efficient access from a hard drive if the sound is to be sent in real time to a
playback or transmission device.
ET163 Audio Technology Lecture Notes: File Formats, the Internet, and More
To illustrate this, the popular WinZip program was applied to a one second long CD quality
generated sine wave file. The original file was approximately 88 k bytes. The zipped file was
merely 85 k bytes, for a savings of just over 3%. In contrast, a similar length pulse (not
bandwidth limited) shrank from 88 k to less than 1 k byte. Unfortunately, real world music
signals are much closer to the former case than the latter. This form of compression is generally
of minimal use for audio.
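A rough version of this experiment can be repeated with Python's zlib module, which uses the same DEFLATE family of compression found in zip files. This is a sketch under stated assumptions, not a reproduction of the WinZip test: the sine frequency and amplitude are arbitrary choices, the data is raw PCM with no file header, and the exact byte counts will differ from those quoted above. The frequency is deliberately chosen so that the sample pattern does not repeat exactly; a sine whose period divides evenly into the sample rate would repeat and compress far better.

```python
import math
import struct
import zlib

SAMPLE_RATE = 44_100        # CD quality sampling rate in Hz
NUM_SAMPLES = SAMPLE_RATE   # one second of mono audio


def to_pcm16(samples):
    """Pack integers as little-endian signed 16 bit PCM (raw data, no header)."""
    return struct.pack(f"<{len(samples)}h", *samples)


# Assumed test signals: a sine near 622 Hz and a rectangular pulse.
freq = 440.0 * math.sqrt(2.0)
amplitude = 0.9 * 32767
sine = [round(amplitude * math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
        for n in range(NUM_SAMPLES)]
pulse = [32767 if NUM_SAMPLES // 4 <= n < 3 * NUM_SAMPLES // 4 else 0
         for n in range(NUM_SAMPLES)]

sine_raw, pulse_raw = to_pcm16(sine), to_pcm16(pulse)
sine_zip = zlib.compress(sine_raw, 9)    # level 9 = maximum compression
pulse_zip = zlib.compress(pulse_raw, 9)

for name, raw, packed in (("sine", sine_raw, sine_zip),
                          ("pulse", pulse_raw, pulse_zip)):
    saving = 100 * (1 - len(packed) / len(raw))
    print(f"{name}: {len(raw)} -> {len(packed)} bytes ({saving:.1f}% saved)")
```

The pulse consists of long runs of identical bytes, which is exactly the kind of redundancy this style of compression exploits; the sine offers almost none.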
Lossy Compression, Perceptual Coding, and MP3
In contrast to lossless or perfectly reversible compression is something called lossy compression.
With lossy compression, the decompressed version is not identical to the original. The idea is to
first analyze the file and determine which parts of it are significant. The not-so-significant parts
are thrown away, so there is less data to compress, which leads to better
compression ratios. The trick here is to determine what is and what is not significant. This is tied
directly to the information in the file. That is, the algorithm that determines significance will be
entirely different between audio files and picture files for example. The algorithm attempts to
model the human perception system in use. In the case of audio files, that means a psychoacoustic model of human hearing is needed. Simply put, the perceptual coder tosses away
anything that you can’t hear. This includes frequency components below the absolute hearing
threshold and components that are masked by louder, nearby frequency content. This simplified
data is then used for the new file. Because items have been thrown away, the decompressed
version is not identical to the original, although if done properly it should sound no different. If
the perceptual coder is adjusted so that it throws away “less important” content instead of
inaudible content, even greater compression can be achieved, although with a reduction in
quality. Perceptual coders are what make the high compression rates of JPG (JPEG) and MP3
(MPEG-1 Audio Layer III) files possible. In the case of JPG picture files, high compression may remove the
subtle variations of skin tone or sky color, creating a somewhat blocky or “pixelated” result.
High compression ratios for MP3s usually result in a loss of high frequency content and
dynamics. As the compression ratio is under user control, the user can choose the compression
best for the job and reap maximum benefits. It must be remembered though, that once a
perceptual coder has done its work, there’s no going back from the resulting file. For this reason,
file libraries often leave data in an uncompressed or modestly compressed form, using higher
compression only as needed for specific applications. For example, a band might record a song
using CD quality sampling for processing and production. The final work will be archived this
way. The song will then be compressed, perhaps at various compression levels. The song can
then be distributed in a more space efficient manner, yet the original remains intact.
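The core idea of lossy coding, discarding components judged insignificant and keeping the rest, can be illustrated with a toy example. This is not a perceptual coder: the fixed amplitude threshold below is a hypothetical stand-in for a psychoacoustic model, the two test tones are arbitrary, and real coders such as MP3 use far more elaborate analysis. The sketch only shows that once weak frequency components are dropped, the decoded signal is close to, but no longer identical to, the original.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (slow, but fine for a short block)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


N = 64
# A strong tone (amplitude 1.0) plus a much weaker one (amplitude 0.01).
signal = [math.sin(2 * math.pi * 4 * t / N) + 0.01 * math.sin(2 * math.pi * 9 * t / N)
          for t in range(N)]

# Hypothetical cut-off: drop any component whose amplitude is below 0.05.
# A sine of amplitude a shows up with magnitude a * N / 2 in its DFT bins.
THRESHOLD = 0.05 * N / 2

spectrum = dft(signal)
kept = [c if abs(c) >= THRESHOLD else 0 for c in spectrum]
decoded = idft(kept)

bins_kept = sum(1 for c in kept if c != 0)
max_error = max(abs(a - b) for a, b in zip(signal, decoded))
print(f"kept {bins_kept} of {N} frequency bins, max sample error {max_error:.4f}")
```

Only the bins belonging to the strong tone survive, so far less data needs to be stored, and the error that remains is the discarded weak tone. Raising the threshold discards more and increases the error, which mirrors the trade between compression ratio and quality.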
The Internet
Many people see the Internet as a sort of vast interconnected library, complete with a messaging
system. This offers up many possibilities. For example, once a band has recorded a song and then
compressed it into MP3 format, it can be placed on a web server. People from all over the globe
can download the song and play it using either their computer’s MP3 capabilities or those of a
separate MP3 player. For users with a limited bandwidth Internet connection such as a modem
dial-up, a lower fidelity version can be offered. Its higher compression ratio produces a smaller
file, so the download time stays comparable. Of course, the user may prefer the lower fidelity simply to get a larger number
of songs onto their MP3 player. This technique also allows for live streaming audio to different
users with differing bandwidths. For example, users with cable modem or DSL may get near CD
quality (modest compression) while the dial-up user gets the high compression lower fidelity
version (lower yes, but at least they get something).
Of course, there is also the possibility of users compressing copyrighted material and sharing it
without consent of the owners of the material (normally the musicians). There has been
considerable coverage of this topic in the media, most notably the Napster/RIAA fracas. This is
not limited to the music industry, although this is the one area that does have the spotlight on it.
It is worthwhile to remember that technology only dictates what is technically possible, not what
is ethical. It does not follow that what can be done should be done (see footnote 2). Technology tends to be
ethically inert. Ethics is a descriptor of human behavior. If something is taken from some party
without that party’s consent (direct or implied), that legally constitutes theft. How easy it is to do
and the slim likelihood of being caught do not alter this fact.
Footnote 2: For example, it is technically possible to create and install a small subcutaneous implant for each citizen
with characteristics that will uniquely identify individuals at various points in a city, but do we want to do
this in a free and open society?