Recording the Verdi Requiem in Surround Sound and High-Definition Video
David Griesinger
Harman Specialty Group – (till Nov. 21)
www.david.griesinger.com

The Task:
• To make a first-rate recording of the Verdi Requiem from a performance in a 1200 seat hall that was packed with people.
  – The natural reverberation was not useful.
  – The setup time was very limited.
  – There were limits to the number of microphones that could be used, and to their placement.
  – The microphones had to be largely invisible to the video camera.
• For all these reasons a “natural” microphone technique could not be used.
  – The purpose of this talk is in part to show that with a thoughtful capture of the direct sound, an entirely natural sound can be created,
  – even when this sound did not exist during the performance.

Audio Goals
• 1. Clear reproduction of direct sound (low muddiness)
  – The instruments and the chorus should not sound far-away
  – “Sonic Distance” should be relatively low
• 2. Convincing sense of depth
  – The chorus should sound behind the orchestra,
  – The soloists should sound in the middle of the orchestra – where they appear in the video.
  – The orchestra should sound behind the loudspeakers – not close to the listener.
• 3. High hall envelopment
  – The low frequencies especially should surround the listener
  – “Conductor’s perspective” draws the listener into the performance.
  – The hall sound should be LARGE – matching the scale of the piece.
• 4. No “sweet-spot” – sound should be excellent and nearly the same throughout the room.
  – This requires use of the center channel and a LOW degree of correlation between the channels at all frequencies.
  – Leakage and panning always reduce the listening area.

Envelopment
• The goals of high envelopment and a large sweet spot have similar requirements:
  – The correlation between output channels needs to be low at all frequencies.
    • Thus we should avoid panning signals between channels.
  – Closely-spaced microphone arrays nearly always produce correlated signals – panning is inherent in the acoustic pickup.
    • Reverberation is correlated, particularly at low frequencies.
  – A large sweet spot demands high separation:
    • Sounds produced on the left should NOT be reproduced by speakers on the right!
    • This is a frequent problem with so-called “main microphone arrays”

Example: Time delay panning outside the sweet spot.
Record the orchestra with a “Decca Tree” - three omni microphones separated by one meter. A source on the left will give three outputs identical in level and differing by time delay.
On playback, a listener on the far right will hear this instrument coming from the right loudspeaker. This listener will hear every instrument coming from the right.

Amplitude panning outside the sweet spot.
If you record with three widely spaced microphones, an instrument on the left will have high amplitude and time differences in the output signals.
A listener on the far right will hear the instrument on the left. Now the orchestra spreads out across the entire loudspeaker basis, even when the listener is not in the sweet spot.

Training to hear envelopment
• To test for envelopment it is essential that you move around the room, and that you face different directions!
• You must fill the WHOLE room with the sound of the original hall, and it must work when you face all directions.
• Reproducing the original hall only in front, or only in the rear, will not do the job.
• The ability to reproduce the original hall acoustics in a small space is one of the biggest advantages of 3/2 surround!

3/0 versus 3/2
• It is obvious that using three speakers in the front is better than two speakers, particularly if we use amplitude panning.
• Why do we need two additional speakers and channels for the rear, particularly if we are only reproducing reverberation?
Mono sounds poor because it does not reproduce the spatial properties of the original recording space.
With decorrelated reverberation a few spatial properties come through, but only if the listener faces forward. And the sense of space is stronger in the front.
We need at least four speakers to reproduce a two dimensional spatial sensation that is uniform through the room.

The Polyhymnia Pentangle
• The Polyhymnia engineers employ a surround array of spaced omni microphones, at a spacing similar to the ITU playback array.
• The technique works well in spaces where the reverberation radius is equal to or greater than the microphone spacing.
• In this case the direct sound picked up by the rear microphones is perceived as an early lateral reflection and adds distance to the front image.
• Caution!! In a small hall this array will be TOO MUDDY!!!

Video Goals
• 1. Minimalist Videography: No arbitrary selection of video image
  – If an instrument is playing, it should be visible on screen.
  – The viewer should be able to decide which performer to watch,
  – and can watch a different performer every time the video is seen.
• 2. Achieving the first goal requires sufficient resolution that each performer can be seen at the same time!
  – A 1280 by 720 pixel image is sufficient to convey the emotion from over 100 performers at the same time.
  – But if this goal is to be achieved with current video equipment great care must be used!
• 3. A screen size appropriate to the audio image.
  – The Verdi Requiem is a LARGE piece.
  – It needs a Large screen if the video is to be effective.
    • Ideally as large as the front loudspeaker basis.

Ground rules
• For this recording the hall and the musicians union had a number of requirements:
  – 1. There could be only ONE video recording.
  – 2. There could be only ONE camera position.
  – 3. There could be a maximum of 10 microphone lines from the ceiling.
  – 4. There were only 2 hours available for setup, both for the audio and the video.
    • At the last minute, my assistant could not come.

Concert Hall
Note the large, reverberant stage house in Jordan Hall, Boston

Stage house reverberation
• Reverberation in Jordan Hall (when fully occupied) is dominated by the stage house.
• Reverberation radius (the distance at which reverberation and direct sound are equal) is under 3 meters.
• Microphones MUST be placed close to performers or the sound is muddy!!!
• Directional microphones are necessary, and hypercardioid or supercardioid are helpful.

Why use a Main Microphone?
• Most engineers are taught that for classical music they MUST use a main microphone.
• It is a PRIMARY rule in both science and art to always ask WHY such a technique contributes to the artistic goals …
  – What is the artistic or psychoacoustic reason to employ this device?
  – For a small group – where the microphone is close to the musicians compared to the reverberation radius – a main array can give good results.
  – But it is also required that the natural acoustics are appropriate for the piece being performed.
• When these conditions are not met… We must do something more effective!

Main Microphone - NOT
• When the critical distance (hall radius) is smaller than the microphone-to-source distance a main microphone does more harm than good!
  – It is NOT possible to record the direct sound from a large group with a single microphone array!
  – Main microphone arrays typically use omnidirectional microphones
    • Omnis are only beneficial when the reverberation is both low in level and beautiful.
    • This is almost never true when a large group occupies a small hall.
    • Omni microphone arrays cause the low frequencies to be monaural unless they are widely spaced.
    • Monaural bass is anathema to good sound.
• Using such an array would waste precious microphone lines.
• In this case we need to record the direct sound as best we can, and use technology to give the sound both depth and reverberation.

Microphones: Directional Microphones only
4x – Schoeps CCM 40 Cardioid
4x – Schoeps CCM 41 Supercardioid (note the simple cable adaptor)
2x – Neumann KMF-4 Cardioid with stand adaptor
2x – Schoeps Colette Cardioid/Omni – set to Cardioid
Notice all microphones are small – easily concealed from the video camera.

Microphone placement
All microphones were on stage – none in the audience. All were hanging except the soloist microphones, which used the Neumann stand adaptor on a short stand.
Where possible the microphones point toward the audience.

Audio Equipment
• The author believes in the validity of the sampling theorem:
  – Which states that 44.1kHz is an adequate sampling rate to record 20kHz.
• The author also believes that frequencies above 20kHz have NO musical importance.
  – Please try to prove me wrong…
• Mixing 12 or 16 tracks together to create 5 channels does NOT decrease the signal to noise ratio of the final product.
  – If we want a final product with a 16 bit S/N, we can achieve this result with a 16 bit multitrack recording – if the original tracks are correctly recorded and mixed.
• So – with no apologies, the author’s ancient Yamaha O3D and two Tascam machines were used for the recording.
  – These machines are reliable and quick to set up

Mixing setup
The original 12 tracks were played on two Tascam machines, mixed with the O3D, and recorded on a third Tascam.
Reverb used a Lexicon 480L for early reflections, and a Lexicon MC-12 “Live” program for the main hall.

Mixing
• Mix was done in real-time, using punch-in on the recording Tascam
  – Synchronized punch-in allows for correction of the mix
  – You can re-do individual sections until perfect
• Monitoring was on Infinity Prelude MTS speakers, with a Revel center speaker.
• The sound of these speakers in this room is fabulous.
• After mixing, the sound was transferred digitally to a computer, where some level adjustments and equalization were done.
  – Most wind noise was removed by careful filtering at this stage.
• When the pitch of a soloist needed correction, a separate mix was made of all the microphones without the soloist, with the soloist solo on a separate channel.
  – The pitch was then corrected in the computer, and the soloist was replaced into the mix, while adding pitch-corrected early reflections.

Mixing Goals
• A mix has THREE basic elements:
  – The direction and balance of the Direct Sound
  – The perception of distance or depth in the sound image
  – The perception of the surrounding hall
• All three must be correct to make a great recording
  – And all three can be separately adjusted by mixing.

Direct Sound
• Good directional localization over a large listening area requires good separation between channels.
  – We want to avoid leakage between microphones, and we want to avoid pan pots where possible.
    • In this case the orchestra microphones were aimed toward the audience to avoid leakage from the chorus.
    • This also avoided pickup of the nasty stage house reverberation.
  – For this recording there were not enough microphone lines to use dedicated center channel microphones
    • So left-center and center-right panning was used for the soloists and the center microphone pairs on the orchestra and chorus.
  – Otherwise microphones were mixed into only a single channel.
    • The two microphones at the outside front of the orchestra were directed to the surround channels

Center Channel
• All front microphones are panned center/left or center/right.
  – No phantom image
  – The center channel is vital to achieving a large listening area
  – The center chorus microphone pair (CCM 41) and the soloist microphones are panned in this way.
• The result is an even spread of the chorus from left front to right front
• The soloists are clearly localized half-way between the center and the left or right loudspeakers.
• Beware a bug in Sony Vegas!
  – The center surround channel is mixed equally to left and right front as well as the center!
    • A second bug causes clipping unless all channels are reduced in level.
    • To fix both bugs I set the center channel to 0dB into the center output, and all others to -6dB. The center is then added with negative phase into the left and right front channels at a level of -6dB. This cancels the cross-talk.

Surround Channels
• To give added excitement and envelopment a “conductor’s perspective” was used.
  – The outer orchestra microphones (CCM 40) are directed to rear left and rear right.
  – The orchestra then surrounds the listener, with the chorus in the front only.
  – The woodwind microphones are panned to the front.
  – This arrangement is particularly effective during the “Tuba mirum” section where the offstage and onstage trumpets sound all around the listener.
    • This passage is the trumpet call that announces the end of the world and the beginning of the last judgment.
    • The recording invokes this emotion very well – and the video strongly enhances the emotional power.

Depth
• Depth perception in a recording comes from early reflections – both in the medial direction (mono) and in other directions.
  – In recording it is always preferable to use reflections from other directions, as these add depth without muddiness.
  – Ideally we want the early reflection field to be uniform through the room
    • Then the depth perception will be equal and natural for all the listeners.
    • Thus we want similar reflection amplitude in all the outside speakers.
• Some commercial equipment generates this pattern by default.
  – Use it! If you don’t have such a device, cobble it together from what you have!
  – You can use a pair of echo sends to separately control the perceived depth of each element of the mix.

Early reflections and muddiness
• Early reflections that come from a different direction from the direct sound add depth and perspective.
  – They can also add muddiness if there is too much
• Early reflections that come from the same direction cause muddiness.
  – Leakage – for example of the chorus into the orchestra mikes – adds muddiness because the leakage is identical to an early reflection from the same direction.
  – Thus we use close-miking to reduce the reflected energy in each microphone.
    • And try to reduce leakage by microphone orientation and placement.
  – We add early reflections electronically into the outer loudspeakers.
• In practice, we control the depth perspective by adding early reflections using the echo sends (in stereo) to the 480L running “large surround” with the reverb level off and the early reflection level at maximum.
  – The returns are routed to front L&R and to rear L&R
  – The center channel is unused for reverberation and early reflections because it only adds muddiness.

Muddiness: Dry Speech + 40ms reflections
Mono speech: The sound is clear, but much too close to the loudspeaker.
Speech with ~40ms allpass reflections and no direct sound (mono and stereo versions):
Note both the mono and the stereo version sound muddy and distant. There is no phantom image in the stereo version.

Reflections used in these experiments
The reflections used in these experiments form a decaying burst which peaks about 25ms after the direct sound, and has largely decayed away by 50ms.
The reflections are different in the two channels, and have a flat frequency response.

Depth without Muddiness
• Dry speech
  – Note the sound is uncomfortably close
• Mix of dry with early reflections at -5dB.
  – The mix has distance (depth), and is not muddy!
  – Note there is no apparent reverberation, just depth.
• Same but with the reflections delayed 20ms at -5dB.
  – Note also that with the additional delay the reflections begin to be heard as discrete echoes.
    • But the apparent distance remains the same.
• Same but with the reflections delayed 50ms at -3dB
  – Now the sound is becoming garbled. These reflections are undesirable!
  – If the speech were faster it would be difficult to understand.
• Same but with reflections delayed 150ms at -12dB
  – I also added a few reflections between 20 and 80ms at a level of -8dB to smooth the decay.
  – Note the strong hall sense, and the lack of muddiness.

Demo Depth in the Mix
• Solo mikes alone
• Solo mikes with leakage
• Solo mikes with early reflections added
• Full mix

Hall and Envelopment
• Hall reflections (late reverberation) also need to come equally from all directions in the mix.
  – Ideally the reverberation level and decay profile should be the same in all the outer speakers.
  – Once again – this type of reverberation output is available in some commercial equipment by default.
  – In a good hall, such a reverberation pattern is available from the “Polyhymnia Pentangle” or the “Hamasaki Square”
• Demo Decorrelated bass

The Ideal Reverberation
  – has 20ms to 50ms reflections with a total energy -4dB to -6dB
  – has relatively little energy from 50 to 150ms.

Measured Early reflection amplitude
Impulse response of the direct sound and early reflections as generated by the Lexicon equipment. Impulse recorded during the mix by sending a pulse into the right soloist microphone channel.
Note the early reflections appear to be at a very low level. This appearance is misleading.
If we integrate this picture with a 22.5ms window (which is how the ear hears it) we see the direct sound dominates the early reflections, but not by much.
Experiments show the ideal level for the total energy in the early reflections is -6dB to -4dB. We see the levels used here are close to this ideal.

Hall Reverberation
• The hall reverberation should be primarily LATE reverberation.
• The ideal reverberation has high values of very early reflected energy, followed by a strong late decay.
• Using the 480L, a “spread” value over 100 is recommended.
• The MC-12 “LIVE” program sounds better. I used a “shape” value of 3, and a “spread” value of over 100. The “size” was set to 32, and the RT was 1.9 seconds.
• Demo – best to hear it!!

Audio Secrets 1
Directional microphones roll off the low frequency response predictably.
The response of each microphone is measured at a distance of ~3 meters.
Each microphone is equalized using the measured data at the time of the recording.
This curve is for the CCM 40.
The result is excellent bass response with directional microphones.

Audio Secrets 2
This is the equalization applied to the Hall reverb return for the rear channels.
Note that the low frequencies are boosted below about 150Hz, and the high frequencies are reduced above 4kHz.
This equalization keeps envelopment high, while preventing localization of the reverb to the rear.
The reverb return to the front left and right boosts the bass, but the treble is flat.

Audio Secrets 3
Typical loudspeaker response rolls off the treble. Since the microphones are flat, it is useful to boost the high frequencies in the front channels.
A small amount of bass boost is also added – beyond what is needed to correct for the directional microphones.

Video
• Western classical music consists of many musical lines of equal importance.
• Conventional video assumes the viewer has a tiny, low resolution video screen, and a short attention span.
  – The result is brief close-up pictures of a single violin, alternating with the conductor’s nose, or the tongue of an opera singer.
• We take another approach:
  – What resolution is needed to convey all the musical lines?
  – What screen size?
  – We assume the viewer is interested in all the music, and may want to view the performance several times.
  – We need to see all the performers, all the time, and let the viewer decide which performers to watch.
• Demo – PCP Shostakovich. Four performers on stage, DVD quality.
• With 100+ people on stage, we need HIGH DEFINITION

High Definition
• US HDTV broadcast is “1080i”
  – Typically this means a 16x9 picture with 1440 horizontal pixels, and 1080 vertical pixels.
    • This implies a rectangular pixel, with greater resolution vertically than horizontally.
  – The horizontal lines are interlaced, with 540 lines per field, and 60 fields/second.
  – Interlaced fields only work well with CRT projectors, and not with any digital display.
    • All digital displays must “deinterlace” the picture.
• The most common digital display format is “720p”
  – 720p has 1280 horizontal pixels by 720 vertical pixels, using a square pixel for a 16x9 picture.
• HD cameras come in two types – 1080i, and 720p. In practice both yield about the same resolution.

Resolution
• Resolution – the number of lines a camera (or display) can reproduce – depends on intrinsic resolution and on contrast. Some factors are:
  – The sharpness of the lens and the accuracy of focus
  – The number of lines in the sensor
  – The bandwidth of the video readout circuits
  – The method of video compression
  – The noise level in the sensor
  – How the sensor data is read out.
• All these factors affect resolution and contrast!

HD professional vs HDV consumer
• Professional HD cameras use three sensor chips, typically 2/3” wide.
  – These chips are expensive, and require large, relatively expensive lenses.
  – The advantage is that a large lens gathers a lot of light. Each pixel in the sensor gets a healthy number of photons.
    • The result is low video noise.
• HDV (consumer) cameras use 3/8” sensors.
  – The lenses are smaller, lighter, and less expensive.
  – But the video noise is higher.
  – The more pixels on a chip (for high resolution) the higher the noise and the lower the effective film speed.
  – HDV cameras attempt to overcome video noise by delivering lower resolution than they claim.
  – HDV cameras use MPEG video compression in the camera to allow storing an HD image on standard DV cassettes – but this degrades resolution.

Sony HVR-Z1U
• The “professional” Sony HDV camera claims to be 1080i.
  – The sensor is 920 pixels horizontal by 1080 vertical, with one sensor for each of 3 colors.
  – The green sensor is offset from the other two by ½ a horizontal pixel, giving a theoretical maximum resolution on a black and white image of 1440x1080.
    • But the edge contrast is poor.
• To reduce video noise, adjacent vertical pixels are averaged together to form each field.
  – The result is low edge contrast in the vertical direction.
• MPEG compression further reduces resolution and contrast.
• The electronic image stabilization “steady shot” reduces the theoretical resolution to ¼ of the available pixels.
  – But you can (and must) turn it off.
• In practice, the resolution is about 1200 pixels horizontally by 800 vertically – similar to 720p – but the edge contrast is very low.

Deinterlace
• The Sony camera delivers a low-contrast interlaced image.
  – Viewing the image without deinterlace is quite unpleasant.
Note the “jaggies” on the conductor’s hands.
Most digital displays deinterlace by blending fields, reducing the vertical resolution to 540 pixels at best.
For best results, we must use “smart” deinterlacing, which only blends pixels that are different in each field.

Smart Deinterlace
• The Sony Vegas editor does not include smart de-interlace. But Mike Crash in Czechoslovakia has written a nice one (free).
Here is the same picture de-interlaced by Mike Crash. The picture has also been sharpened by two unsharp mask plug-ins.
Some increase in video noise is visible – this is less problematic when the image moves.

Sharpening
• The low edge contrast in the Sony camera can be improved by using two “unsharp mask” plug-ins – at 1440x1080 pixels.
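
As an illustration of what an unsharp mask does, here is a minimal sketch in Python with NumPy. It is not the Vegas plug-in: the 3x3 box blur, the function names, and the clipping to a 0..1 range are assumptions made for the example. Only the idea of running two masks in series, with amounts 1 and 0.5, comes from the talk.

```python
import numpy as np

def box_blur3(img):
    """3x3 mean blur with replicated edges (an assumed stand-in for the plug-in's blur)."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    acc = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / 9.0

def unsharp_mask(img, amount):
    """Add back 'amount' times the difference between the image and its blur."""
    blurred = box_blur3(img)
    return np.clip(img + amount * (img - blurred), 0.0, 1.0)

def sharpen_two_pass(img, amounts=(1.0, 0.5)):
    """Apply two unsharp masks in series, as described for this recording."""
    out = np.asarray(img, dtype=np.float64)
    for amount in amounts:
        out = unsharp_mask(out, amount)
    return out

# Hypothetical test image: a soft vertical edge, luminance in the range 0..1.
soft_edge = np.tile(np.array([0.2, 0.2, 0.2, 0.4, 0.6, 0.8, 0.8, 0.8]), (6, 1))
sharpened = sharpen_two_pass(soft_edge)
print("max step before:", np.abs(np.diff(soft_edge, axis=1)).max())
print("max step after: ", np.abs(np.diff(sharpened, axis=1)).max())
```

Flat areas pass through unchanged because the image and its blur are identical there; only pixels near an edge are pushed apart, which is why the mask raises edge contrast and also raises visible noise.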
The right side of this image is deinterlaced and sharpened, the left side is direct from the camera.
I use two masks in series, the first with amount 1, the second with amount 0.5.

Test Patterns
[Figure: two test-pattern panels, each marked at 500 lines and 800 lines horizontal]
Raw data from camera expanded 2x – note the pixels are not square.
Same data after sharpening with two unsharp masks, amount 1 and amount 0.5.
Notice the excess sharpening at 500 lines.

The sharpening takes time…
• The cost of sharpening the video is computer time.
• The unsharp mask works in two dimensions, and increases the contrast between adjacent pixels, one pixel at a time.
• When combined with smart deinterlace, the calculations can take more than two seconds per frame.
• It takes about a week of computer time to sharpen the Verdi Requiem on a 3GHz Pentium 4.
• Sony Vegas must be set for the native camera resolution of 1440x1080 for the sharpening operation.

Scaling
• USA HDTV and the Sony camera are 1080i
  – This uses a rectangular pixel, 1.33x1
  – And has 1440 pixels horizontal, 1080 vertical
• Most displays use a square pixel, and have lower resolution.
  – Most of the best current displays are 1280x720, with a square pixel.
• Thus the image must be scaled before it can be displayed.
• Scaling is similar to a sample rate conversion.
  – The sample rate is the number of lines.

Sample rate conversion
• We know how to sample rate convert audio:
  – Up-sample to a multiple of the current sample frequency
  – Low pass filter to smoothly fill in the missing samples
  – Interpolate to a multiple of the new sample rate
  – Low pass filter and down-sample.
• This process is far too complex to use for video.
  – Standard video scaling introduces many artifacts

Scaling artifacts
[Figure: two panels]
Down-sample the sharpened data from 1080i to 720p.
The same down-sampling, but adding a single unsharp mask.
Notice that the loss of resolution and edge contrast is partly or mostly restored by adding a single minimum radius unsharp mask.
The scalers in displays (and in Sony Vegas) do not do this. We have to do it in a second pass during the rendering process. This adds more computer time!

Screen size
• A high resolution image is of little use if the screen is small and far away.
  – Even if the viewer can perceive the detail, the emotional power may be lost.
• For audio/video a minimum screen size fills the distance between the front loudspeakers, or ±30 degrees.
  – Research by Kimio Hamasaki at NHK shows that larger screens can be even better.
• Alas – such a large screen may be uncomfortable to watch with standard DVDs.
  – Current cinematography assumes a smaller screen.
• It is not clear how to resolve this dilemma.
  – But with larger screens at lower prices, we may see a shift in how movies are made and viewed.

Conclusions
• It is possible to make videos of music performances that re-create the excitement and involvement of a live performance.
  – The sonic goal is to capture the direct sound of each instrument clearly and with low leakage between sections.
  – Video with minimalist cinematography can be very effective when the resolution is sufficient to capture the emotions of the performers.
  – For a string quartet DVD quality video is adequate
    • But for large forces much higher resolution – and larger screen sizes – are needed.
  – Current technology is just barely able to do the job – but care must be taken with every step – including:
    • Cinematography
    • Sharpening
    • Scaling
    • Projection
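
Returning to the screen-size guideline (a screen that fills ±30 degrees, the same angle as the front loudspeaker basis), the arithmetic can be sketched as follows. This is only an illustration: the function names and the 3 meter viewing distance are hypothetical, and the height assumes the 16x9 picture discussed earlier.

```python
import math

def min_screen_width(viewing_distance_m, half_angle_deg=30.0):
    """Width of a flat screen that subtends +/- half_angle_deg at the viewer."""
    return 2.0 * viewing_distance_m * math.tan(math.radians(half_angle_deg))

def height_16x9(width_m):
    """Height of a 16x9 picture of the given width."""
    return width_m * 9.0 / 16.0

# Hypothetical example: a viewer seated 3 meters from the screen.
distance = 3.0
width = min_screen_width(distance)
print(f"width  = {width:.2f} m")
print(f"height = {height_16x9(width):.2f} m")
```

The width works out to roughly 1.15 times the viewing distance, which shows why the guideline calls for a projection screen and not a typical television.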