4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), April 2003, Nara, Japan

FIRST-ORDER DIFFERENTIAL BEAMFORMING AND JOINT-PROCESS ESTIMATION FOR SPATIAL SOURCE SEPARATION

P. Gómez, V. Nieto, A. Álvarez, R. Martínez, F. Rodríguez, F. Díaz, V. Rodellar

Departamento de Arquitectura y Tecnología de Sistemas Informáticos, Universidad Politécnica de Madrid, Campus de Montegancedo, s/n, 28660, Boadilla del Monte, Madrid, tel.: +34.91.336.7384, fax: +34.91.336.6601, e-mail: pedro@pino.datsi.fi.upm.es

ABSTRACT

Speech enhancement is required to ensure the success of speech recognition systems working under strongly noisy conditions, and to ensure intelligibility in speech transmission and coding. Array beamforming has traditionally been used to improve the signal-to-noise ratio. Two-sensor systems based on First-Order Differential Beamformers (FODB) have been proposed as a promising alternative [G. Elko, 1996]. Nevertheless, null beamformers alone do not provide sufficient separation levels. In this paper FODBs and Joint-Process Estimators (JPEs) are combined to achieve speech source separation. Results for superposed sinusoidal sources are presented.

1. INTRODUCTION

A First-Order Differential Beamformer (FODB) is a structure using two microphones and a combination of their signals so as to produce a null at a given Direction of Arrival (DOA) $\varphi_i$, as given in:

$$F(\varphi = \varphi_i) = 1 - \delta(\varphi_i); \quad -\pi \le \varphi_i \le \pi \tag{1}$$

$\varphi_i$ being the angular DOA where source $s_i$ is located, and $\delta$ Dirac's delta function. This structure may be used as shown in Figure 1 to provide source separation according to the respective DOA of the impinging source.

Figure 1. Simplified structure of the Source Separator (SS), $x_1$ and $x_2$ being the outputs of both microphones $m_1$ and $m_2$, $x$ the output of the equivalent cardioid microphone, $y$ the beamformer output, and $\varepsilon$ the estimation of the detected source.

In Figure 1 the parameter $\beta$ is the steering factor of the FODB controlling the DOA (see [1], [3], [4] for further details) (see Figure 2). The output of the filter will be defined in general as:

$$y = x F(\varphi_i) \tag{2}$$

$x$ being the equivalent input to the FODB, which may be evaluated from the signals arriving at each sensor. Therefore, the output of the FODB will contain information coming from any DOA except $\varphi_i$.

Figure 2. Source Composition model, $s_i$ and $s_j$ are two (real) sources, $s_{ik}$ and $s_{jm}$ being the respective multiple-path arrivals (apparent sources) corresponding to each real source.

The sources in Figure 2 are divided into primary (real sources), such as $s_i$ associated with $\varphi_i$ or $s_j$ at $\varphi_j$, and secondary (apparent sources: multiple-path arrivals, reverberations), such as $s_{ik}$ or $s_{jm}$. With this in mind the following hypotheses will be established:

• Sources are mutually independent in a statistical sense (orthogonal to correlation).
• Reverberations are dependent on their corresponding sources within a given time lag.
• Reverberations corresponding to one source are independent from those corresponding to another.

From these assumptions the following definitions will be introduced:

• Let $S$ be the set of all sources (real or apparent) inducing signal on both microphones $m_1$ and $m_2$, defined by the pair $(\varphi_j, s_{j,n})$: $s_{j,n} \in \Re^n : \varphi_j \in [-\pi, \pi]$.
• Let $S_{id}$ be the set of sources (real or apparent) dependent on the given source $s_i$:

$$S_{id} = \{ s_j \in S : E\{ s_{i,n} s_{j,n+k} \} \ne 0; \ \forall k \in Z \} \tag{3}$$

• Let $S_{io}$ be the set of sources (real or apparent) independent of the given source $s_i$:

$$S_{io} = \{ s_j \in S : E\{ s_{i,n} s_{j,n+k} \} = 0; \ \forall k \in Z \} \tag{4}$$

• Let $x_i$ be the component of $x$ contributed by the source being aimed at, $s_i$ at $\varphi_i$.
• Let $x_i^d$ be the component of $x$ contributed by $S_{id}$, or dependent component:

$$x_i^d = \sum_{\forall s_j \in S_{id}} h(s_j) \tag{5}$$

• Let $x_i^o$ be the component of $x$ contributed by $S_{io}$, or orthogonal component:

$$x_i^o = \sum_{\forall s_j \in S_{io}} h(s_j) \tag{6}$$

where $h(s_j)$ is assumed to be a linear function explaining the influence of the propagation media, sensor transfer function, and pre-processing stages on the incoming sound. As a consequence of the above, it will be assumed that the following properties hold:

$$E\{ x_{i,n} x^o_{i,n+k} \} = 0; \ \forall k \in Z \ \Rightarrow \ x_{i,n} \perp x^o_{i,n+k} \tag{7}$$

$$E\{ x^d_{i,n} x^o_{i,n+k} \} = 0; \ \forall k \in Z \ \Rightarrow \ x^d_{i,n} \perp x^o_{i,n+k} \tag{8}$$

This means that the input signal of the FODB may be split into two parts, mutually independent within a time span, these being $x_i + x_i^d$ (contributions associated with source $s_i$, direct and multiple-path) and $x_i^o$ (contributions from other sources, direct and multiple-path).

Source in $\varphi_i$    $x$                                        $y$                            $\varepsilon = x - y$
none                      $x^d_{i,n} + x^o_{i,n}$                    $x^d_{i,n} + x^o_{i,n}$        $\to 0$
$s_i$                     $x_{i,n} + x^d_{i,n} + x^o_{i,n}$          $x^d_{i,n} + x^o_{i,n}$        $\to x_{i,n}$

The situation reflected by the table above, when the FODB is aimed at a certain angular DOA given by $\varphi_i$, comprises two possible cases. If a source is present at that DOA the output of the subtractor will be non-null, and could be estimated as $\hat{x}_i$. If only reverberation or multiple-path contributions from other sources are present, the output of the subtractor will be much weaker. Estimators of second- and higher-order statistics of $x_i$ may give hints on where the sources come from.

2. SOURCE SEPARATION METHODOLOGY

Source separation cannot be accurately implemented by simple subtraction, as there are channel inequalities and delays present in both $x_1$ and $x_2$, rendering it impossible to simply subtract one trace from another. Instead, a more accurate procedure based on the method of projections between signals by means of joint-process estimation is used, as shown in Figure 3.a.

Figure 3. a) JPE used. b) Whole structure implementing source separation. [Block labels in b): FODB, order-K Lattice Filter, Ladder Filter, DOA Detection & FODB Steering.]

A joint-process estimator (JPE) may be seen as a system projecting an input signal $s$ on a reference signal $r$, producing an output which is the estimation of $s$ on $r$ given by:

$$\hat{s} = \Im_K\{s, r\} \tag{9}$$

and an estimation error, given by:

$$e = s - \hat{s} = s - \Im_K\{s, r\} \tag{10}$$

where $\Im_K\{*,*\}$ is the linear operator representing the projection performed by the JPE, implemented as an adaptive filter as shown in Figure 4.

Figure 4. Detailed structure of the Lattice-Ladder filter showing the general architecture (top: structure of the Gradient Lattice-Ladder Equalizer) and the data flow diagram of each stage (bottom: structure of a given Lattice stage).

It is well known that when the operator $\Im_K\{*,*\}$ has been optimally adapted the norm of the estimation error will be minimum in a least squares sense [5]:

$$\| s - \Im_o\{s, r\} \| = \min \| s - \Im\{s, r\} \| \tag{11}$$

In what follows, it will be assumed that the JPE has been brought to this condition, under which the following orthogonalization properties hold:

$$E\{ e_n \hat{s}_{n+k} \} = 0; \ 0 \le k \le K \ \Rightarrow \ e \perp \hat{s} \tag{12}$$

$$E\{ e_n r_{n+k} \} = 0; \ 0 \le k \le K \ \Rightarrow \ e \perp r \tag{13}$$

where $K$ is the order of the adaptive filter, which will be used to extract $x_i$ by subtracting $y$ from $x$. This situation is described in Figure 5 below.

Figure 5. Under optimal conditions the error signal $e_n^o$ is rendered orthogonal to the estimation of the input signal $s_n$ defined by the plane $M_2$.

With this in mind the JPE will recover the common components between the reference $y_n$ and the input $x_n$ signals, and produce an error which will be the uncorrelated (or complementary) part between $x$ and $y$, for which the following associations are established:

$$s = x_n \tag{14}$$

$$r = y_n \tag{15}$$

$$\hat{s} = x_n^d = \Im\{x_n, y_n\} \tag{16}$$

$$e = x_n^o = \begin{cases} \hat{x}_{i,n}; & \varphi = \varphi_{s_i} \\ \to 0; & \text{other DOAs} \end{cases} \tag{17}$$

This set of relationships is implemented by the structure given in Figure 3.b. The lattice-ladder filter algorithm [7] supporting the structure given in Figure 4 will be used.

3. APPARENT AND COMPLEMENTARY SOURCES

Assuming that the process of evaluating $\hat{x}_{i,n}$ (estimator of the source contribution $x_{i,n}$) is accurate enough, a further step could be taken towards evaluating also $\hat{x}^d_{i,n}$ and $\hat{x}^o_{i,n}$ (respective estimators of $x^d_{i,n}$ and $x^o_{i,n}$). For this, properties (7) and (8) are recalled, exploiting the JPE again. In this case, as $x^d_{i,n}$ and $x^o_{i,n}$ are components of $y_n$, they may be evaluated by projecting $y_n$ on the source estimate $\hat{x}_{i,n}$ used as reference, i.e.:

$$s = y_n \tag{18}$$

$$r = \hat{x}_{i,n} = x_n^o \tag{19}$$

$$\hat{s} = x^d_{i,n} = \Im\{y_n, x_n^o\} \tag{20}$$

$$e = x^o_{i,n} = \begin{cases} \hat{x}_{j,n}; & \varphi = \varphi_{s_j}; \ \forall j \ne i \\ \to 0; & \text{other DOAs} \end{cases} \tag{21}$$

for which the orthogonalization properties of the JPE are to be exploited. A very delicate issue is that of the JPE order. It has been implied that the orders of the JPEs used to estimate $\hat{x}_{i,n}$ ($K_1$) and $\hat{x}_{j,n}$ ($K_2$) meet the following condition:

$$K_1 \ll K_2 \ll N$$

where $N$ is the size of the signal frame. Practical values may be $K_1 = 8$, $K_2 = 32$ for $N = 512$.

4. RESULTS AND DISCUSSION

To check the practical viability of the described methodology, a situation was simulated in which three sinusoids of equal amplitude and frequencies of 0.5 kHz, 1 kHz and 2 kHz arrive from -12.25º, 0º and +12.25º respectively (far field), the pairing being the one shown in the captions of Figures 12 to 17. The sampling frequency was assumed to be 11,025 Hz. The spectral density of the resulting composition at the FODB input ($x_n$) is given in Figure 6.

Figure 6. Power Spectral Density of the input equivalent signal to the FODB: $x_n$.

This signal is processed by the FODB, generating an output $y_n$, which may be seen distributed over the whole angular span in Figure 7.

Figure 7. Angular span of the FODB output signal $y_n$, where the steering factor has been operated in 101 channels over the angular range.

Figure 8. Angular span of $x_n^d$ (JPE output dependent on $y_n$).

Figure 9. Angular span of $x_n^o$ (JPE output orthogonal to $y_n$).

Figure 11. Angular span of $x^o_{i,n}$ (part of $y_n$ orthogonal to $\hat{x}_{i,n}$).

It may be noted that the signal in Figure 10 ($x^d_{i,n}$) may be composed of multiple-path contributions, or of the portion of the contribution from source $s_i$ which has not been removed from $y_n$ by the FODB. The latter occurs because the real behavior of the system differs from the ideal model implied in (2): the effective bandwidth of the notch is not null, as it should be in the ideal case. In turn, the signal in Figure 11 ($x^o_{i,n}$) may be considered as the set of complementary sources for each DOA at which the FODB has been aimed.

The problem to face now is the following: considering that all the signals present in the problem are available in the angular spans of $\hat{x}_{i,n}$, $x^d_{i,n}$ and $x^o_{i,n}$, specific DOAs have to be determined to select the true output signals and so give effective solutions to the source separation problem. Several statistics may be used for this purpose, one of them being the energy distribution of $\hat{x}_{i,n}$.
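As a minimal sketch of the energy statistic just mentioned, the following Python fragment computes the energy of each steering channel over an angular span and picks the strongest one. The function name `angular_energy_db`, the array layout (one row per steering channel) and the synthetic test data are assumptions made for illustration only; they are not the simulation reported in this paper.

```python
import numpy as np

def angular_energy_db(x_hat):
    """Energy in dB of each steering channel.

    x_hat is assumed to have shape (channels, N): one estimated
    source signal of N samples per steered DOA.
    """
    energy = np.sum(np.abs(x_hat) ** 2, axis=1)
    # A small floor avoids log10(0) on silent channels.
    return 10.0 * np.log10(energy + 1e-12)

# Hypothetical angular span: 101 channels, frames of N = 512 samples.
channels, N, fs = 101, 512, 11025.0
n = np.arange(N)
rng = np.random.default_rng(0)

# Weak background everywhere, one sinusoid placed in an arbitrary channel.
x_hat = 1e-3 * rng.standard_normal((channels, N))
x_hat[34] += np.sin(2.0 * np.pi * 500.0 / fs * n)

profile = angular_energy_db(x_hat)
strongest = int(np.argmax(profile))
print("channel with maximum energy:", strongest)
```

The energy profile is a coarse indicator: it only locates the channels where the estimate is strong and says nothing about how sharp the corresponding peak is.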
Other possible candidates are the Cumulative Logarithmic Angular Distribution (CLAD) of $\hat{x}_{i,n}$ or $x^o_{i,n}$, this last one being defined as:

$$C_{x^o_{i,n}}(\varphi_i) = \int_{\omega_1}^{\omega_2} \log_{10} \left| X_i^o(\varphi_i, \omega) \right| d\omega \tag{22}$$

where $\omega_1$ and $\omega_2$ are the limits of the frequency span considered, and:

$$X_i^o(\varphi_i, \omega) = \frac{1}{N} \sum_{n=0}^{N-1} x^o_{i,n} e^{-j\omega n} \tag{23}$$

The CLAD may be seen as the overlapping of the angular profiles of the output signals. The minima of this function mark those DOAs from which energy has been removed, and are therefore possible indicators of the presence of real sources:

$$\varphi_{im} = \arg\min \left\{ C_{x^o_{i,n}}(\varphi_i) \right\} \tag{24}$$

Figure 10. Angular span of $x^d_{i,n}$ (part of $y_n$ dependent on $\hat{x}_{i,n}$).

As the CLAD may present multiple minima, some criterion has to be used to determine which are the most reasonable ones. This is done by measuring the slenderness or acuteness of the minima. In the case considered, three main minima are detected using this principle, positioned at the angles given in the following table:

Channel    Angular position
34         -12.5027º
51         0º
68         +12.5027º

Table 1. Angular positions for the three minima of the function in (22) giving the estimation of real source DOAs.

When the DOAs given in the table above are used as input arguments in the angle spectrogram of the estimated source arrival:

$$X_i(\varphi_{im}, \omega)_{dB} = 20 \log_{10} \left| \hat{X}_{im}(\varphi_i, \omega) \right| \tag{25}$$

the power spectral densities given in Figure 12, Figure 13 and Figure 14 are found. In turn, when the angle spectrogram of the orthogonal component of $y_n$:

$$X_i^o(\varphi_{im}, \omega)_{dB} = 20 \log_{10} \left| X^o_{im}(\varphi_i, \omega) \right| \tag{26}$$

is searched for the same DOAs, the power spectral densities given in Figure 15, Figure 16 and Figure 17 may be found.

Figure 12. Power spectral density of $\hat{x}_{i,n}$ for $\varphi_i$ = -12.25º. The spectral line corresponding to 500 Hz has been enhanced.

Figure 13. Power spectral density of $\hat{x}_{i,n}$ for $\varphi_i$ = 0º. The spectral line corresponding to 1,000 Hz has been enhanced.

Figure 14. Power spectral density of $\hat{x}_{i,n}$ for $\varphi_i$ = +12.25º. The spectral line corresponding to 2,000 Hz has been enhanced.

Figure 15. Power spectral density of $x^o_{i,n}$ for $\varphi_i$ = -12.25º. The spectral lines complementary to the one of 500 Hz have been enhanced.

Figure 16. Power spectral density of $x^o_{i,n}$ for $\varphi_i$ = 0º. The spectral lines complementary to the one of 1,000 Hz have been enhanced.

Figure 17. Power spectral density of $x^o_{i,n}$ for $\varphi_i$ = +12.25º. The spectral lines complementary to the one of 2,000 Hz have been enhanced.

A last check was carried out to test the validity of the hypotheses implied by conditions (7) and (8), plotting the cosine of the angles between $\hat{x}_{i,n}$ and $x^o_{i,n}$:

$$\cos(\hat{x}_{i,n}, x^o_{i,n}) = \frac{E\{ \hat{x}_{i,n}, x^o_{i,n} \}}{\| \hat{x}_{i,n} \| \, \| x^o_{i,n} \|} \tag{27}$$

and between $\hat{x}_{i,n}$ and $y_n$:

$$\cos(\hat{x}_{i,n}, y_n) = \frac{E\{ \hat{x}_{i,n}, y_n \}}{\| \hat{x}_{i,n} \| \, \| y_n \|} \tag{28}$$

The results are given in Figure 18.

Figure 18. Cosine of the angles between the estimators of $x_i$ vs. $x_i^o$, and $x_i$ vs. $y$.

It may be seen that these angles stay around 90º for most of the angular span of interest, and reach orthogonality at the same values, these coinciding strictly with the ones where the sources are located, as given in the table below:

$\hat{x}_{i,n}$ vs. $x^o_{i,n}$ (Channel #)    $\hat{x}_{i,n}$ vs. $y_n$ (Channel #)    DOA (Angle, º)
37                                              37                                        -10.2963
51                                              51                                        0
57                                              57                                        4.4127
66                                              66                                        11.0318
70                                              70                                        13.9736
72                                              72                                        15.4445
77                                              77                                        19.1217

Table 2. Positions where the estimators of $x_i$ vs. $x_i^o$, and $x_i$ vs. $y$ are mutually orthogonal.

This means that the FODB output is statistically independent from the detected source (complete separation) at these points. This property may be used for DOA detection. This promising result is being studied in more depth, and the results obtained are to be extended to other situations with signals in a real acoustical environment.

5. ACKNOWLEDGMENTS

This research is being carried out under grants TIC990960 and TIC2002-02273 from the Programa Nacional de las Tecnologías de la Información y las Comunicaciones (Spain), grant 07T-0001-2000 from the Plan Regional de Investigación de la Comunidad de Madrid, and a collaboration contract between Universidad Politécnica de Madrid and the Centre Suisse d’Electronique et de Microtechnique.

6. REFERENCES

[1] Álvarez, A., Gómez, P., Nieto, V., Martínez, R., Rodellar, V., “Speech Enhancement and Source Separation supported by Negative Beamforming Filtering”, Proc. of the 6th ICSP, Beijing, China, August 26-29, 2002, pp. 342-345.

[2] Elko, G. W., “Microphone array systems for hands-free telecommunication”, Speech Communication, Vol. 20, No. 3-4, 1996, pp. 229-240.

[3] Gómez, P., Álvarez, A., Martínez, R., Nieto, V., Rodellar, V., “Optimal Steering of a Differential Beamformer for Speech Enhancement”, Proc. of EUSIPCO’02, Vol. III, Toulouse, France, 3-6 September, 2002, pp. 233-236.

[4] Gómez, P., Álvarez, A., Martínez, R., Nieto, V., Rodellar, V., “Time-Domain Steering of a Differential Beamformer for Speech Enhancement and Source Separation”, Proc. of the 6th ICSP, Beijing, China, August 26-29, 2002, pp. 338-341.

[5] Haykin, S., Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, N. J., 1996.

[6] Hyvärinen, A., Karhunen, J., Oja, E., Independent Component Analysis, John Wiley & Sons, New York, 2001.

[7] Proakis, J. G., Digital Communications, McGraw-Hill, 1989.

[8] Van Trees, H. L., Optimum Array Processing, John Wiley, N. Y., 2002.
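The orthogonality conditions (12) and (13), and the cosine measures (27) and (28) used for Table 2, can be illustrated with a minimal Python sketch. It is not the gradient lattice-ladder filter of Figure 4: a batch least-squares projection on delayed copies of the reference is swapped in, since it reaches the same least-squares optimum of (11) for a causal FIR model. The function names `delayed_matrix`, `jpe_least_squares` and `cosine`, and the synthetic signals, are assumptions introduced here for illustration.

```python
import numpy as np

def delayed_matrix(r, order):
    """Columns hold r delayed by 0..order samples (zero-padded at the start)."""
    n_samples = len(r)
    R = np.zeros((n_samples, order + 1))
    for k in range(order + 1):
        R[k:, k] = r[:n_samples - k]
    return R

def jpe_least_squares(s, r, order):
    """Least-squares stand-in for the JPE: returns (s_hat, e) as in (9), (10)."""
    R = delayed_matrix(r, order)
    w, *_ = np.linalg.lstsq(R, s, rcond=None)
    s_hat = R @ w
    return s_hat, s - s_hat

def cosine(a, b):
    """Cosine of the angle between two signals, cf. (27) and (28)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical frame: N = 512 samples, JPE order K = 8, fs = 11,025 Hz.
N, K, fs = 512, 8, 11025.0
n = np.arange(N)
rng = np.random.default_rng(0)

y = rng.standard_normal(N)                   # stand-in for the FODB output (reference r)
common = 0.8 * delayed_matrix(y, 2)[:, 2]    # part of x shared with y: scaled, delayed copy
tone = np.sin(2.0 * np.pi * 500.0 / fs * n)  # stand-in for the source nulled by the FODB
x = common + tone                            # stand-in for the FODB input (signal s)

x_d, x_o = jpe_least_squares(x, y, K)        # associations (14) to (17)

cos_e_shat = cosine(x_o, x_d)   # property (12): e orthogonal to s_hat
cos_e_r = cosine(x_o, y)        # property (13): e orthogonal to r
cos_e_tone = cosine(x_o, tone)  # how closely the error follows the hidden source
print(cos_e_shat, cos_e_r, cos_e_tone)
```

In this toy case the error output retains the component of x that cannot be predicted from y, which is the role assigned to $x_n^o$ in (17). An adaptive lattice-ladder implementation would approach the same solution sample by sample rather than in one batch.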