4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), April 2003, Nara, Japan
FIRST-ORDER DIFFERENTIAL BEAMFORMING AND JOINT-PROCESS
ESTIMATION FOR SPATIAL SOURCE SEPARATION
P. Gómez, V. Nieto, A. Álvarez, R. Martínez, F. Rodríguez, F. Díaz, V. Rodellar
Departamento de Arquitectura y Tecnología de Sistemas Informáticos, Universidad Politécnica de Madrid,
Campus de Montegancedo, s/n, 28660, Boadilla del Monte, Madrid,
tel.:+34.91.336.7384, fax: +34.91.336.6601, e-mail: pedro@pino.datsi.fi.upm.es
ABSTRACT
Speech enhancement is required to ensure the success of
speech recognition systems working under strongly noisy
conditions, and to ensure intelligibility in speech
transmission and coding. Array beamforming has
traditionally been used to improve the signal-to-noise
ratio. Two-sensor systems based on First-Order
Differential Beamformers (FODB) have been proposed as a
promising alternative [G. Elko, 1996]. Nevertheless, null
beamformers alone do not provide sufficient separation
levels. In this paper FODBs and Joint-Process Estimators
(JPEs) are combined to achieve speech source separation.
Results for superposed sinusoidal sources are presented.
1. INTRODUCTION
A First-Order Differential Beamformer (FODB) is a structure that uses two microphones and combines their signals so as to produce a null at a given Direction of Arrival (DOA) $\varphi_i$, as given in:
$$F(\varphi = \varphi_i) = 1 - \delta(\varphi_i); \quad -\pi \le \varphi_i \le \pi \qquad (1)$$
$\varphi_i$ being the angular DOA where source $s_i$ is located, and $\delta$ Dirac's delta function. This structure may be used as shown in Figure 1 to provide source separation according to the respective DOA of each impinging source.
Figure 1. Simplified structure of the Source Separator (SS), $x_1$ and $x_2$ being the outputs of microphones $m_1$ and $m_2$, $x$ the output of the equivalent cardioid microphone, $y$ the beamformer output, and $\varepsilon$ the estimate of the detected source.
In Figure 1 the parameter $\beta$ is the steering factor of the FODB, which controls the DOA of the null (see [1], [3], [4] for further details, and Figure 2). The output of the filter will be defined in general as:
$$y = x F(\varphi_i) \qquad (2)$$
$x$ being the equivalent input to the FODB, which may be evaluated from the signals arriving at each sensor. Therefore, the output of the FODB will contain information coming from any DOA except $\varphi_i$.
Figure 2. Source composition model: $s_i$ and $s_j$ are two (real) sources, $s_{ik}$ and $s_{jm}$ being the respective multiple-path arrivals (apparent sources) corresponding to each real source.
The sources in Figure 2 are divided into primary (real)
sources, such as $s_i$ at $\varphi_i$ or $s_j$ at $\varphi_j$, and secondary
(apparent) sources, that is, multiple-path arrivals and
reverberations such as $s_{ik}$ or $s_{jm}$. With this in mind the
following hypotheses are established:
• Sources are mutually independent in a statistical sense
(orthogonal under correlation).
• Reverberations are dependent on their corresponding
sources within a given time lag.
• Reverberations corresponding to one source are
independent of those corresponding to another.
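The three hypotheses above can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the sources are modelled as white Gaussian noise, the reverberation as a single attenuated echo, and the names `lagged_correlation`, `s_ik`, `delay` and the gain 0.6 are hypothetical, not taken from the paper. It estimates the lagged correlation between an independent pair of sources and between a source and its own echo.

```python
import numpy as np

def lagged_correlation(a, b, max_lag):
    """Sample estimate of E{a[n] * b[n+k]} for k = -max_lag..max_lag."""
    n = len(a)
    lags = np.arange(-max_lag, max_lag + 1)
    corr = np.empty(len(lags))
    for idx, k in enumerate(lags):
        if k >= 0:
            corr[idx] = np.mean(a[: n - k] * b[k:])
        else:
            corr[idx] = np.mean(a[-k:] * b[: n + k])
    return lags, corr

rng = np.random.default_rng(0)
n_samples = 20000
s_i = rng.standard_normal(n_samples)   # real source i (assumed white noise)
s_j = rng.standard_normal(n_samples)   # real source j, independent of s_i

# Apparent source: a delayed, attenuated copy of s_i (assumed single echo).
delay = 7
s_ik = np.zeros(n_samples)
s_ik[delay:] = 0.6 * s_i[:-delay]

lags, c_indep = lagged_correlation(s_i, s_j, 20)   # stays close to zero
_, c_echo = lagged_correlation(s_i, s_ik, 20)      # peaks at the echo delay
peak_lag = int(lags[np.argmax(np.abs(c_echo))])
```

In this toy case the correlation between the two real sources stays near zero at every lag, while the source and its echo are correlated only around the echo delay, which is the "within a given time lag" qualification of the second hypothesis.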
From these assumptions the following definitions are introduced:
• Let $S$ be the set of all sources (real or apparent) inducing signal on both microphones $m_1$ and $m_2$, each defined by the pair $(\varphi_j, s_{j,n})$: $s_{j,n} \in \Re^n$, $\varphi_j \in [-\pi, \pi]$.
• Let $S_i^d$ be the set of sources (real or apparent) dependent on the given source $s_i$:
$$S_i^d = \{ s_j \in S : E\{s_{i,n} s_{j,n+k}\} \neq 0; \ \forall k \in Z \} \qquad (3)$$
• Let $S_i^o$ be the set of sources (real or apparent) independent of the given source $s_i$:
$$S_i^o = \{ s_j \in S : E\{s_{i,n} s_{j,n+k}\} = 0; \ \forall k \in Z \} \qquad (4)$$
• Let $x_i$ be the component of $x$ contributed by $s_i$, the source being aimed at in $\varphi_i$.
• Let $x_i^d$ be the component of $x$ contributed by $S_i^d$, or dependent component:
$$x_i^d = \sum_{\forall s_j \in S_i^d} h(s_j) \qquad (5)$$
• Let $x_i^o$ be the component of $x$ contributed by $S_i^o$, or orthogonal component:
$$x_i^o = \sum_{\forall s_j \in S_i^o} h(s_j) \qquad (6)$$
where $h(s_j)$ is assumed to be a linear function accounting for the influence of the propagation medium, the sensor transfer function, and the pre-processing stages on the incoming sound. As a consequence of the above, it will be assumed that the following properties hold:
$$E\{x_{i,n} x_{i,n+k}^o\} = 0; \ \forall k \in Z \ \Rightarrow \ x_{i,n} \perp x_{i,n+k}^o \qquad (7)$$
$$E\{x_{i,n}^d x_{i,n+k}^o\} = 0; \ \forall k \in Z \ \Rightarrow \ x_{i,n}^d \perp x_{i,n+k}^o \qquad (8)$$
This means that the input signal of the FODB may be split into two parts, mutually independent within a time span: $x_i + x_i^d$ (contributions associated with source $s_i$, direct and multiple-path) and $x_i^o$ (contributions from other sources, direct and multiple-path).
Source in $\varphi_i$ | $x$ | $y$ | $\varepsilon = x - y$
none | $x_{i,n}^d + x_{i,n}^o$ | $x_{i,n}^d + x_{i,n}^o$ | $\to 0$
$s_i$ | $x_{i,n} + x_{i,n}^d + x_{i,n}^o$ | $x_{i,n}^d + x_{i,n}^o$ | $\to x_{i,n}$
The table above summarizes the two possible cases that arise when the FODB is aimed at a given angular DOA $\varphi_i$. If a source is present at that DOA, the output of the subtractor will be non-null and may be taken as the estimate $\hat{x}_i$. If only reverberation or multiple-path contributions from other sources are present, the output of the subtractor will be much weaker. Estimators of second- and higher-order statistics of $x_i$ may give hints on where the sources come from.
2. SOURCE SEPARATION METHODOLOGY
Source separation cannot be accurately implemented by simple subtraction, as there are channel inequalities and delays present in both $x_1$ and $x_2$, which make it impossible simply to subtract one trace from the other. Instead, a more accurate procedure is used, based on the method of projections between signals by means of joint-process estimation, as shown in Figure 3.a.
Figure 3. a) JPE used. b) Whole structure implementing source separation.
A joint-process estimator (JPE) may be seen as a system projecting an input signal $s$ on a reference signal $r$, producing an output which is the estimate of $s$ on $r$, given by:
$$\hat{s} = \Im_K\{s, r\} \qquad (9)$$
and an estimation error, given by:
$$e = s - \hat{s} = s - \Im_K\{s, r\} \qquad (10)$$
where $\Im_K\{*,*\}$ is the linear operator representing the projection performed by the JPE, implemented as an adaptive filter as shown in Figure 4.
[Figure 4 panels: "Structure of the Gradient Lattice-Ladder Equalizer" (top), with inputs $s(n) = x(n)$ and $r(n) = y(n)$, ladder gains $g_0 \ldots g_K$, and outputs $x^d(n)$ and $e_{K+1} = x^o(n)$; "Structure of a given Lattice stage" (bottom), with coefficient $c_k$ and a unit delay $z^{-1}$.]
Figure 4. Detailed structure of the Lattice-Ladder filter showing
the general architecture (top) and the data flow diagram of each
stage (bottom).
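A filter of the kind shown in Figure 4 can be sketched as a textbook-style gradient adaptive lattice with a ladder section. The code below is a minimal sketch under assumptions, not the authors' implementation: the step size `mu`, the power-smoothing factor `beta`, the normalised update rules and the test signals are all choices made here for illustration, and only the names `c` (stage coefficients) and `g` (ladder gains) follow the labels in the figure.

```python
import numpy as np

def gal_joint_process(s, r, order, mu=0.01, beta=0.99, eps=1e-8):
    """Gradient lattice-ladder joint-process estimator (illustrative sketch).

    s is the input to be estimated and r the reference. Returns (s_hat, e).
    Keeping mu <= 1 - beta keeps every normalised step at or below one.
    """
    n = len(s)
    c = np.zeros(order + 1)                 # stage coefficients c[1..order]
    g = np.zeros(order + 1)                 # ladder gains g[0..order]
    b_prev = np.zeros(order + 1)            # backward errors b_k(n-1)
    stage_power = np.full(order + 1, eps)   # smoothed power at each stage input
    b_power = np.full(order + 1, eps)       # smoothed power of each b_k
    s_hat = np.zeros(n)
    e = np.zeros(n)
    for t in range(n):
        f = np.zeros(order + 1)
        b = np.zeros(order + 1)
        f[0] = b[0] = r[t]
        for k in range(1, order + 1):
            # Lattice stage: forward and backward prediction errors.
            f[k] = f[k - 1] + c[k] * b_prev[k - 1]
            b[k] = b_prev[k - 1] + c[k] * f[k - 1]
            stage_power[k] = beta * stage_power[k] + (1 - beta) * (
                f[k - 1] ** 2 + b_prev[k - 1] ** 2)
            # Gradient step on f_k^2 + b_k^2 with respect to c_k.
            grad = f[k] * b_prev[k - 1] + b[k] * f[k - 1]
            c[k] -= mu * grad / stage_power[k]
        err = s[t]
        for k in range(order + 1):
            # Ladder: remove the part of s explained by b_k, then adapt g_k.
            err -= g[k] * b[k]
            b_power[k] = beta * b_power[k] + (1 - beta) * b[k] ** 2
            g[k] += mu * err * b[k] / b_power[k]
        e[t] = err
        s_hat[t] = s[t] - err
        b_prev = b
    return s_hat, e

# Hypothetical test signals: s shares a filtered copy of r plus a tone u.
rng = np.random.default_rng(1)
n = 8000
r = rng.standard_normal(n)
u = np.sin(2 * np.pi * 0.05 * np.arange(n))
s = np.convolve(r, [0.8, -0.4, 0.2])[:n] + u
s_hat, e = gal_joint_process(s, r, order=4)
tail = slice(6000, None)
residual_power = float(np.mean((e[tail] - u[tail]) ** 2))
```

After the initial transient the error `e` should be close to the component `u` that is not shared with the reference, while `s_hat` follows the filtered copy of `r`. A recursive least-squares lattice would converge faster; the gradient form is used here only because it is short.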
It is well known that when the operator $\Im_K\{*,*\}$ has been optimally adapted the norm of the estimation error is minimum in a least-squares sense [5]:
$$\| s - \Im_o\{s, r\} \| = \min \| s - \Im\{s, r\} \| \qquad (11)$$
In what follows, it will be assumed that the JPE has been brought to this condition, under which the following orthogonality properties hold:
$$E\{e_n \hat{s}_{n+k}\} = 0; \ 0 \le k \le K \ \Rightarrow \ e \perp \hat{s} \qquad (12)$$
$$E\{e_n r_{n+k}\} = 0; \ 0 \le k \le K \ \Rightarrow \ e \perp r \qquad (13)$$
where $K$ is the order of the adaptive filter, which will be used to extract $x_i$ by subtracting $y$ from $x$. This situation is described in Figure 5.
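The orthogonality of the error at the optimum can be checked with a batch least-squares projection, which stands in here for a fully converged adaptive filter. This is a sketch under assumptions: `jpe_least_squares` is a hypothetical helper, the reference is white noise, and the lags are taken as causal delays $r_{n-k}$, which is a choice of sign convention made for this illustration.

```python
import numpy as np

def jpe_least_squares(s, r, order):
    """Project s onto r[n], r[n-1], ..., r[n-order] in the least-squares sense."""
    n = len(s)
    # Column k holds r delayed by k samples, zero-padded at the start.
    lagged = np.column_stack(
        [np.concatenate([np.zeros(k), r[: n - k]]) for k in range(order + 1)]
    )
    weights, *_ = np.linalg.lstsq(lagged, s, rcond=None)
    s_hat = lagged @ weights
    return s_hat, s - s_hat, lagged

rng = np.random.default_rng(0)
n = 4096
r = rng.standard_normal(n)                          # reference signal
common = np.convolve(r, [0.8, -0.4, 0.2])[:n]       # part of s explained by r
other = np.sin(2 * np.pi * 0.05 * np.arange(n))     # part of s unrelated to r
s = common + other

s_hat, e, lagged = jpe_least_squares(s, r, order=8)
cross_e_r = lagged.T @ e / n         # sample version of E{e_n r_(n-k)}, k = 0..K
cross_e_shat = float(e @ s_hat) / n  # sample version of E{e_n s_hat_n}
```

Both cross terms vanish to numerical precision, and the error `e` closely follows the component of `s` that has nothing in common with the reference.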
[Figure 5 labels: input $s_n$, optimal estimate $\hat{s}_n^o$, error $e_n^o$, reference components $r_{0,n}$ and $r_{1,n}$, plane $M_2$.]
Figure 5. Under optimal conditions the error signal $e_n^o$ is rendered orthogonal to the estimate of the input signal $s_n$ defined on the plane $M_2$.
With this in mind the JPE will recover the components common to the reference $y_n$ and the input $x_n$, and produce an error which is the uncorrelated (or complementary) part between $x$ and $y$, for which the following associations are established:
$$s = x_n \qquad (14)$$
$$r = y_n \qquad (15)$$
$$\hat{s} = x_n^d = \Im\{x_n, y_n\} \qquad (16)$$
$$e = x_n^o = \begin{cases} \hat{x}_{i,n}; & \varphi = \varphi_{s_i} \\ \to 0; & \text{other DOAs} \end{cases} \qquad (17)$$
This set of relationships is implemented by the structure given in Figure 3.b. The lattice-ladder filter algorithm [7] supporting the structure given in Figure 4 will be used.
3. APPARENT AND COMPLEMENTARY SOURCES
Assuming that the evaluation of $\hat{x}_{i,n}$ (estimator of the source contribution $x_{i,n}$) is accurate enough, a further step may be taken towards evaluating also $\hat{x}_{i,n}^d$ and $\hat{x}_{i,n}^o$ (respective estimators of $x_{i,n}^d$ and $x_{i,n}^o$). For this purpose properties (7) and (8) are recalled, exploiting the JPE again. In this case, as $x_{i,n}^d$ and $x_{i,n}^o$ are components of $y_n$, they may be evaluated by projecting $y_n$ on the source estimate $\hat{x}_{i,n}$ used as reference, i.e.:
$$s = y_n \qquad (18)$$
$$r = \hat{x}_{i,n} = x_n^o \qquad (19)$$
$$\hat{s} = x_{i,n}^d = \Im\{y_n, x_n^o\} \qquad (20)$$
$$e = x_{i,n}^o = \begin{cases} \hat{x}_{j,n}; & \varphi = \varphi_{s_j}; \ \forall j \neq i \\ \to 0; & \text{other DOAs} \end{cases} \qquad (21)$$
for which the orthogonalization properties of the JPE are to be exploited. A very delicate issue is that of the JPE order. It has been implied that the orders of the JPEs used to estimate $\hat{x}_{i,n}$ ($K_1$) and $\hat{x}_{j,n}$ ($K_2$) meet the condition $K_1 \ll K_2 \ll N$, where $N$ is the size of the signal frame. Practical values may be $K_1 = 8$, $K_2 = 32$ for $N = 512$.
4. RESULTS AND DISCUSSION
To check the practical viability of the described methodology, a situation was simulated in which three sinusoids of equal amplitude and frequencies of 0.5 kHz, 1 kHz and 2 kHz arrive from -12.25º, 0º and +12.25º respectively (far field), as reflected in the captions of Figures 12 to 17. The sampling frequency was assumed to be 11,025 Hz. The spectral density of the resulting composition at the FODB input ($x_n$) is given in Figure 6.
Figure 6. Power spectral density of the equivalent input signal to the FODB, $x_n$.
This signal is processed by the FODB, generating an output $y_n$, which is shown distributed over the whole angular span in Figure 7.
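The test signal used in Section 4 (three equal-amplitude sinusoids at 0.5, 1 and 2 kHz sampled at 11,025 Hz, in frames of $N = 512$ samples) can be reproduced at a single sensor as follows. This is a minimal sketch under assumptions: the directions of arrival are not modelled, because the microphone spacing is not given in the text, and the Hann window and the peak-picking rule are choices made here for illustration.

```python
import numpy as np

fs = 11025.0                       # sampling frequency in Hz
n = 512                            # frame size N
freqs_hz = (500.0, 1000.0, 2000.0)

t = np.arange(n) / fs
x = sum(np.sin(2 * np.pi * f * t) for f in freqs_hz)   # composite signal

# Magnitude spectrum in dB, using a Hann window to limit leakage.
spectrum = np.abs(np.fft.rfft(x * np.hanning(n)))
psd_db = 20 * np.log10(spectrum + 1e-12)
bin_hz = np.fft.rfftfreq(n, d=1 / fs)

# Local maxima of the spectrum; keep the three strongest ones.
is_peak = (spectrum[1:-1] > spectrum[:-2]) & (spectrum[1:-1] > spectrum[2:])
peak_bins = np.flatnonzero(is_peak) + 1
top3 = peak_bins[np.argsort(spectrum[peak_bins])[-3:]]
peak_freqs = np.sort(bin_hz[top3])
```

With a 512-sample frame the bin spacing is about 21.5 Hz, so the three spectral lines are located only to within one bin of their nominal frequencies.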
Figure 7. Angular span of the FODB output signal $y_n$, where the steering factor has been swept over 101 channels across the angular range.
Figure 8. Angular span of $x_n^d$ (JPE output dependent on $y_n$).
Figure 9. Angular span of $x_n^o$ (JPE output orthogonal to $y_n$).
Figure 10. Angular span of $x_{i,n}^d$ (part of $y_n$ dependent on $\hat{x}_{i,n}$).
Figure 11. Angular span of $x_{i,n}^o$ (part of $y_n$ orthogonal to $\hat{x}_{i,n}$).
The signal in Figure 10 ($x_{i,n}^d$) may be composed of multiple-path contributions, or of the part of the contribution from source $s_i$ which has not been removed from $y_n$ by the FODB, because the real behavior of the system differs from the ideal model implied in (2): the effective bandwidth of the notch is not null, as it should be in the ideal case. In turn, the signal in Figure 11 ($x_{i,n}^o$) may be considered as the set of complementary sources for each DOA at which the FODB has been aimed. The problem to be faced now is the following: given that all the signals involved are available in the angular spans of $\hat{x}_{i,n}$, $x_{i,n}^d$ and $x_{i,n}^o$, specific DOAs have to be determined in order to select the true output signals and so solve the source separation problem effectively. Several statistics may be used for this purpose, one of them being the energy distribution of $\hat{x}_{i,n}$. Other possible candidates are the Cumulative Logarithmic Angular Distribution (CLAD) of $\hat{x}_{i,n}$ or $x_{i,n}^o$, the latter being defined as:
$$C_{x_{i,n}^o}(\varphi_i) = \int_{\omega_1}^{\omega_2} \log_{10} \left| X_i^o(\varphi_i, \omega) \right| d\omega \qquad (22)$$
where $\omega_1$ and $\omega_2$ are the limits of the frequency span considered, and:
$$X_i^o(\varphi_i, \omega) = \frac{1}{N} \sum_{n=0}^{N-1} x_{i,n}^o e^{-j\omega n} \qquad (23)$$
The CLAD may be seen as the overlapping of the angular profiles of the output signals. The minima of this function mark those DOAs from which energy has been removed, and therefore indicate the possible presence of real sources:
$$\varphi_{im} = \arg\min \left\{ \left[ C_{x_{i,n}^o}(\varphi_i) \right] \right\} \qquad (24)$$
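A discrete version of (22) to (24) can be sketched as below. It is an illustration under assumptions rather than the procedure used for the results: the integral is replaced by a sum over FFT bins, the magnitude of the spectrum is taken before the logarithm, and the 101-channel test data (a tone whose gain dips at two hypothetical channel indices) are synthetic. The helper names `clad` and `local_minima` are hypothetical.

```python
import numpy as np

def clad(channels, fs, f_lo, f_hi):
    """Cumulative log-spectrum per steering channel.

    channels has shape (n_channels, n_samples), one row per steering angle.
    """
    n = channels.shape[1]
    spectra = np.abs(np.fft.rfft(channels, axis=1)) / n
    freqs = np.fft.rfftfreq(n, d=1 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    d_omega = 2 * np.pi * (freqs[1] - freqs[0]) / fs   # normalised step
    return np.sum(np.log10(spectra[:, band] + 1e-12), axis=1) * d_omega

def local_minima(curve):
    """Indices whose value is strictly below both neighbours."""
    inner = (curve[1:-1] < curve[:-2]) & (curve[1:-1] < curve[2:])
    return np.flatnonzero(inner) + 1

# Synthetic angular span: a 500 Hz tone attenuated around two channel indices.
fs, n = 11025.0, 512
tone = np.sin(2 * np.pi * 500.0 * np.arange(n) / fs)
idx = np.arange(101)
gain = 0.05 + np.minimum(np.abs(idx - 33), np.abs(idx - 67)) / 50.0
channels = gain[:, None] * tone[None, :]

curve = clad(channels, fs, 100.0, 3000.0)
minima = local_minima(curve)
```

The text selects among multiple minima by their slenderness; that criterion is not specified in detail, so the sketch simply lists every local minimum.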
As the CLAD may present multiple minima, some criterion has to be used to determine which are the most reasonable ones. This is done by measuring the slenderness or acuteness of the minima. In the case considered, three main minima are detected using this principle, positioned at the angles given in the following table:
Channel | Ang. position
34 | -12.5027º
51 | 0º
68 | +12.5027º
Table 1. Angular positions of the three minima of the function
in (22), giving the estimates of the real source DOAs.
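The channel numbers and angles in Table 1 are consistent with a uniform angular grid centred on channel 51. The sketch below makes that arithmetic explicit; the linear mapping and the numbering of channels from 1 to 101 are inferences from the tabulated values, not statements from the text, and `channel_to_angle_deg` is a hypothetical helper.

```python
CENTER_CHANNEL = 51                   # channel reported at 0º in Table 1
STEP_DEG = 12.5027 / (68 - 51)        # about 0.7355º per channel

def channel_to_angle_deg(channel):
    """Map a steering channel number to an angle, assuming a uniform grid."""
    return (channel - CENTER_CHANNEL) * STEP_DEG

table_1_angles = [channel_to_angle_deg(c) for c in (34, 51, 68)]
# If channels run from 1 to 101, the grid covers roughly -36.8º to +36.8º.
angular_range = (channel_to_angle_deg(1), channel_to_angle_deg(101))
```

The same step also reproduces the channel and angle pairs listed in Table 2 (for example channel 77 at about 19.12º). Under this grid the nominal source angles of ±12.25º fall between channels, which would explain why the detected minima sit at ±12.5027º.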
When the DOAs given in the table above are used as input arguments in the angle spectrogram of the estimated source arrival:
$$X_i(\varphi_{im}, \omega)_{dB} = 20 \log_{10} \left| \hat{X}_{im}(\varphi_i, \omega) \right| \qquad (25)$$
the power spectral densities given in Figure 12, Figure 13 and Figure 14 are found.
Figure 12. Power spectral density of $\hat{x}_{i,n}$ for $\varphi_i$ = -12.25º. The spectral line corresponding to 500 Hz has been enhanced.
Figure 13. Power spectral density of $\hat{x}_{i,n}$ for $\varphi_i$ = 0º. The spectral line corresponding to 1,000 Hz has been enhanced.
Figure 14. Power spectral density of $\hat{x}_{i,n}$ for $\varphi_i$ = +12.25º. The spectral line corresponding to 2,000 Hz has been enhanced.
In turn, when the angle spectrogram of the orthogonal component of $y_n$:
$$X_i^o(\varphi_{im}, \omega)_{dB} = 20 \log_{10} \left| X_{im}^o(\varphi_i, \omega) \right| \qquad (26)$$
is searched for the same DOAs, the power spectral densities given in Figure 15, Figure 16 and Figure 17 may be found.
Figure 15. Power spectral density of $x_{i,n}^o$ for $\varphi_i$ = -12.25º. The spectral lines complementary to the one of 500 Hz have been enhanced.
Figure 16. Power spectral density of $x_{i,n}^o$ for $\varphi_i$ = 0º. The spectral lines complementary to the one of 1,000 Hz have been enhanced.
A last check was carried out to test the validity of the hypotheses implied by conditions (7) and (8), plotting the cosine of the angles between $\hat{x}_{i,n}$ and $x_{i,n}^o$:
$$\cos(\hat{x}_{i,n}, x_{i,n}^o) = \frac{E\{\hat{x}_{i,n}, x_{i,n}^o\}}{\|\hat{x}_{i,n}\| \, \|x_{i,n}^o\|} \qquad (27)$$
and between $\hat{x}_{i,n}$ and $y_n$:
$$\cos(\hat{x}_{i,n}, y_n) = \frac{E\{\hat{x}_{i,n}, y_n\}}{\|\hat{x}_{i,n}\| \, \|y_n\|} \qquad (28)$$
The results are given in Figure 18.
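The cosine used in this check is the normalised inner product of two signal frames. A minimal sketch follows, assuming that the expectation is replaced by a sum over one frame; `signal_cosine` and the two test tones are hypothetical and serve only to show the two limiting cases.

```python
import numpy as np

def signal_cosine(a, b):
    """Cosine of the angle between two equal-length signal frames."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

n = 512
phase = 2 * np.pi * 8 * np.arange(n) / n     # exactly 8 periods in the frame
a = np.sin(phase)
b = np.cos(phase)

cos_orthogonal = signal_cosine(a, b)         # quadrature tones: orthogonal
cos_identical = signal_cosine(a, a)          # a signal against itself
angle_deg = float(np.degrees(np.arccos(np.clip(cos_orthogonal, -1.0, 1.0))))
```

A cosine near zero (an angle near 90º) corresponds to the orthogonality condition that the check looks for across the angular span.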
Figure 17. Power spectral density of $x_{i,n}^o$ for $\varphi_i$ = +12.25º. The spectral lines complementary to the one of 2,000 Hz have been enhanced.
Figure 18. Cosine of the angles between the estimators of $x_i$ vs. $x_i^o$, and $x_i$ vs. $y$.
It may be seen that these angles stay around 90º for most of the angular span of interest, and reach orthogonality at the same values, these coinciding strictly with the ones where the sources are located, as given in the table below:
$\hat{x}_{i,n}$ vs. $x_{i,n}^o$ (Channel #) | $\hat{x}_{i,n}$ vs. $y_n$ (Channel #) | DOA (Angle, º)
37 | 37 | -10.2963
51 | 51 | 0
57 | 57 | 4.4127
66 | 66 | 11.0318
70 | 70 | 13.9736
72 | 72 | 15.4445
77 | 77 | 19.1217
Table 2. Positions where the estimators of $x_i$ vs. $x_i^o$, and $x_i$ vs. $y$, are mutually orthogonal.
This means that the FODB output is statistically independent of the detected source (complete separation) at these points. This property may be used for DOA detection. This promising result is being studied in more depth, and the results obtained are to be extended to other situations with signals in a real acoustical environment.
5. ACKNOWLEDGMENTS
This research is being carried out under grants TIC990960 and TIC2002-02273 from the Programa Nacional de las Tecnologías de la Información y las Comunicaciones (Spain), grant 07T-0001-2000 from the Plan Regional de Investigación de la Comunidad de Madrid, and a collaboration contract between Universidad Politécnica de Madrid and the Centre Suisse d’Electronique et de Microtechnique.
6. REFERENCES
[1] Álvarez, A., Gómez, P., Nieto, V., Martínez, R., Rodellar, V., “Speech Enhancement and Source Separation supported by Negative Beamforming Filtering”, Proc. of the 6th ICSP, Beijing, China, August 26-29, 2002, pp. 342-345.
[2] Elko, G. W., “Microphone array systems for hands-free telecommunication”, Speech Communication, Vol. 20, No. 3-4, 1996, pp. 229-240.
[3] Gómez, P., Álvarez, A., Martínez, R., Nieto, V., Rodellar, V., “Optimal Steering of a Differential Beamformer for Speech Enhancement”, Proc. of EUSIPCO’02, Vol. III, Toulouse, France, 3-6 September, 2002, pp. 233-236.
[4] Gómez, P., Álvarez, A., Martínez, R., Nieto, V., Rodellar, V., “Time-Domain Steering of a Differential Beamformer for Speech Enhancement and Source Separation”, Proc. of the 6th ICSP, Beijing, China, August 26-29, 2002, pp. 338-341.
[5] Haykin, S., Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, N. J., 1996.
[6] Hyvärinen, A., Karhunen, J., Oja, E., Independent Component Analysis, John Wiley & Sons, New York, 2001.
[7] Proakis, J. G., Digital Communications, McGraw-Hill, 1989.
[8] Van Trees, H. L., Optimum Array Processing, John Wiley, N. Y., 2002.