Mathematical Representation of Joint Time

advertisement
Mathematical Representation of Joint Time-Chroma Distributions
Gregory H. Wakefield
Department of Electrical Engineering and Computer Science, University of Michigan
ABSTRACT
Originally coined by the sensory psychologist Roger Shepard in the 1960's, chroma transforms frequency into octave
equivalence classes. By extending the concept of chroma to chroma strength and how it varies over time, we have
demonstrated the utility of chroma in simplifying the processing and representation of signals dominated by harmonically-related narrowband components. These investigations have utilized an ad hoc procedure for calculating the
chromagram from a given time-frequency distribution. The present paper is intended to put this ad hoc procedure on
more sound mathematical ground.
Keywords: pitch, chroma, modal distribution, time-frequency distributions, auditory perception, harmonic sources
1. INTRODUCTION
in the early I 960's, Shepard reported that two dimensions rather than one are necessary to represent the perceptual
structure ofpitch [1]. He determined that the human auditory system's perception ofpitch was better represented as a
helix, than as a straight line, and coined the terms tone height and chroma to characterize the vertical and angular
dimensions, respectively. In this representation, as the pitch of a musical note increases, say from Cl to C2, its locus
moves along the helix, rotating chromatically through all of the pitch classes before it returns to the initial pitch class
(C) one cycle above the starting point.
According to Shepard's results, the pitch (p) in Hz of a signal can be factored into values of chroma (c) and tone
height (h)
p=2c+h
(1)
For this decomposition to be unique, it is sufficient for chroma to be a real number between 0 and 1 , and for tone
height to be an integer. Linear changes in c result in logarithmic changes in the frequency associated with the pitch.
By dividing the interval between 0 and I into twelve equal parts, the twelve pitches of the equal-tempered chromatic
scale are obtained. As the pitch continues to rise beyond the octave in the chromatic scale, the tone height increases
but the sequence of chroma values repeats. The implication of Shepard's representation is that the distance between
two pitches depends on c and h, rather than on p alone.
With respect to music, Shepard's findings are not surprising and instead, confirm the centuries-long common practice
of giving special emphasis to octave relations among notes in the musical scale. Indeed, music theorists use the terms
pitch class and octave number to denote chroma and tone height, respectively. We will continue to use Shepard's
terms, however, because of a more radical interpretation of Shepard's work that was advanced in the mid I 980's.
in the process of developing his Auditory Image Model in computational audition, Patterson generalized Shepard's
results tofrequency. Even though he substituted the Archimedian spiral for Shepard's helix as a basic representation
of frequency in the auditory system, the mapping from one dimension to two remains effectively the same. His pulseribbon model [2] transforms each temporal frame of the auditory image into an activity pattern along a spiral of ternporal lags, such that lag values along the same "spoke" of the spiral are octave multiples of each other.
Structurally, the frequency dual of a spiral decomposition using Patterson's lag variables is equivalent to the pitch
structure of Shepard's, e.g.,
f= 2c+h
(2)
where chroma denotes the angle along the spiral and tone height denotes the position along the common spoke. As in
the case ofpitch, a sufficient condition for this decomposition to be unique is that c e [0, 1 1 and h e Z, the setof
integers. Similar to ideas of pitch, certain frequencies under this system share the same chroma class if and only if
Part of the SPIE Conference on Advanced Signal Processing Algorithms, Architectures,
and Implementations IX. Denver, Colorado • July 1999
SPIE Vol. 3807 • 0277-786X/99/$1 0.00
Downloaded from SPIE Digital Library on 08 Feb 2010 to 165.124.180.206. Terms of Use: http://spiedl.org/terms
637
they are mapped to the same value of c. Thus, 200, 400, and 800 Hz share the same chroma class as 100 Hz, but 300
Hz does not.
Such a mapping ofa one-dimensional variable, like frequency or time lag, onto a two-dimensional coordinate system
appears, at first glance, to increase, rather than decrease, the complexity of an acoustic signal's representation. However, it implies a different type of geometry in the signal representation, which may have its own unique advantages.
Our work has studied the mathematical consequences of such two-dimensional representations of frequency as
chroma and tone-height, particularly with respect to the estimation of pitch; first, in the case of pitch contours of
Putonghua Chinese [3], and subsequently with respect to pitch processing in single- and multi-voiced music signals
[4,5]. Throughout this research, we have observed that chroma is not only useful as an intermediate representation in
the formation of pitch estimate, but as a representation, in its own right, of the acoustic signal.
2. THE CHROMAGRAM
The chromagram s(t, c) extends the concept of chroma to include the dimension of time. Just as we use the spectrogram s(t,J) to infer properties about the distribution of a signal's energy over frequency and time, the chromagram
can be used to infer properties about the distribution of a signal's energy over chroma and time. Two issues arise in
developing a mathematical representation for the chromagram. The first ofthese concerns definitions.
2.1. Definitions
Chroma is a mapping of frequency to the [0,1 ) interval. Analogous to the spectrum of a signal, we can define the
chroma spectrum, S(c) , as a measure of a signal's strength for a given value of chroma. The chromagram, S(t, c),
is a joint distribution of signal strength over the variables time and chroma.
2.2. Representation
The second issue concerns the representation ofthe chromagram. Our prior work in this area has considered the chromagram as a re-mapping of a time-frequency image, such as the spectrogram or some alternative time-frequency distribution, through some aggregation function G , e.g.,
s(t, c) = G(s(t,J);Vf 2c + h)
While this approach has proved useful in practice, we are interested in a more detailed development of a proper form
for the putative chromagram. In the following, we consider the operational properties of the chromagram mapping
and develop potential representations for such.
3. CHROMAGRAM DESIGN
The key concept ofchroma is that ofoctave equivalence. We hear the sinusoids 1 10, 220, 440, and 880 Hz, for exampie, as sharing the same pitch class ofA, even though they differ in their absolute frequencies. This property generalizes to multiple sinusoids so that the chromagrams of signals that are octave translates of each other should be the
same.
Accordingly, given two signals s and u such that
u(t,J) =
s(t,J2'2)
(4)
for integer n then
u(t,c) =
G[u(t,;Vf=2]
= G[s(t, ?);VX = 2ch1
In the construction of the set upon which G operates, since X = 12", the octave equivalence condition reduces to
Vf= 2c±h -,
(6)
Given that both h and n are integers, we have that the operator G will map the same octave sets of frequencies for
u(t,J) and s(t,J).
638
Downloaded from SPIE Digital Library on 08 Feb 2010 to 165.124.180.206. Terms of Use: http://spiedl.org/terms
What remains undetermined is whether G should depend on absolute frequency. That is, is there a critical frequency
beyond which a signal no longer conveys pitch? While opinions in the pitch perception literature differ concerning
this question, we appeal to the perceptual results of Shepard and claim that the chromagram for octave translates
should be the same; that is, G should be independent of h in the expansion ofthe frequency variable f. We return to
this assumption later in the paper to determine the effects of a cutoff frequency on chromagram analysis.
Accordingly, we have that
invariant Property. The chromagram is invariant to octave translation.
3.1. Relationship to the Scale Transform
Octave translation of a signal is achieved through time- or frequency-scaling, rather than frequency shifting. Specifically, following Cohen [6, p. 13], let T be the time operator
T=
—1do
(7)
I
then for the particular frequency o , frequency shifting by w
-ju0T
S(o)
SShIft(w) = e
= S(o+w0)
(8)
clearly doubles the desired frequency, but does not preserve any of the other ratios among frequencies within the
octave. That is
is a proper octave shift only at o =
Shft(w)
S(w + o)
S(o)
S(co)
9
o ; at any other frequency, o ao ,where a 1 . the ratio
(10)
Instead, to achieve a proper octave translation at all frequencies requires the scale operator
C lrdL°°idl
(11)
In this case,
=
Strans(O))
jlncNC
e
S(o)
= JS(ao)
(12)
such that for r = 2
Strans(ao)o)
[2S(2ao0)
(13)
for any pair a and 0)o
Given the goal of finding a distribution that is octave invariant, the scale transform, or its related joint time-scale
forms, would be a likely choice to construct the chromagram. Specifically, the scale transform [6, p. 246] is constructed from the (time) scaling operator as
D(c)
1
-jclnt
7=5s(t)—7--—dt
(14)
with respect to the scale variable c. In general, this is a complex function of c. To measure the signal's strength at a
given scale c, the magnitude squared of D(c) is taken, e.g.,
639
Downloaded from SPIE Digital Library on 08 Feb 2010 to 165.124.180.206. Terms of Use: http://spiedl.org/terms
P(c)
(15)
D(c)12
Unfortunately, the scale transform has the property that it is not only invariant to octave scaling, but to scaling by any
factor. Consider the effects of such scaling on a harmonic series built upon two different fundamentals, o and co2,
as might be encountered in speech or in music. Let
s(t) knk
for n
I , 2 . Because they
jkw
t
(16)
can each be built by re-scaling the time axis for a constant frequency o,, , we can write
A,,kei'
Afl.kfJO)fl(kt)
(17)
which is the amplitude-adjusted time-scaled form ofthe signal e' . Therefore, the scale transform of a harmonic
series is a special case of summing a set of scaled signals [see 6, p. 247]
D(c) =
(18)
where
.
—jclnt
E,1(c)
(19)
so that the only difference between the scale transform of complex exponential at the fundamental o and the harmonic series built upon that fundamental is a complex phase and amplitude term. Even worse is the fact that
, so that the harmonic series built upon o2 canjust as readily be built as a series upon o . The implications are that a scale transform not only destroys octave relations (a good thing as far as the chromagram is concemed) but it destroys all pitch class relations within the octave. That is, it eliminates chroma variation altogether!
2 a1
Generalizing the scale transform to time-scale transforms [6, Chapter 19; 7], such as
Js*(e/2t)e_1s(e2t)dt
(20)
does not eliminate this problem. Therefore, while the chromagram requires a form ofweak scaling, the scale invanance imposed by the scale transform appears to be too strong to be useful in forming the chromagram.
3.2. Relationship to Time-Frequency Transformations
Standard time-frequency transformations do not appear to be any more appropriate than the scale transform for capturing the desired properties of the chromagram. Simultaneous time and frequency translation are common goals in
the design of bilinear time-frequency distributions. To achieve them, the kernel (O, t) is designed to depend on 0
and 'r and to be independent of both t and o . Whereas time translation is a useful property for the chromagram, frequency translation is too strong and must be weakened for translates beyond the octave. Furthermore, the linear structure of frequency translation poses problems given the logarithmic nature of the semitone organization of the octave.
Within the octave, we desire translates that preserve the ratios among frequencies rather than their differences.
3.3. Chroma Representation: Non-uniform Sampling
The failure for chroma to be well-represented by either scale or frequency suggests that we consider more general sig-
nal representation techniques for designing the distribution. It is at this stage that a choice must be made about
whether we want a one-to-one mapping of a signal and its chroma representation, or whether an alternative many-toone mapping is appropriate. In the case of scale, for example, an Hermitian operator can be defined which captures
the intrinsic notions surrounding scale. As such, the transform that results is a one-to-one mapping.
640
Downloaded from SPIE Digital Library on 08 Feb 2010 to 165.124.180.206. Terms of Use: http://spiedl.org/terms
in contrast to scale, we may not be working with a concept that is amenable to unitary transformation for the case of
chroma and the chromagram. In its most basic form, the mapping from frequency to chroma is many-to-one. Given
the expansion of frequency, it follows that
f:= 2c+h
logj= c+h
logj—[1og[j =
(21)
c
where L J is the floor function. This functional form shows that once the proper equivalence class for a given frequency is found, we have lost the information about the tone height of that frequency.
The mapping in (21) is suggestive of a type ofnon-uniform sampling. Note that c ranges from 0 to I so that we have
remapped the frequency line into an interval of length one. We know from conformal maps on the complex plane that
this re-mapping can either be one-to-one, as in bilinear mappings, or many-to-one. It is clear from the notion of
octave equivalence that chroma is a many-to-one type of mapping which is similar to the mapping of the complex
plane that obtains when sampling a continuous-time signal uniformly in time. The difference is that unlike the case of
uniform sampling in time, which tiles the complex plane into equal widths oflinear frequency, chroma tiles the com-
plex plane into equal widths of logf.
In most signal processing applications, sampling requires some form ofpre-processing ofthe signal to avoid aliasing.
In the present case, however, it is exactly the aliasing of higher octave components into a base octave range that is of
interest in forming the chroma equivalence class. Therefore, no prior processing of the signal is warranted and the
aggregation function G in (3) reduces to summation.
3.4. Door #1 : Octave-aliased signals
Following the argument in the preceding section, we propose to integrate the structure of non-linear sampling (and
the aliasing that results) with time-frequency distributions to form the chromagram. We utilize the compression operator to achieve the desired non-linear sampling of the signal through aliasing. That is, we form a aliased signal Sai(t)
through the operator
Sai(t)
=r
jkln2C
s(t)
(22)
s(2kt)
In general, the boundedness of the operator is determined by the range of k in (22). Thus, a finite bound on the operator requires that k be bounded by a finite kmin and kmax
Using the Wigner distribution as a starting point, a distribution with chromagram-like properties appears as
s(t, c;h) = JSal*(t + t/2)e2
Sai(t — t/2)d'r
h h+1 1 . In
where the integration is performed with respect to frequencies over a fixed octave [2 , 2
of octave must guarantee that all scales over the range kmjn to kmax are represented.
(23)
practice, the choice
Expanding (23)
641
Downloaded from SPIE Digital Library on 08 Feb 2010 to 165.124.180.206. Terms of Use: http://spiedl.org/terms
s(t, c;h) =
S(S*(2(t
+ (t/2))))dt
(t/2))))e12 ÷h([
h
+ (2t)/2)ei2
=
m n
= Js*(2mt - (2mt)/2)S(ft + (2t)/2)e2
mn
— 5s*(2t (2t)/2)s(2t+ (2t)/2)e2
+h
d't
(24)
h
F2
$5*(2mt (2mt)/2)S(2t + (2t)/2)e j2
dt
shows that the candidate chromagram is the sum of the Wigner distributions for multiple-octave-scaled version of the
original signal (see the first summand in the final result) and the cross-Wigner distributions for all pair-wise combinations of these multiple-octave-scaled versions (see the second summand in the final result). Given the linear-scaling
properties ofthe Wigner distribution, we obtain
+h
— (2"t)/2)sJs(2't + (2't)/2)ei2
tdt =
n c+h—n
: W(2 t, 2
(25)
)
n
The second summand can also be further simplified by using the linear-scaling properties ofthe Cross-Wigner distri-
bution.
The result in (25) is important in that it reveals a potential problem with this particular candidate solution. imagine
that the signal consists of two sinusoids an octave apart in frequency each of which is modulated in frequency by the
same sinusoidal function. The (idealized) time-frequency content ofthis signal is
+
. ((2TCfmt) + 2k2C0)) +
s(t, 2c h) = S(2c 1: — sin
(2c
+h
sin((2icfmt) +
2k + 1
(26)
2C0))
According to (25), we want to re-scale both time and frequency. For example, consider the octave re-scaling that
maps the component at 2' 2C0 in (26) down to 220.
kc
c+h+1
s(t/2,2Cht) = S(2C+Ii 1—sin((2Efm (t/2))+2 2 ))+ö(2
k+1
—sin((2icfm(t/2))+2
2))
(27)
Under the aliasing step, the second term on the right hand side of(27) will add to the first term of(26) to yield
&(2
c+h
.
k
C
—sm((2EJt)+2 2 O))+(2C±hsin((2f(t/2))+2k2O))
(28)
This latter expression shows the effect ofscaling on both the frequency and the time ofthe representation. it is clearly
not the desired result for the chromagram. From a perceptual standpoint, the pitch of the signal in (26) has the same
modulation rate in time regardless of the absolute frequency at which it is played. This is not the case in (28) which
suggests that the octave mapping changes the modulation rate by a factor of 2. Therefore, the simultaneous warping
of both the frequency and time axis is an outcome we do not desire; only the frequency axis should be aliased'.
3.5. Door #2: Operator Scaling
Examining (24), we see that the scaling ofboth time and frequency arise from the scaling ofthe signals in (23). Of the
two variables in the instantaneous autocorrelation function of the signal, time and lag, the derivation shows that scaling the time variable leads to the unwanted scaling of any modulation functions along the time-frequency surface. It
also shows that time and lag do not mix in the set-up of the transform.
a. In his paper on the Melhn wavelet transform, Nelson [71 also had to deal with a similar problem of dc-phasing under scale
transformation.
642
Downloaded from SPIE Digital Library on 08 Feb 2010 to 165.124.180.206. Terms of Use: http://spiedl.org/terms
We can introduce a scaling of the lag variable alone by applying the scaling operator to the Fourier kernel eIt,
rather than to the time signals, at all appropriate lag scales. Accordingly,
s(t, c)
= fs*(t (t/2))[dkIn2Ce_J2
=
1
5s*(t_ (t/2))(
-
t)s(t +
k 12c+h(2kt)
2e
s(t+t/2)dt
e
=
Js*(t (t/2))(
=
iJJs*(t (t/2))e'2
)s(t + t/2)dt
(29)
+k
t/2)d-r
=
The resulting expression in (29) has a very simple interpretation: chromagram structure can be induced on the Wigner
distribution by aliasing the Fourier kernel onto a common chroma range. The computation reduces to performing a
weighted sum of frequency-scaled Wigner distributions and requires the calculation of a single Wigner distribution
from the signal.
3.6. Generalizations
Like the Wigner distribution, the present form ofthe chromagram can be viewed as a starting point for further manipulation by two-dimensional convolution operators, much like Cohen's class of time-frequency distributions. By lin-
earity, the summation passes through the linear transformations of the Wigner distribution to result in the
chromagram as a sum of frequency-scaled TFDs of the given waveform. In particular, we have considered both the
modal distribution [8] and the spectrogram in analyzing the chroma structures of various acoustic signals.
4. Discrete Implementation Issues
When working with sampled signals, the resulting time-frequency distribution is no longer defined on a plane but,
instead, on a regular lattice in frequency and time. The linear structure of this regular lattice raises problems in performing the octave down- and up-sampling operations required in (29). In our research, we have considered two general approaches to handling the aliasing operations.
4.1. Refining the lattice
Lattice refinement along the frequency axis requires some form of interpolation. Among those we have explored are
spline interpolants, as well as other wavelet classes, and standard variations of Fourier interpolants. In general, such
interpolants can be computationally costly, but when little else is specified about the signal class under study, these
techniques do provide the necessary substrate to alleviate many ofthe problems associated with regular lattices.
4.2. Re-localizing features in a sparse time-frequency surface
Much ofour application ofthe chromagram concerns signal sets that could be characterized as spectrally sparse. Such
signals are those that give rise to strong percepts ofpitch through a reasonably complete set of harmonics. In these
cases, the time-frequency surfaces are relatively simple in structure: the location and height of ridges in the time-frequency surface convey information about the instantaneous frequency and amplitude of the harmonics, respectively.
In addition, regions of the surface lieing outside the ridges bear relatively little information pertaining to the signals.
The result is that one may be most interested in the chrornagram of the ridge structure, rather than that of the entire
time-frequency surface. Re-localization refers to the process ofrefining the frequency and amplitudes ofthe time-frequency ridges. We utilize a local centroid and averaging technique to refme both the frequency and the amplitude of
each ridge, and then compute the chroma value for the refined frequency estimate.
A potential drawback with this approach is its reliance on the accuracy of the peak detection algorithm used in identifying the ridges. A simple alternative is to re-localize the entire lattice. While this introduces bias in the frequency
643
Downloaded from SPIE Digital Library on 08 Feb 2010 to 165.124.180.206. Terms of Use: http://spiedl.org/terms
estimates in the neighborhood of the ridge, it will not create a false peak in the aliased chroma spectrum at that time
slice. Since most interest is on concentrations of signal energy, experience suggests that the bias effects are negligible
in evaluating a signal's chroma strength.
S. EXAMPLE
We have been interested in the use of the chromagram to
aid singer's in visualizing pitch contours during singing.
Figure One illustrates the relevant features of the chro-
magram for a 5-note descending scale vocalise by a
female student soprano at the University of Michigan. in
addition to lowering the pitch for each note, the student
also transited through the five cardinal vowels. Time, for
S
S
S
SS
-55 555
----555
S
--
each panel, lies along the abscissa, whereas relative
amplitude (top panel), frequency (middle panel), and
chroma (bottom panel) lie along the ordinate.
The top panel illustrates the temporal envelope of the
musical passage, whereas the middle and bottom panels
show the spectrogram and chromagram, respectively.
The chromagram was computed directly from the spectrogram using the re-localization technique on the entire
time-frequency image and then thresholded to a 15 dB
dynamic range. in our studies, we have found that refinFigure One. Three separate visualizations of a cardinaling the instantaneous frequency based on spectrogram
vowel vocalise on a descending 5-note scale by a female
information is particularly sensitive to the choice and
student soprano are shown: time envelope (top), spectro- length of window. The present case utilized a 304-pt.
gram (middle), and chromagram (bottom).
Bartlett window with zero-padding to 512 points for an 8
kHz sampling rate. For an isolated sinusoid, this choice
of parameters yields a resolution on the order of 0.0001 Hz [9].
As can be seen, the spectrogram does a good job of showing the variations in formant location as a function of sung
vowel. The chromagram shows several ridges in parallel, corresponding to the first three or four chroma values asso-
ciated with a harmonic series [5]. As the singer descends through the five pitches, the configuration of ridges is
observed to drop, wrap-around the octave, and then continue to descend. in addition, the singer's vibrato is clearly
evident by the rapid modulation in frequency about each of the five pitches. Variations in the strength of the vibrato
are also readily observed, both as a function of time-of-pitch onset and vowel type.
6. CONCLUSIONS
We have applied the chromagram to a number of different acoustical problems. Our greatest success has been when
the signals are strongly dominated by pitch, in which case the chromagram results in a substantially less complicated
portrait of the signal. In all of these cases, we utilized an ad hoc, but effective, method for calculating the chromagram. The present paper has developed mathematical arguments for why this ad hoc method is appropriate as a representation of the chromagram.
7. ACKNOWLEDGMENTS
This research was supported by grants from the Ford Motor Company, the Office ofNaval Research (through the auspices ofthe Naval Submarine Medical Research Laboratory), and the Office ofthe Vice President for Research at the
University of Michigan (through the auspices ofthe MusEn Project).
8. REFERENCES
1. R. Shepard, "Circularity injudgements ofrelative pitch," J. Acoust. Soc. Am., 36, pp. 2346-2353, 1964.
2. R. D. Patterson, "Spiral detection ofperiodicity and the spiral form ofmusical scales," Psychology ofMusic, 14,
pp. 44-61, 1986.
3. 5. Duanmu, G. H. Wakefield, Y. Hsu, 5, Qui, R. C. Guevara, "Taiwanese Putonghua Corpus (TWPTH) Speech and
Transcripts," Linguistic Data Consortium, 1998.
644
Downloaded from SPIE Digital Library on 08 Feb 2010 to 165.124.180.206. Terms of Use: http://spiedl.org/terms
4. G. H. Wakefield, "The mathematical implications of a pulse-ribon perceptual organization of pitch," Proc. of Intl.
Comp. Music Conf, Hong Kong, August, 1996.
5. G. H. Wakefield, "Time-Pitch Representations: Acoustical Signal Processing and Auditory Representations," IEEE
intl. Symposium on lime-Frequency Time-Scale, Pittsburgh, 1998.
6. L. Cohen, lime-Frequency Analysis, Prentice Hall, Englewood Cliffs, NJ. 1995.
7. D. Nelson, "The Mellin-wavelet transform," Proc. ICASSP-95, pp. 1101-1104, Detroit, May, 1995.
8. W. J. Pielemeier and G. H. Wakefield, "A high-resolution time-fequency representation for musical instrument signals," Journal of the Acoustical Society ofAmerica, 99(4), pp. 2382-2396. 1996.
9. G. H. Wakefield, "Chromagram visualization of the singing voice," Proc. of the International Workshop On Models and Analysis of Vocal Emissions for Biomedical Applications, Firenze, Italy, September, 1999.
645
Downloaded from SPIE Digital Library on 08 Feb 2010 to 165.124.180.206. Terms of Use: http://spiedl.org/terms
Download