Beat Tracking based on Multiple-agent Architecture
--- A Real-time Beat Tracking System for Audio Signals ---

From: Proceedings of the Second International Conference on Multiagent Systems. Copyright © 1996, AAAI (www.aaai.org). All rights reserved.

Masataka Goto and Yoichi Muraoka
School of Science and Engineering, Waseda University
3-4-1 Ohkubo, Shinjuku-ku, Tokyo 169, JAPAN
{goto, muraoka}@muraoka.info.waseda.ac.jp
Abstract

This paper presents an application of multiple-agent architecture to beat tracking for musical acoustic signals. Beat tracking is an important initial step in computer understanding of music and is useful in various multimedia applications. Most previous beat-tracking systems dealt with MIDI signals and were not based on a multiple-agent architecture. Our system can recognize, in real time, temporal positions of beats in real-world audio signals that contain sounds of various instruments. Our application of multiple-agent architecture enables the system to handle ambiguous situations in interpreting real-world input signals and to examine multiple hypotheses of beat positions in parallel. Even if some agents lose track of the beat, other agents will maintain the correct hypothesis. Each agent is able to interact with other agents to track beats cooperatively, self-evaluate the reliability of its hypothesis on the basis of the current input situation, and adapt to the current situation in order to maintain the correct hypothesis. These agents have been implemented on different processing elements in a parallel computer. Our experimental results show that the system is robust enough to handle audio signals sampled from commercially distributed compact discs of popular songs.
Introduction

Multiple-agent architectures have recently been applied in various domains. This paper describes our application of multiple-agent architecture to beat tracking for musical acoustic signals. In our formulation, beat tracking means tracking the temporal positions of quarter notes, just as people keep time to music by hand-clapping or foot-tapping. Various ambiguous situations occur when a system interprets real-world input audio signals like those sampled from compact discs. Multiple-agent architecture has the advantage of interpreting those signals and tracking beats in various ways, because different agents can examine multiple hypotheses of beat positions in parallel according to different strategies. The main contribution of this paper is to show that such a multiple-agent architecture is actually useful and effective for a practical real-world application, namely beat tracking.
Beat tracking is an important initial step in computer emulation of human music understanding, since beats are fundamental to the perception of Western music. A person who cannot completely segregate and identify every sound component can nevertheless track musical beats. It is almost impossible to understand music without perceiving beats, since the beat is the basic unit of the temporal structure of music. Moreover, musical beat tracking is itself useful in various applications, such as video editing, audio editing, stage lighting control, and music-synchronized CG animation (Goto & Muraoka 1994). We therefore first build a computational model of beat perception and then extend the model, just as a person recognizes higher-level musical events on the basis of beats.
Various beat-tracking related systems have been undertaken in recent years (Dannenberg & Mont-Reynaud 1987; Desain & Honing 1989; Allen & Dannenberg 1990; Driesse 1991; Rosenthal 1992; Desain & Honing 1994; Vercoe 1994; Large 1995). Some previous systems (Allen & Dannenberg 1990; Rosenthal 1992) have maintained multiple hypotheses to track beats, and an earlier paper (Rosenthal, Goto, & Muraoka 1994) has presented the advantages of the strategy of pursuing multiple hypotheses. Most of the systems maintaining multiple hypotheses, however, were not based on a multiple-agent architecture. The one described in (Allen & Dannenberg 1990) examined two or three hypotheses by beam search and tracked beats in real time; however, it dealt only with MIDI signals as its input and was not able to deal with audio signals played on several musical instruments. Another MIDI-based system (Rosenthal 1992) maintained a number of hypotheses that were periodically ranked and selected. Those hypotheses were examined sequentially, and the system did not work in real time.
We built a beat tracking system that processes real-world audio signals containing the sounds of various instruments and that recognizes the temporal positions of beats in real time. Our system is based on a multiple-agent architecture in which multiple hypotheses are maintained by programmatic agents using different beat-tracking strategies. Because the input signals are examined from the viewpoints of these various agents, various hypotheses can emerge. Agents that pay attention to different frequency ranges, for example, may track different beat positions. This multiple-agent architecture enables the system to cope with difficult beat-tracking situations: even if some agents lose track of beats, the system will track beats correctly as long as other agents maintain the correct hypothesis.

Each agent is capable of interaction, self-evaluation, and adaptation. In making a hypothesis, the agent interacts with other agents to track beats cooperatively. Each agent then
evaluates the reliability of its own hypothesis on the basis of the current input situation, and the most reliable hypothesis is considered the final output. If the reliability of a hypothesis becomes high enough, the agent tries to adapt to the current situation by adjusting a parameter that controls its strategy in order to maintain the correct hypothesis.
To perform this computationally intensive task in real time, the system has been implemented on a parallel computer, the Fujitsu AP1000. Each agent and frequency-analysis module has been implemented on a different processing element. In our experiment with audio signals sampled from compact discs, the system correctly tracked beats in 34 out of 40 popular songs that did not include drum-sounds. This result shows that our beat-tracking model based on a multiple-agent architecture is robust enough to handle real-world audio signals.
Multiple-agent Architecture for Beat Tracking

In this section we specify the beat tracking problem that we are dealing with and present the main difficulties of tracking beats: ambiguity of interpretation and the need for context-dependent decisions, difficulties which are common to other real-world perceptual problems. We then describe the multiple-agent architecture we use to address the beat tracking problem, defining our agents and outlining their interaction.
Beat Tracking Problem
In our formulation, beat tracking is defined as a process that organizes music into almost regularly spaced beats corresponding to quarter notes. Our beat tracking problem is thus to obtain an appropriate sequence of beat times (temporal positions of beats) that corresponds to input musical audio signals (Figure 1). This sequence of beat times is called the quarter-note level. We also address the higher-level beat tracking problem of determining whether a beat is strong or weak (beat type)¹ under the assumption that the time-signature of an input song is 4/4. This is the problem of tracking beats at the half-note level.
There are various difficulties in tracking the beats in real-world musical acoustic signals. The simple technique of peak-finding with a threshold is not sufficient, since there are many energy peaks that are not directly related to beats. Multiple interpretations of beats are possible at any given point, because there is not necessarily a single specific sound that directly indicates the beat position; the beat is a perceptual concept that a human feels in music. There are various ambiguous situations, such as ones where several events obtained by frequency analysis may correspond to a beat and where different inter-beat intervals (the temporal difference between two successive beats) seem plausible. In addition, higher-level processing using musical knowledge is necessary for making context-dependent decisions, such as determining whether a beat is strong or weak and evaluating which is the best interpretation in an ambiguous situation.
Our solution to the problem of handling ambiguous situations is to maintain multiple hypotheses, each of which corresponds to a provisional or hypothetical interpretation of the input. A real-time system using only a single hypothesis is subject to garden-path errors. A multiple-hypothesis system can pursue several paths simultaneously, and later decide which one was correct. In other words, in real-time beat tracking these hypotheses represent the results of predicting the next beat in different ways, and it is impossible to know in advance which one will be correct (because future events are not available).
Multiple-agent Architecture

To examine multiple hypotheses in parallel, we use a multiple-agent architecture in which agents with different strategies interact through cooperation and competition to track beats (Figure 2). Several definitions of the term agent have been proposed (Minsky 1986; Maes 1990; Shoham 1993; CACM 1994; Nakatani, Okuno, & Kawabata 1994; ICMAS 1995). In our terminology, the term agent means a software component that satisfies the following three requirements:
1. the agent interacts with other agents to perform a given task,
2. the agent evaluates its own behavior on the basis of the input,
3. the agent adapts to the input by adjusting its own behavior.
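These three requirements can be sketched as a minimal abstract interface; the class and method names below are ours, for illustration only, and do not appear in the system:

```python
from abc import ABC, abstractmethod


class BeatAgent(ABC):
    """Illustrative skeleton of an agent satisfying the three requirements."""

    @abstractmethod
    def interact(self, partner: "BeatAgent") -> None:
        """Requirement 1: cooperate/compete with other agents on the task."""

    @abstractmethod
    def evaluate(self, observation) -> float:
        """Requirement 2: self-evaluate behavior against the input; return a reliability."""

    @abstractmethod
    def adapt(self, observation) -> None:
        """Requirement 3: adjust own strategy parameters to the input."""
```

A concrete agent subclasses this interface and fills in its own strategy.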
Each agent maintains a beat-position hypothesis, which consists of a predicted next-beat time, its beat type (strong

¹ In this paper, a strong beat is either the first or third quarter note in a measure; a weak beat is the second or fourth.
Figure 1: Beat tracking problem.

Figure 2: Multiple hypotheses maintained by multiple agents.
Figure 3: Interaction between agents through a prediction field.
or weak), and the current inter-beat interval. In making the hypothesis, the agent interacts with other agents to perform the beat-tracking task (the first requirement). All agents are grouped into pairs that have different strategies for beat tracking. Each agent in a pair examines the same inter-beat interval using the same frequency-analysis results. To predict the next beat times cooperatively, one agent interacts with the other agent in the same pair through a prediction field. The prediction field is an expectancy curve² that represents when the next beat is expected to occur (Figure 3). The height of each local peak in the prediction field can be interpreted as the next beat-position possibility. The two agents interact with each other by inhibiting the prediction field in the other agent: the beat time of each hypothesis inhibits the temporally corresponding neighborhood in the other's field (Figure 3). This enables one agent to track the correct beats even if the other agent tracks the middle of two successive correct beats (which compensates for one of the typical tracking errors).
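The mutual inhibition between paired agents can be sketched as follows. The choice of zeroing a fixed neighborhood is our simplification; the paper does not specify the exact shape of the inhibition:

```python
def inhibit(field, beat_time, radius):
    """Suppress the neighborhood of the partner hypothesis's beat time.

    field: prediction-field heights indexed by frame-time;
    beat_time: index of the partner's hypothesized beat;
    radius: half-width of the inhibited region (illustrative).
    Returns a new field with that neighborhood zeroed out.
    """
    out = list(field)
    lo = max(0, beat_time - radius)
    hi = min(len(out), beat_time + radius + 1)
    for t in range(lo, hi):
        out[t] = 0.0  # simplest choice: zero the temporally corresponding region
    return out
```

Each agent would apply `inhibit` to its own prediction field using the beat time held by its pair partner, so the two agents are pushed toward distinct beat positions.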
Each agent is able to evaluate its own hypothesis, using musical knowledge, according to the input acoustic signals (the second requirement). We call the quantitative result of this self-evaluation the reliability of the hypothesis. The final beat-tracking result is determined on the basis of the most reliable hypothesis, which is selected from the hypotheses of all agents.

Each agent also adapts to the current input by adjusting its own strategy parameter (the third requirement). If the reliability of a hypothesis becomes high enough, the agent tunes a parameter to narrow the range of possible inter-beat intervals so that it examines only a neighborhood of the current appropriate one. This enables the agent to maintain the hypothesis that has the inter-beat interval appropriate to the current input.
System Description

The system for musical audio signals without drum-sounds³ assumes that the time-signature of an input song is 4/4 and that its tempo is constrained to be between 61 M.M.

² Other systems (Desain 1992; Desain & Honing 1994; Vercoe 1994) have used a similar concept of an expectancy curve for predicting future events, but not as a means for managing interaction among agents.
³ A detailed description of our beat-tracking system for audio signals that include drum-sounds is presented in (Goto & Muraoka 1995a; 1995b).
Figure 4: Processing model.

Figure 5: Overview of our beat tracking system.
(Mälzel's Metronome: the number of quarter notes per minute) and 120 M.M., and is roughly constant. The emphasis in our system is on finding the temporal positions of quarter notes in audio signals rather than on tracking tempo changes. The system maintains, as the real-time output, a description called beat information (BI) that consists of the beat time, its beat type, and the current tempo.
Figure 4 is a sketch of the processing model of our beat tracking system, and Figure 5 shows an overview of the system. The two main stages of processing are Frequency Analysis, in which several cues used by agents are detected, and Beat Prediction, in which multiple hypotheses of beat positions are examined by multiple agents. Since accurate onset times are indispensable for tracking beats, in the Frequency Analysis stage the system uses multiple onset-time finders that detect onset times in several different frequency ranges. Those results are transformed into a vectorial representation (called onset-time vectors) by several onset-time vectorizers. In the Beat Prediction stage, the system manages multiple agents that, according to different strategies, make parallel hypotheses based on these onset-time vectors. Each agent first calculates the inter-beat interval and predicts the next beat time; it then infers the beat type by communicating with a higher-level checker (described later), and evaluates the reliability of its own hypothesis. The manager gathers all hypotheses and then determines the final output on the basis of the most reliable one. Finally, the system transmits BI to other application programs via a computer network.

The following sections describe the main stages of Frequency Analysis and Beat Prediction.
Frequency Analysis

In the Frequency Analysis stage, the frequency spectrum and several sequences of n-dimensional onset-time vectors are obtained for later processing (Figure 6). The full frequency band is split into several frequency ranges, and each dimension of the onset-time vectors corresponds to a different frequency range. This representation makes it possible to consider onset times of all the frequency ranges at the same time. Each sequence of onset-time vectors is obtained using a different set of weights for frequency ranges. One sequence, for example, focuses on middle frequency ranges, and another sequence focuses on low frequency ranges.
Fast Fourier Transform (FFT)  The frequency spectrum (the power spectrum) is calculated with the FFT using the Hanning window. Each time the FFT is applied to the digitized audio signal, the window is shifted to the next frame.
Figure 6: An example of frequency spectrum and an onset-time vector sequence.
In our current implementation, the input signal is digitized at 16 bit / 22.05 kHz, and two kinds of FFT are calculated. One FFT, for extracting onset components in the Frequency Analysis stage, is calculated with a window size of 1024 samples (46.44 ms), and the window is shifted by 256 samples (11.61 ms). The frequency resolution is consequently 21.53 Hz and the time resolution (1 frame-time) is 11.61 ms. The frame-time is the unit of time used in our system, and the term time in this paper is defined as the time measured in units of the frame-time. The other FFT, for examining chord changes in the Beat Prediction stage, is simultaneously calculated on audio down-sampled at 16 bit / 11.025 kHz with a window size of 1024 samples (92.88 ms), and the window is shifted by 128 samples (11.61 ms). The frequency and time resolutions are consequently 10.77 Hz and 1 frame-time.
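The stated resolutions follow directly from the sampling rates, window sizes, and hop sizes; a quick check of the arithmetic (the function name is ours):

```python
def stft_resolution(fs_hz, window, hop):
    """Frequency resolution (Hz) and time resolution (s) of one STFT setting."""
    return fs_hz / window, hop / fs_hz

# Onset-extraction FFT: 22.05 kHz input, 1024-sample window, 256-sample hop
f_res, t_res = stft_resolution(22050, 1024, 256)
# about 21.53 Hz and 11.61 ms (one frame-time)

# Chord-change FFT: down-sampled 11.025 kHz, 1024-sample window, 128-sample hop
f_res2, t_res2 = stft_resolution(11025, 1024, 128)
# about 10.77 Hz, with the same 11.61 ms frame-time
```

Halving both the sampling rate and the hop keeps the two analyses aligned on the same frame-time grid while doubling the frequency resolution per bin.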
Extracting Onset Components  Frequency components whose power has been rapidly increasing are extracted as onset components. The onset components and their degree of onset (rapidity of increase in power) are obtained from the frequency spectrum by a process that takes into account the power present in nearby time-frequency regions. More details on the method of extracting onset components can be found in (Goto & Muraoka 1995a).
Onset-time Finders  Multiple onset-time finders (seven in our current implementation) detect onset times in several different frequency ranges (0-125 Hz, 125-250 Hz, 250-500 Hz, 500 Hz-1 kHz, 1-2 kHz, 2-6 kHz, and 6-11 kHz). Each onset time is given by the peak time found by peak-picking in D(t) along the time axis, where D(t) = Σf d(t, f) and d(t, f) is the degree of onset of frequency f at time t. Limiting the range of frequency for the summation of D(t) makes it possible to find onset times in the different frequency ranges.
Onset-time Vectorizers  Each onset-time vectorizer transforms the results of all onset-time finders into sequences of onset-time vectors: the same onset times in all the frequency ranges are put together into a vector. In the current system, three vectorizers transform onset times from seven finders into three sequences of seven-dimensional onset-time vectors with different sets of frequency weights (focusing on all, low, and middle frequency ranges). These results are sent to agents in the Beat Prediction stage.
Beat Prediction

Multiple agents interpret the sequences of onset-time vectors according to different strategies and maintain their own hypotheses. Musical knowledge is necessary to determine the beat type (strong or weak) and to evaluate which hypothesis is best. For audio signals without drum-sounds, the system utilizes the following musical knowledge:
1. Sounds are likely to occur on beats. In other words, the correct beat times tend to coincide with onset times.
2. Chords are more likely to change at the beginning of measures than at other positions.
3. Chords are more likely to change on beats (quarter-notes) than at other positions between two successive correct beats.
Figure 7: Relations between onset-time vectorizers, agents, and higher-level checkers.
To utilize the second and third kinds of knowledge, each agent communicates with a corresponding higher-level checker, which is a module that provides higher-level information, such as the results of examining the possibility of chord changes according to the current hypothesis (Figure 7). The agent utilizes this information to determine the beat type and to evaluate the reliability of the hypothesis.
Each agent has four parameters that determine its strategy for making the hypothesis (Figure 7), and the settings of these parameters vary from agent to agent. The first parameter, frequency focus type, determines which vectorizer the agent receives onset-time vectors from. This value is chosen from among type-all, type-low, and type-middle, respectively corresponding to vectorizers focusing on all frequency ranges, low frequency ranges, and middle frequency ranges. The second parameter, autocorrelation period, determines the window size for calculating the vectorial autocorrelation of the sequence of onset-time vectors, which is used to determine the inter-beat interval. The greater this value, the older the onset-time information considered. The third parameter, inter-beat interval range, controls the range of possible inter-beat intervals. As described later, this limits the range for selecting a peak in the result of the vectorial autocorrelation. The fourth parameter, initial peak selection, takes a value of either primary or secondary. When the value is primary, the largest peak in the prediction field is initially selected and considered as the next beat time; when the value is secondary, the second largest peak is selected. This helps to obtain a variety of hypotheses.
In our current implementation there are twelve agents grouped into six agent-pairs, and twelve higher-level checkers corresponding to these agents. Initial settings of the strategy parameters are listed in Table 1. As explained in Section Multiple-agent Architecture, the parameter inter-beat interval range is adjusted as the processing goes on.

The following sections describe the formation and management of hypotheses. First, each agent determines the inter-beat interval using autocorrelation; it then interacts with its paired agent through the prediction field that is
Table 1: Initial settings of the strategy parameters.

pair-agent  frequency focus type  autocorrelation period  inter-beat interval range  initial peak selection
1-1         type-all              500 f.t.                43-85 f.t.                 primary
1-2         type-all              500 f.t.                43-85 f.t.                 secondary
2-1         type-all              1000 f.t.               43-85 f.t.                 primary
2-2         type-all              1000 f.t.               43-85 f.t.                 secondary
3-1         type-low              500 f.t.                43-85 f.t.                 primary
3-2         type-low              500 f.t.                43-85 f.t.                 secondary
4-1         type-low              1000 f.t.               43-85 f.t.                 primary
4-2         type-low              1000 f.t.               43-85 f.t.                 secondary
5-1         type-middle           500 f.t.                43-85 f.t.                 primary
5-2         type-middle           500 f.t.                43-85 f.t.                 secondary
6-1         type-middle           1000 f.t.               43-85 f.t.                 primary
6-2         type-middle           1000 f.t.               43-85 f.t.                 secondary

"f.t." is the abbreviation of frame-time (= 11.61 ms).
formed using cross-correlation, and predicts the next beat time. Second, the agent communicates with the higher-level checker to infer the beat type and evaluates its own reliability. The checker examines possibilities of chord changes by analyzing the frequency spectrum on the basis of the current hypothesis received from the agent. Finally, the manager gathers all the hypotheses, and the most reliable one is considered as the output.
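The settings of Table 1 form a simple grid over the first two strategy parameters, with the two agents of each pair differing only in initial peak selection. A sketch of generating the twelve configurations (the dictionary layout is ours):

```python
from itertools import product

# Table 1 as a parameter grid (f.t. = frame-time, 11.61 ms)
FOCUS = ["type-all", "type-low", "type-middle"]   # frequency focus type
PERIOD = [500, 1000]                              # autocorrelation period, f.t.
IBI_RANGE = (43, 85)                              # initial inter-beat interval range, f.t.
SELECTION = ["primary", "secondary"]              # initial peak selection

agents = [
    {"pair": p + 1, "agent": a + 1, "focus": focus, "period": period,
     "ibi_range": IBI_RANGE, "selection": SELECTION[a]}
    for p, (focus, period) in enumerate(product(FOCUS, PERIOD))
    for a in range(2)
]
# six pairs, two agents per pair -> twelve agents
```

Each dictionary corresponds to one row of Table 1; the shared `ibi_range` is the value that each agent later narrows as its reliability grows.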
Beat-predicting Agents  In our formulation, beats are characterized by two properties: period (inter-beat interval) and phase. The phase of a beat is the beat position relative to a reference point, usually the previous beat time. We measure phase in radians; for a quarter-note beat, for example, an eighth-note displacement corresponds to a phase-shift of π radians.
Each agent first determines the current inter-beat interval (period) (Figure 8). The agent receives the sequence of onset-time vectors and calculates their vectorial autocorrelation⁴. The windowed and normalized vectorial autocorrelation function Ac(τ) is defined as

    Ac(τ) = [ Σ_{t=c-W}^{c} ( ō(t) · ō(t-τ) ) w(c-t) ] / [ Σ_{t=c-W}^{c} ( ō(t) · ō(t) ) w(c-t) ]        (1)

where ō(t) is the n-dimensional onset-time vector at time t, c is the current time, and W is the strategy parameter autocorrelation period. The window function w(t) is given by

    w(t) = 1.0 - 0.5 t / W        (2)

The inter-beat interval is given by the τ with the maximum height in Ac(τ) within the range limited by the parameter inter-beat interval range.
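Equations (1) and (2) can be transcribed directly. This sketch assumes the onset-time vectors are plain Python lists and guards the t − τ index at the start of the sequence:

```python
def autocorrelation(o, c, W, tau):
    """Windowed, normalized vectorial autocorrelation Ac(tau), equations (1)-(2).

    o: list of n-dimensional onset-time vectors indexed by frame-time;
    c: current time; W: the autocorrelation-period strategy parameter.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def w(t):                     # window function, equation (2)
        return 1.0 - 0.5 * t / W

    num = sum(dot(o[t], o[t - tau]) * w(c - t)
              for t in range(c - W, c + 1) if t - tau >= 0)
    den = sum(dot(o[t], o[t]) * w(c - t) for t in range(c - W, c + 1))
    return num / den if den else 0.0
```

Evaluating this for each τ in the inter-beat interval range and taking the argmax gives the agent's current period estimate.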
To determine the beat phase, the agent then forms the prediction field (Figure 8). The prediction field is the result of calculating the cross-correlation function between the sequence of onset-time vectors and the sequence of beat

⁴ The paper (Vercoe 1994) also proposed using a variant of autocorrelation for rhythmic analysis.
Figure 8: Predicting the next beat.
times whose interval is the inter-beat interval. As mentioned in Section Multiple-agent Architecture, the two agents in the same pair interact with each other by inhibiting the prediction field in the other agent. Each local peak in the prediction field is considered as a possible beat phase. When the reliability of a hypothesis is low, the agent initially selects the peak in the prediction field according to the parameter initial peak selection, and then tries to pursue the peak equivalent to the previously selected one. This calculation corresponds to evaluating all possibilities of the beat phase under the current inter-beat interval. The next beat time is thus predicted on the basis of the inter-beat interval and the current beat phase.
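The cross-correlation that forms the prediction field can be sketched as follows. For brevity this version correlates a scalar onset-strength sequence against a comb of beats spaced by the inter-beat interval, whereas the system correlates onset-time vectors; the parameter names are ours:

```python
def prediction_field(D, ibi, n_beats=4):
    """Expectancy curve over candidate beat phases.

    D: onset strength per frame-time, most recent frame last;
    ibi: current inter-beat interval in frame-times;
    n_beats: how many past beat positions the comb covers (illustrative).
    Returns one expectancy value per candidate phase 0..ibi-1, where phase 0
    means the most recent frame lies exactly on a beat.
    """
    field = []
    for phase in range(ibi):
        s = 0.0
        for k in range(n_beats):
            idx = len(D) - 1 - phase - k * ibi   # comb tooth k beats back
            if idx >= 0:
                s += D[idx]
        field.append(s)
    return field
```

The local peaks of the returned curve are the possible beat phases; the agent picks the largest or second largest according to its initial peak selection parameter.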
The agent receives two kinds of chord change possibilities, at the quarter-note level and at the eighth-note level, by communicating with the higher-level checker. We call the former the quarter-note chord change possibility and the latter the eighth-note chord change possibility. The quarter-note (eighth-note) chord change possibility represents how likely a chord is to change at each quarter-note (eighth-note) position under the current hypothesis.
To infer the beat type, we use the second kind of musical knowledge, which means that the quarter-note chord change possibility is higher on a strong beat than on a weak beat. If the quarter-note chord change possibility is high enough, its time is considered to indicate the position of the strong beat. The following beat type is then determined under the assumption that strong and weak beats alternate (Figure 8).
The agent finally evaluates the reliability of its own hypothesis by using the first and third kinds of musical knowledge. According to the first kind, the reliability is determined according to how well the next beat time predicted on the basis of the onset times coincides with the time extrapolated from the past two beat times (Figure 8). If they coincide, the reliability is increased; otherwise, the reliability is decreased. According to the third kind of knowledge, if the eighth-note chord change possibility is higher on beats than at eighth-note displacement positions, the reliability is increased; otherwise, the reliability is decreased.
Higher-level Checkers  For the audio signals without drum-sounds, each higher-level checker examines two kinds of chord change possibilities according to the hypotheses received from the corresponding agent.

Figure 9: Examples of peaks in sliced frequency spectrum and chord change possibility: (a) examining the quarter-note chord change possibility; (b) examining the eighth-note chord change possibility.
The checker first slices the frequency spectrum into strips at the quarter-note times (beat times) for examining the quarter-note chord change possibility, and slices it at the eighth-note times interpolated from beat times for examining the eighth-note chord change possibility (Figure 9). The checker then finds peaks along the frequency axis in a histogram summed up along the time axis in each strip. These peaks can be considered as the dominant tones' pitches in each strip. Some peaks may be components of a chord, and others may be components of a melody. Our current implementation considers only peaks whose frequency is less than 1 kHz.
The checker evaluates the chord change possibilities by comparing these peaks between adjacent strips. The more peaks occur, and the louder they are, compared with the previous strip, the higher the chord change possibilities. For the quarter-note (eighth-note) chord change possibility, the checker compares the strips whose period corresponds to the quarter-note (eighth-note) duration under the current hypothesis.
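The strip comparison can be sketched as follows. The paper states only the qualitative rule (more and louder peaks than the previous strip raise the possibility), so the particular scoring below is our assumption:

```python
def chord_change_possibility(prev_peaks, cur_peaks):
    """Score how likely a chord changed between two adjacent spectrum strips.

    Each strip is summarized as {frequency_bin: peak_power} from its
    time-summed histogram (peaks below 1 kHz only). New peaks, and peaks
    that got louder, raise the score; the additive scoring is illustrative.
    """
    score = 0.0
    for f, power in cur_peaks.items():
        prev = prev_peaks.get(f, 0.0)
        if power > prev:
            score += power - prev   # new or louder dominant pitch in this strip
    return score
```

Applied to strips sliced at quarter-note (eighth-note) boundaries, this yields the quarter-note (eighth-note) chord change possibility for each position.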
Figure 9 shows examples of the two kinds of chord change possibilities. The horizontal lines above represent peaks in each strip's histogram. The thick vertical lines below represent the chord change possibility. The beginning of a measure comes at every four quarter-notes from the extreme left in (a), and the beat comes at every two eighth-notes from the extreme left in (b).
Hypotheses Manager  The manager classifies all agent-generated hypotheses into groups according to beat time and inter-beat interval. Each group has an overall reliability given by the sum of the reliabilities of the group's hypotheses. The manager then selects the dominant group that has the highest reliability. Since a wrong group could be selected if temporarily unstable beat times split the appropriate dominant group, the manager repeats grouping and selecting three times while narrowing the allowable margin of beat times for belonging to the same group. The reliable hypothesis in the most dominant group is thus selected as the output and sent to the BI Transmission stage.
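The three-pass grouping and selection can be sketched as below. For brevity this version groups by beat time only (the manager also uses the inter-beat interval), and the margin values are illustrative, not the system's:

```python
def select_hypothesis(hyps, margins=(5, 3, 1)):
    """Pick the output hypothesis by repeated grouping with narrowing margins.

    hyps: list of (beat_time, reliability) pairs from all agents.
    Each pass groups hypotheses whose beat times lie within the margin,
    keeps the group with the highest summed reliability, and narrows.
    """
    for margin in margins:
        groups = []
        for bt, rel in sorted(hyps):
            if groups and bt - groups[-1][-1][0] <= margin:
                groups[-1].append((bt, rel))     # close enough: same group
            else:
                groups.append([(bt, rel)])       # start a new group
        # dominant group = highest sum of member reliabilities
        hyps = max(groups, key=lambda g: sum(r for _, r in g))
    # most reliable hypothesis inside the dominant group
    return max(hyps, key=lambda h: h[1])
```

The narrowing passes keep temporarily unstable beat times from splitting the appropriate dominant group on the first, coarsest grouping.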
The manager updates the beat type in the output using only the beat type that was labeled when the quarter-note chord change possibility was high compared with the recent maximum possibility. When the possibility was not high enough, the updated beat type is determined from the previous reliable beat type based on the alternation of strong and weak beats. This enables the system to disregard an incorrect beat type caused by a local irregularity of chord changes.
Implementation on a Parallel Computer
Parallel processing provides a practical and feasible solution to the problem of performing a computationally intensive task, such as processing and understanding complex audio signals, in real time. Our system has been implemented on a distributed-memory parallel computer, the Fujitsu AP1000, which consists of 64 processing elements (Ishihata et al. 1991). A different element or group of elements is assigned to each module, such as the FFT, the onset-time finder, the onset-time vectorizer, the agent, the higher-level checker, and the manager. These modules run concurrently and communicate with each other by passing messages between processing elements.
We use four kinds of parallelizing techniques in order to execute the heterogeneous processes simultaneously (Goto & Muraoka 1996). The processes are first pipelined, and then each stage of the pipeline is implemented with data/control parallel processing, pipeline processing, and distributed cooperative processing. This implementation makes it possible to analyze audio signals in various ways and to manage multiple agents in real time.
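The pipelined, message-passing module structure described above can be illustrated in miniature with threads and queues standing in for AP1000 processing elements and their message passing. The stage functions here are placeholders, not the system's real FFT or onset-finding code.

```python
# Hedged sketch of a message-passing pipeline of concurrent stages, in the
# spirit of the module structure described above.  Threads and queues stand
# in for processing elements and inter-element messages.
import queue
import threading

def stage(func, inbox, outbox):
    """Run one pipeline stage: consume messages, transform, forward."""
    while True:
        msg = inbox.get()
        if msg is None:                  # sentinel: shut the stage down
            outbox.put(None)
            return
        outbox.put(func(msg))

def run_pipeline(funcs, inputs):
    """Connect the stage functions with queues and push inputs through."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, qi, qo))
               for f, qi, qo in zip(funcs, queues, queues[1:])]
    for t in threads:
        t.start()
    for x in inputs:
        queues[0].put(x)
    queues[0].put(None)                  # propagate shutdown down the line
    results = []
    while (msg := queues[-1].get()) is not None:
        results.append(msg)
    for t in threads:
        t.join()
    return results
```

As in the real system, each stage processes its next message while downstream stages work on earlier ones, which is what lets a pipeline keep up with a real-time input stream.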
Experiments and Results
We tested the system on 40 songs without drum-sounds, performed by 28 artists. The initial one or two minutes of each song were used as the input. The inputs were monaural audio signals sampled from commercial compact discs of the popular music genre. Their tempi ranged from 62 M.M. to 116 M.M. and were roughly constant. It is usually more difficult to track beats in songs without drum-sounds than in songs with drum-sounds, because they tend to have fewer sounds which fall on the beat, and musical knowledge is in general difficult to apply.
In our experiment, the system correctly tracked beats (i.e., obtained the beat time and type) in 34 out of the 40 songs in real time5. In each song where the beat was eventually determined correctly, the system initially had trouble determining the beat type, even though the beat time was correct. Within at most fifteen measures of the beginning of the song, however, both the beat time and type had been determined correctly. In most of the mistaken songs, beat times were not obtained correctly because onset times were very few or the tempo fluctuated temporarily. In the other songs, the beat type was not determined correctly because of irregularity of chord changes.
These results show that the system is robust enough to deal with real-world musical signals. We have also developed an application with the system that displays a computer graphics dancer whose motion changes with musical beats in real time (Goto & Muraoka 1994). This application has shown that our system is also useful in multimedia applications in which human-like hearing ability is desirable.
Discussion
Our goal is to build a system that can understand musical audio signals in a human-like fashion. We believe that an important initial step is to build a system which, even in its preliminary implementation, can deal with real-world acoustic signals like those sampled from compact discs. Most previous beat tracking systems, however, had great difficulty working in real-world acoustic environments. Most of these systems (Dannenberg & Mont-Reynaud 1987; Desain & Honing 1989; Allen & Dannenberg 1990; Driesse 1991; Rosenthal 1992) have dealt with MIDI signals as their input. Since it is quite difficult to obtain complete MIDI representations from audio data, MIDI-based systems are limited in their application. Although some systems (Schloss 1985; Katayose et al. 1989) dealt with audio signals, they had difficulty processing music played on ensembles containing a variety of instruments and did not work in real time.
Our strategy of first building a real-time system that works in real-world complex environments and then upgrading the ability of the system is related to the scaling-up problem (Kitano 1993) in the domain of artificial intelligence (Figure 10). As Hiroaki Kitano stated:
experiences in expert systems, machine translation systems, and other knowledge-based systems indicate that
5Our other beat-tracking system for audio signals that include drum-sounds, which is based on a similar multiple-agent architecture, correctly tracked beats in 42 out of the 44 songs that included drum-sounds (Goto & Muraoka 1995b).
Figure 10: Scaling-up problem (Kitano 1993). (The figure plots task complexity against domain size, i.e., closeness to the real world, contrasting toy systems with systems that pay off, useful systems, and intelligent systems.)
scaling up is extremely difficult for many of the prototypes. (Kitano 1993)
In other words, it is hard to scale up a system whose preliminary implementation works only in laboratory environments. We think that our strategy addresses this issue and that the
application of multiple-agent architecture makes the system
robust enough to work in real-world environments.
Some researchers might regard several modules in our system as agents, such as the onset-time finders, the onset-time vectorizers, the higher-level checkers, and the manager. In our terminology, however, we define the term agent as a software component of distributed artificial intelligence that satisfies the three requirements presented in the section Multiple-agent Architecture. We therefore do not consider those modules agents: they are simply concurrent objects.
Conclusion
We have presented a multiple-agent architecture for beat tracking and have described the configuration and implementation of our real-time beat tracking system. Our system tracks beats in audio signals containing sounds of various instruments and reports beat information in time to the input music. The experimental results show that the system is robust enough to handle real-world audio signals sampled from compact discs of popular music.
The system manages multiple agents that track beats according to different strategies in order to examine multiple hypotheses in parallel. This enables the system to follow beats without losing track of them, even if some hypotheses become wrong. Each agent can interact with other agents to track beats cooperatively and can evaluate its own hypothesis according to musical knowledge. Each agent can also adapt to the current input by adjusting its own strategy. These abilities make it possible for the system to handle ambiguous situations by maintaining various hypotheses, and they make the system robust and stable.
We plan to upgrade the system to make use of other higher-level musical structure and to generalize it to other musical genres. Future work will include application of the multiple-agent architecture to other perceptual problems, as well as a study of more sophisticated interaction among agents and of a more dynamic multiple-agent architecture in which the total number of agents is not fixed.
Acknowledgments
We thank David Rosenthal for his helpful comments on earlier drafts of this paper. We also thank Fujitsu Laboratories Ltd. for use of the AP1000.
References
Allen, P. E., and Dannenberg, R. B. 1990. Tracking musical beats in real time. In Proc. of the 1990 Intl. Computer Music Conf., 140-143.
CACM. 1994. Special issue on intelligent agents. Communications of the ACM 37(7):18-147.
Dannenberg, R. B., and Mont-Reynaud, B. 1987. Following an improvisation in real time. In Proc. of the 1987 Intl. Computer Music Conf., 241-248.
Desain, P., and Honing, H. 1989. The quantization of musical time: A connectionist approach. Computer Music Journal 13(3):56-66.
Desain, P., and Honing, H. 1994. Advanced issues in beat induction modeling: syncopation, tempo and timing. In Proc. of the 1994 Intl. Computer Music Conf., 92-94.
Desain, P. 1992. Can computer music benefit from cognitive models of rhythm perception? In Proc. of the 1992 Intl. Computer Music Conf., 42-45.
Driesse, A. 1991. Real-time tempo tracking using rules to analyze rhythmic qualities. In Proc. of the 1991 Intl. Computer Music Conf., 578-581.
Goto, M., and Muraoka, Y. 1994. A beat tracking system for acoustic signals of music. In Proc. of the Second ACM Intl. Conf. on Multimedia, 365-372.
Goto, M., and Muraoka, Y. 1995a. Music understanding at the beat level: real-time beat tracking for audio signals. In Working Notes of the IJCAI-95 Workshop on Computational Auditory Scene Analysis, 68-75.
Goto, M., and Muraoka, Y. 1995b. A real-time beat tracking system for audio signals. In Proc. of the 1995 Intl. Computer Music Conf., 171-174.
Goto, M., and Muraoka, Y. 1996. Parallel implementation of beat tracking system: real-time musical information processing on AP1000 (in Japanese). Transactions of Information Processing Society of Japan 37(7):1460-1468.
ICMAS. 1995. Proc. of the First Intl. Conf. on Multi-Agent Systems. The AAAI Press / The MIT Press.
Ishihata, H.; Horie, T.; Inano, S.; Shimizu, T.; and Kato, S. 1991. An architecture of highly parallel computer AP1000. In IEEE Pacific Rim Conf. on Communications, Computers, Signal Processing, 13-16.
Katayose, H.; Kato, H.; Imai, M.; and Inokuchi, S. 1989. An approach to an artificial music expert. In Proc. of the 1989 Intl. Computer Music Conf., 139-146.
Kitano, H. 1993. Challenges of massive parallelism. In Proc. of IJCAI-93, 813-834.
Large, E. W. 1995. Beat tracking with a nonlinear oscillator. In Working Notes of the IJCAI-95 Workshop on Artificial Intelligence and Music, 24-31.
Maes, P., ed. 1990. Designing Autonomous Agents: Theory and Practice from Biology to Engineering and Back. The MIT Press.
Minsky, M. 1986. The Society of Mind. Simon & Schuster, Inc.
Nakatani, T.; Okuno, H. G.; and Kawabata, T. 1994. Auditory stream segregation in auditory scene analysis. In Proc. of AAAI-94, 100-107.
Rosenthal, D.; Goto, M.; and Muraoka, Y. 1994. Rhythm tracking using multiple hypotheses. In Proc. of the 1994 Intl. Computer Music Conf., 85-87.
Rosenthal, D. 1992. Machine Rhythm: Computer Emulation of Human Rhythm Perception. Ph.D. Dissertation, Massachusetts Institute of Technology.
Schloss, W. A. 1985. On the Automatic Transcription of Percussive Music: From Acoustic Signal to High-Level Analysis. Ph.D. Dissertation, CCRMA, Stanford University.
Shoham, Y. 1993. Agent-oriented programming. Artificial Intelligence 60(1):51-92.
Vercoe, B. 1994. Perceptually-based music pattern recognition and response. In Proc. of the Third Intl. Conf. for the Perception and Cognition of Music, 59-60.