From: Proceedings of the Second International Conference on Multiagent Systems. Copyright © 1996, AAAI (www.aaai.org). All rights reserved.

Beat Tracking based on Multiple-agent Architecture:
A Real-time Beat Tracking System for Audio Signals

Masataka Goto and Yoichi Muraoka
School of Science and Engineering, Waseda University
3-4-1 Ohkubo, Shinjuku-ku, Tokyo 169, JAPAN
{goto, muraoka}@muraoka.info.waseda.ac.jp

Abstract

This paper presents an application of multiple-agent architecture to beat tracking for musical acoustic signals. Beat tracking is an important initial step in computer understanding of music and is useful in various multimedia applications. Most previous beat-tracking systems dealt with MIDI signals and were not based on a multiple-agent architecture. Our system can recognize, in real time, temporal positions of beats in real-world audio signals that contain sounds of various instruments. Our application of multiple-agent architecture enables the system to handle ambiguous situations in interpreting real-world input signals and to examine multiple hypotheses of beat positions in parallel. Even if some agents lose track of the beat, other agents will maintain the correct hypothesis. Each agent is able to interact with other agents to track beats cooperatively, to self-evaluate the reliability of its hypothesis on the basis of the current input situation, and to adapt to the current situation in order to maintain the correct hypothesis. These agents have been implemented on different processing elements in a parallel computer. Our experimental results show that the system is robust enough to handle audio signals sampled from commercially distributed compact discs of popular songs.

Introduction

Multiple-agent architectures have recently been applied in various domains. This paper describes our application of multiple-agent architecture to beat tracking for musical acoustic signals. In our formulation, beat tracking means tracking the temporal positions of quarter notes, just as people keep time to music by hand-clapping or foot-tapping. There are various ambiguous situations that occur when a system interprets real-world input audio signals like those sampled from compact discs. Multiple-agent architecture has the advantage of interpreting those signals and tracking beats in various ways, because different agents can examine multiple hypotheses of beat positions in parallel according to different strategies. The main contribution of this paper is to show that such a multiple-agent architecture is actually useful and effective for a practical real-world application, namely beat tracking.

Beat tracking is an important initial step in computer emulation of human music understanding, since beats are fundamental to the perception of Western music. A person who cannot completely segregate and identify every sound component can nevertheless track musical beats. It is almost impossible to understand music without perceiving beats, since the beat is the basic unit of the temporal structure of music. Moreover, musical beat tracking is itself useful in various applications, such as video editing, audio editing, stage lighting control, and music-synchronized CG animation (Goto & Muraoka 1994). We therefore first build a computational model of beat perception and then extend the model, just as a person recognizes higher-level musical events on the basis of beats.
Various beat-tracking related systems have been undertaken in recent years (Dannenberg & Mont-Reynaud 1987; Desain & Honing 1989; Allen & Dannenberg 1990; Driesse 1991; Rosenthal 1992; Desain & Honing 1994; Vercoe 1994; Large 1995). Some previous systems (Allen & Dannenberg 1990; Rosenthal 1992) have maintained multiple hypotheses to track beats, and an earlier paper (Rosenthal, Goto, & Muraoka 1994) has presented the advantages of the strategy of pursuing multiple hypotheses. Most of the systems maintaining multiple hypotheses, however, were not based on a multiple-agent architecture. The one described in (Allen & Dannenberg 1990) examined two or three hypotheses by beam search and tracked beats in real time. It dealt only with MIDI signals as its input, however, and was not able to deal with audio signals played on several musical instruments. Another MIDI-based system (Rosenthal 1992) maintained a number of hypotheses that were periodically ranked and selected. Those hypotheses were examined sequentially and the system did not work in real time.

We built a beat tracking system that processes real-world audio signals that contain the sounds of various instruments and that recognizes the temporal positions of beats in real time. Our system is based on a multiple-agent architecture in which multiple hypotheses are maintained by programmatic agents using different strategies for beat tracking. Because the input signals are examined from the viewpoints of these various agents, various hypotheses can emerge. Agents that pay attention to different frequency ranges, for example, may track different beat positions. This multiple-agent architecture enables the system to cope with difficult beat-tracking situations: even if some agents lose track of beats, the system will track beats correctly as long as other agents maintain the correct hypothesis.

Each agent is capable of interaction, self-evaluation, and adaptation. In making a hypothesis, the agent interacts with other agents to track beats cooperatively. Each agent then evaluates the reliability of its own hypothesis on the basis of the current input situation, and the most reliable hypothesis is considered the final output. If the reliability of a hypothesis becomes high enough, the agent tries to adapt to the current situation by adjusting a parameter that controls its strategy in order to maintain the correct hypothesis.

To perform this computationally intensive task in real time, the system has been implemented on a parallel computer, the Fujitsu AP1000. Each agent and frequency-analysis module has been implemented on a different processing element. In our experiment with audio signals sampled from compact discs, the system correctly tracked beats in 34 out of 40 popular songs that did not include drum sounds. This result shows that our beat-tracking model based on multiple-agent architecture is robust enough to handle real-world audio signals.

Multiple-agent Architecture for Beat Tracking

In this section we specify the beat tracking problem that we are dealing with and present the main difficulties of tracking beats: ambiguity of interpretation and the need for context-dependent decisions, difficulties which are common to other real-world perceptual problems.
We then describe the multiple-agent architecture used to address the beat tracking problem, defining our agents and outlining their interaction.

Beat Tracking Problem

In our formulation, beat tracking is defined as a process that organizes music into almost regularly spaced beats corresponding to quarter notes. Our beat tracking problem is thus to obtain an appropriate sequence of beat times (temporal positions of beats) that corresponds to input musical audio signals (Figure 1). This sequence of beat times is called the quarter-note level. We also address the higher-level beat tracking problem of determining whether a beat is strong or weak (beat type)[1] under the assumption that the time-signature of an input song is 4/4. This is the problem of tracking beats at the half-note level.

There are various difficulties in tracking the beats in real-world musical acoustic signals. The simple technique of peak-finding with a threshold is not sufficient since there are many energy peaks that are not directly related to beats. Multiple interpretations of beats are possible at any given point because there is not necessarily a single specific sound that directly indicates the beat position; the beat is a perceptual concept that a human feels in music. There are various ambiguous situations, such as ones where several events obtained by frequency analysis may correspond to a beat and where different inter-beat intervals (the temporal difference between two successive beats) seem plausible. In addition, higher-level processing using musical knowledge is necessary for making context-dependent decisions, such as determining whether a beat is strong or weak and evaluating which is the best interpretation in an ambiguous situation.

Our solution to the problem of handling ambiguous situations is to maintain multiple hypotheses, each of which corresponds to a provisional or hypothetical interpretation of the input. A real-time system using only a single hypothesis is subject to garden-path errors. A multiple-hypothesis system can pursue several paths simultaneously, and later decide which one was correct. In other words, in real-time beat tracking these hypotheses represent the results of predicting the next beat in different ways, and it is impossible to know in advance which one will be correct (because future events are not available).

Multiple-agent Architecture

To examine multiple hypotheses in parallel, we use a multiple-agent architecture in which agents with different strategies interact through cooperation and competition to track beats (Figure 2). Several definitions of the term agent have been proposed (Minsky 1986; Maes 1990; Shoham 1993; CACM 1994; Nakatani, Okuno, & Kawabata 1994; ICMAS 1995). In our terminology, the term agent means a software component that satisfies the following three requirements:

1. the agent interacts with other agents to perform a given task;
2. the agent evaluates its own behavior on the basis of the input;
3. the agent adapts to the input by adjusting its own behavior.

Each agent maintains a beat-position hypothesis, which consists of a predicted next-beat time, its beat type (strong or weak), and the current inter-beat interval.

[1] In this paper, a strong beat is either the first or third quarter note in a measure; a weak beat is the second or fourth.

Figure 1: Beat tracking problem. [figure omitted: musical acoustic signals mapped to beat times (quarter-note level) and beat type (half-note level: strong/weak).]

Figure 2: Multiple hypotheses maintained by multiple agents. [figure omitted: several agents predicting different next-beat times and inter-beat intervals.]
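To make the hypothesis structure concrete, here is a minimal sketch in Python; the paper specifies the contents of a hypothesis (plus the reliability discussed below) but not a data layout, so the field names and types are our own:

```python
from dataclasses import dataclass

@dataclass
class BeatHypothesis:
    # All times are in frame-time units (defined later as 11.61 ms).
    next_beat_time: int        # predicted temporal position of the next beat
    beat_type: str             # "strong" or "weak" (half-note level)
    inter_beat_interval: int   # temporal difference between successive beats
    reliability: float = 0.0   # self-evaluated reliability of this hypothesis
```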
In making the hypothesis, the agent interacts with other agents to perform the beat-tracking task (the first requirement). All agents are grouped into pairs that have different strategies for beat tracking. Each agent in a pair examines the same inter-beat interval using the same frequency-analysis results. To predict the next beat times cooperatively, one agent interacts with the other agent in the same pair through a prediction field. The prediction field is an expectancy curve[2] that represents when the next beat is expected to occur (Figure 3). The height of each local peak in the prediction field can be interpreted as the next beat-position possibility. The two agents interact with each other by inhibiting the prediction field in the other agent. The beat time of each hypothesis inhibits the temporally corresponding neighborhood in the other's field (Figure 3). This enables one agent to track the correct beats even if the other agent tracks the middle of two successive correct beats (which compensates for one of the typical tracking errors).

Figure 3: Interaction between agents through a prediction field. [figure omitted: paired agents inhibiting the neighborhoods of their beat times in each other's prediction fields.]

Each agent is able to evaluate its own hypothesis, using musical knowledge, according to the input acoustic signals (the second requirement). We call the quantitative result of this self-evaluation the reliability of the hypothesis. The final beat-tracking result is determined on the basis of the most reliable hypothesis, which is selected from the hypotheses of all agents.

Each agent also adapts to the current input by adjusting its own strategy parameter (the third requirement). If the reliability of a hypothesis becomes high enough, the agent tunes a parameter to narrow the range of possible inter-beat intervals so that it examines only a neighborhood of the current appropriate one. This enables the agent to maintain the hypothesis that has the inter-beat interval appropriate to the current input.

[2] Other systems (Desain 1992; Desain & Honing 1994; Vercoe 1994) have used a similar concept of expectancy curve for predicting future events, but not as a means for managing interaction among agents.
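The mutual inhibition between paired agents can be sketched as follows; the width of the inhibited neighborhood is our own assumption, since the paper does not quantify it:

```python
import numpy as np

def inhibit_partner_field(partner_field: np.ndarray,
                          own_beat_time: int,
                          half_width: int = 5) -> np.ndarray:
    """Suppress the neighborhood of this agent's hypothesized beat time
    in its partner's prediction field (indices are frame-times), so the
    pair tends to settle on distinct beat-position interpretations."""
    field = partner_field.copy()
    lo = max(0, own_beat_time - half_width)          # half_width is assumed
    hi = min(len(field), own_beat_time + half_width + 1)
    field[lo:hi] = 0.0   # inhibited region: the partner prefers other peaks
    return field
```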
System Description

The system for musical audio signals without drum sounds[3] assumes that the time-signature of an input song is 4/4 and that its tempo is constrained to be between 61 M.M. (Mälzel's Metronome: the number of quarter notes per minute) and 120 M.M., and is roughly constant. The emphasis in our system is on finding the temporal positions of quarter notes in audio signals rather than on tracking tempo changes. The system maintains, as the real-time output, a description called beat information (BI) that consists of the beat time, its beat type, and the current tempo.

[3] A detailed description of our beat-tracking system for audio signals that include drum sounds is presented in (Goto & Muraoka 1995a; 1995b).

Figure 4 is a sketch of the processing model of our beat tracking system, and Figure 5 shows an overview of the system. The two main stages of processing are Frequency Analysis, in which several cues used by agents are detected, and Beat Prediction, in which multiple hypotheses of beat positions are examined by multiple agents. Since accurate onset times are indispensable for tracking beats, in the Frequency Analysis stage, the system uses multiple onset-time finders that detect onset times in several different frequency ranges. Those results are transformed into vectorial representation (called onset-time vectors) by several onset-time vectorizers. In the Beat Prediction stage, the system manages multiple agents that, according to different strategies, make parallel hypotheses based on these onset-time vectors. Each agent first calculates the inter-beat interval and predicts the next beat time; it then infers the beat type by communicating with a higher-level checker (described later), and evaluates the reliability of its own hypothesis. The manager gathers all hypotheses and then determines the final output on the basis of the most reliable one. Finally, the system transmits BI to other application programs via a computer network.

Figure 4: Processing model. [figure omitted: onset-time finders, onset-time vectorizers, agents, higher-level checkers, and the manager.]

Figure 5: Overview of our beat tracking system. [figure omitted: A/D conversion, FFT, Frequency Analysis (extracting onset components), onset-time finders and vectorizers, Beat Prediction with agents and higher-level checkers, the manager, and BI Transmission.]

The following sections describe the main stages of Frequency Analysis and Beat Prediction.

Frequency Analysis

In the Frequency Analysis stage, the frequency spectrum and several sequences of n-dimensional onset-time vectors are obtained for later processing (Figure 6). The full frequency band is split into several frequency ranges, and each dimension of the onset-time vectors corresponds to a different frequency range. This representation makes it possible to consider onset times of all the frequency ranges at the same time. Each sequence of onset-time vectors is obtained using a different set of weights for frequency ranges. One sequence, for example, focuses on middle frequency ranges, and another sequence focuses on low frequency ranges.

Fast Fourier Transform (FFT)

The frequency spectrum (the power spectrum) is calculated with the FFT using the Hanning window. Each time the FFT is applied to the digitized audio signal, the window is shifted to the next frame.

Figure 6: An example of frequency spectrum and an onset-time vector sequence. [figure omitted]

In our current implementation, the input signal is digitized at 16 bit / 22.05 kHz, and two kinds of FFT are calculated. One FFT, for extracting onset components in the Frequency Analysis stage, is calculated with a window size of 1024 samples (46.44 ms), and the window is shifted by 256 samples (11.61 ms). The frequency resolution is consequently 21.53 Hz and the time resolution (1 frame-time) is 11.61 ms. The frame-time is the unit of time used in our system, and the term time in this paper is defined as the time measured in units of the frame-time.
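The onset-extraction FFT settings above translate directly into a short-time analysis loop. A minimal numpy sketch under those settings (the function name and structure are ours):

```python
import numpy as np

SAMPLE_RATE = 22050   # Hz, 16-bit input
WINDOW_SIZE = 1024    # samples -> 46.44 ms window, 21.53 Hz resolution
HOP_SIZE = 256        # samples -> 11.61 ms, the system's frame-time

def power_spectrogram(signal: np.ndarray) -> np.ndarray:
    """Power spectrum per frame-time, as in the onset-extraction FFT."""
    window = np.hanning(WINDOW_SIZE)
    n_frames = 1 + (len(signal) - WINDOW_SIZE) // HOP_SIZE
    frames = []
    for i in range(n_frames):
        chunk = signal[i * HOP_SIZE : i * HOP_SIZE + WINDOW_SIZE] * window
        frames.append(np.abs(np.fft.rfft(chunk)) ** 2)
    return np.array(frames)   # shape: (n_frames, WINDOW_SIZE // 2 + 1)
```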
The other FFT, for examining chord changes in the Beat Prediction stage, is simultaneously calculated in audio down-sampled at 16 bit / 11.025 kHz with a window size of 1024 samples (92.88 ms), and the window is shifted by 128 samples (11.61 ms). The frequency and time resolution are consequently 10.77 Hz and 1 frame-time.

Extracting Onset Components

Frequency components whose power has been rapidly increasing are extracted as onset components. The onset components and their degree of onset (rapidity of increase in power) are obtained from the frequency spectrum by a process that takes into account the power present in nearby time-frequency regions. More details on the method of extracting onset components can be found in (Goto & Muraoka 1995a).

Onset-time Finders

Multiple onset-time finders (seven in our current implementation) detect onset times in several different frequency ranges (0-125 Hz, 125-250 Hz, 250-500 Hz, 500 Hz-1 kHz, 1-2 kHz, 2-6 kHz, and 6-11 kHz). Each onset time is given by the peak time found by peak-picking in D(t) along the time axis, where

  D(t) = Σ_f d(t, f),

and d(t, f) is the degree of onset of frequency f at time t. Limiting the range of frequency for the summation of D(t) makes it possible to find onset times in the different frequency ranges.

Onset-time Vectorizers

Each onset-time vectorizer transforms the results of all onset-time finders into sequences of onset-time vectors: the same onset times in all the frequency ranges are put together into a vector. In the current system, three vectorizers transform onset times from seven finders into three sequences of seven-dimensional onset-time vectors with different sets of frequency weights (focusing on all/low/middle frequency ranges). These results are sent to agents in the Beat Prediction stage.
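Given the degree-of-onset values d(t, f) from the onset-component extraction, each finder's computation reduces to a band-limited sum D(t) followed by peak-picking along time. A minimal sketch (the peak-picking rule here is our own simplification; the paper defers the details of onset extraction to Goto & Muraoka 1995a):

```python
import numpy as np

def onset_times_in_band(d: np.ndarray, freqs: np.ndarray,
                        f_lo: float, f_hi: float) -> list[int]:
    """Sum the degree of onset d(t, f) over one frequency range and
    return the frame-times of local peaks in D(t)."""
    band = (freqs >= f_lo) & (freqs < f_hi)
    D = d[:, band].sum(axis=1)           # D(t) = sum_f d(t, f) over the band
    return [t for t in range(1, len(D) - 1)
            if D[t] > D[t - 1] and D[t] >= D[t + 1] and D[t] > 0.0]

# The seven frequency ranges used by the system's onset-time finders (Hz):
BANDS = [(0, 125), (125, 250), (250, 500), (500, 1000),
         (1000, 2000), (2000, 6000), (6000, 11000)]
```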
Eachagent has four parameters that determine its strategy for makingthe hypothesis(Figure 7), and the settings these parametersvary from agent to agent. The lirst parameter, frequency focus ~.’pe, determineswhichvectorizer the agent receives onset-time vectors from. This value is chosen from amongtype-all, type-low, and type-middle, respectively correspondingto vectorizers focusing on all frequency ranges, low frequency ranges, and middle frequency ranges. The second parameter, autocorrelation period, determines the windowsize for calculating vectorial autocorrelation of the sequence of onset-time vectors to determine the interbeat interval. Thegreater this value, the older the onset-time itfformation considered. The third parameter, inter-beat interval range, controls the range of possible inter-beat intervals. Asdescribedlater, this limits the range of selectit)g peakin the result of the vectorial autocorrelation. The fourth parameter,initial peakselection, takes a value of either primary or secondary. Whenthe value is primary, the largest peakin the prediction field is initially selected, and the peak is considered as the next beat time; whenthe value is secondar).,, the secondlargest peak is selected. This helps to obtain a variety of hypotheses. In our current implementation there are twelve agents groupedinto six agent-pairs, and twelve higher-level checkers corresponding to these agents. Initial settings of the strategy parametersare listed in Table 1. As explainedin Section Multiple-agent ArcMtecture,the parameterinter-beat interval range is adjusted as the processinggoes on. The following sections describe the formation and mat)agement of hypotheses. First, each agent determines the inter-beat interval using autocorrelation; it then interacts with its paired agent through the prediction field that is Table1: Initial settings of the strategy parameters. pair frequency autointer-beat initial -agent focus type correlation interval peak period range selection 1-I type-all 500f.t. 43-85f.t. prmaary 1-2 type-all 500f.t. 43-85f.t. secondary 2-1 type-all 1000f.t. 43-85f.t. prmaary 2-2 type-all 1000f.t. 43-85f.t. secondary 3-1 type-low 500f.t. 43-85f.t. prmaary 3-2 type-low 500f.t. 43-85f.t. secondary 4-1 type-low I(X)O f.t. 43-85f.t. primary 4-2 type-low 1000f.t. 43-85f.t. secondary 5-1 type-middle 500f.t. 43-85f.t. prtmary 5-2 type-middle 50()f.t. 43-85 l’.t. secondary 6-1 type-middle 1000f.t. 43-85f.t. primary 6-2 type-middle 1000f.t. 43-85f.t. secondary ""Lt.’" is the abbreviationof frame-time ¢ 11.61ms). formed using cross-correlation, and predicts the next beat time. Second, the agent communicateswith the higher-level checker to infer the beat type and evaluates its ownreliability. The checker examinespossibilities of chord changes by analyzing the frequencyspectrumon the basis of the current hypothesis received from the agent. Finally, the manager gathers all the hypotheses,and the most reliable one is considered as the output. Beat-predicting Agents In our formulation, beats are characterized by twoproperties: period (inter-beat interval) and phase. Thephaseof a beat is the beat position relative to a reference point, usually the previous beat time. Wemeasure phase in radians; for a quarter-note beat, for example, an eighth-note displacementct~rresponds to a phase-shift of r radians. Eachagent first determinesthe current inter-beat interval (period) (Figure 8). The agent receives the sequence onset-timevectors and calculates their vectorial autocorrelation a. 
Beat-predicting Agents

In our formulation, beats are characterized by two properties: period (inter-beat interval) and phase. The phase of a beat is the beat position relative to a reference point, usually the previous beat time. We measure phase in radians; for a quarter-note beat, for example, an eighth-note displacement corresponds to a phase-shift of π radians.

Each agent first determines the current inter-beat interval (period) (Figure 8). The agent receives the sequence of onset-time vectors and calculates their vectorial autocorrelation.[4] The windowed and normalized vectorial autocorrelation function Ac(τ) is defined as

  Ac(τ) = [ Σ_{t=c-W}^{c} (o(t) · o(t-τ)) w(c-t) ] / [ Σ_{t=c-W}^{c} (o(t) · o(t)) w(c-t) ]    (1)

where o(t) is the n-dimensional onset-time vector at time t, c is the current time, and W is the strategy parameter autocorrelation period. The window function w(t) is given by

  w(t) = 1.0 - 0.5 t / W.    (2)

The inter-beat interval is given by the τ with the maximum height in Ac(τ) within the range limited by the parameter inter-beat interval range.

To determine the beat phase, the agent then forms the prediction field (Figure 8). The prediction field is the result of calculating the cross-correlation function between the sequence of the onset-time vectors and the sequence of beat times whose interval is the inter-beat interval. As mentioned in Section Multiple-agent Architecture, the two agents in the same pair interact with each other by inhibiting the prediction field in the other agent. Each local peak in the prediction field is considered as a possible beat phase. When the reliability of a hypothesis is low, the agent initially selects the peak in the prediction field according to the parameter initial peak selection, and then tries to pursue the peak equivalent to the previously selected one. This calculation corresponds to evaluating all possibilities of the beat phase under the current inter-beat interval. The next beat time is thus predicted on the basis of the inter-beat interval and the current beat phase.

The agent receives the two kinds of possibilities of chord changes, at the quarter-note level and at the eighth-note level, by communicating with the higher-level checker. We call the former the quarter-note chord change possibility and the latter the eighth-note chord change possibility. The quarter-note (eighth-note) chord change possibility represents how a chord is likely to change at each quarter-note (eighth-note) position under the current hypothesis.

To infer the beat type, we use the second kind of musical knowledge, which means that the quarter-note chord change possibility is higher on a strong beat than on a weak beat. If the quarter-note chord change possibility is high enough, its time is considered to indicate the position of the strong beat. The following beat type is then determined under the assumption that strong and weak beats alternate (Figure 8).

The agent finally evaluates the reliability of its own hypothesis by using the first and third kinds of musical knowledge. According to the first kind, the reliability is determined according to how the next beat time predicted on the basis of the onset times coincides with the time extrapolated from the past two beat times (Figure 8). If they coincide, the reliability is increased; otherwise, the reliability is decreased. According to the third kind of knowledge, if the eighth-note chord change possibility is higher on beats than on eighth-note displacement positions, the reliability is increased; otherwise, the reliability is decreased.

[4] The paper (Vercoe 1994) also proposed using a variant of autocorrelation for rhythmic analysis.

Figure 8: Predicting the next beat. [figure omitted: the inter-beat interval obtained by autocorrelation, the prediction field obtained by cross-correlation, and the check of how the predicted next-beat time coincides with extrapolation from past beat times.]
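A direct transcription of Eqs. (1) and (2) is sketched below; it assumes the onset-time vectors are stored as rows of a matrix indexed by frame-time, and that the current time c is at least W plus the largest τ examined (function and variable names are ours):

```python
import numpy as np

def inter_beat_interval(o: np.ndarray, c: int, W: int,
                        tau_lo: int, tau_hi: int) -> int:
    """Return the tau in [tau_lo, tau_hi] (the inter-beat interval range,
    in frame-times) that maximizes the windowed, normalized vectorial
    autocorrelation Ac(tau) of Eqs. (1)-(2).
    o: (time, n) matrix of onset-time vectors; c: current time;
    W: the agent's autocorrelation-period parameter."""
    ts = np.arange(c - W, c + 1)
    w = 1.0 - 0.5 * (c - ts) / W                  # Eq. (2), evaluated at c - t
    denom = np.sum(np.einsum('ij,ij->i', o[ts], o[ts]) * w)
    best_tau, best_ac = tau_lo, -np.inf
    for tau in range(tau_lo, tau_hi + 1):
        num = np.sum(np.einsum('ij,ij->i', o[ts], o[ts - tau]) * w)
        ac = num / denom if denom > 0 else 0.0    # Eq. (1)
        if ac > best_ac:
            best_tau, best_ac = tau, ac
    return best_tau
```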
Higher-level Checkers

For the audio signals without drum sounds, each higher-level checker examines the two kinds of chord change possibilities according to the hypotheses received from the corresponding agent. The checker first slices the frequency spectrum into strips at the quarter-note times (beat times) for examining the quarter-note chord change possibility, and slices it at the eighth-note times interpolated from beat times for examining the eighth-note chord change possibility (Figure 9). The checker then finds peaks along the frequency axis in a histogram summed up along the time axis in each strip. These peaks can be considered as the dominant tones' pitches in each strip. Some peaks may be components of a chord, and others may be components of a melody. Our current implementation considers only peaks whose frequency is less than 1 kHz.

The checker evaluates the chord change possibilities by comparing these peaks between adjacent strips. The more and the louder the peaks occur compared with the previous strip, the higher the chord change possibilities. For the quarter-note (eighth-note) chord change possibility, the checker compares the strips whose period corresponds to the quarter-note (eighth-note) duration under the current hypothesis.

Figure 9 shows examples of the two kinds of chord change possibilities. The horizontal lines above represent peaks in each strip's histogram. The thick vertical lines below represent the chord change possibility. The beginning of a measure comes at every four quarter-notes from the extreme left in (a), and the beat comes at every two eighth-notes from the extreme left in (b).

Figure 9: Examples of peaks in sliced frequency spectrum and chord change possibility. [figure omitted: (a) examining quarter-note chord change possibility; (b) examining eighth-note chord change possibility.]
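The checker's strip-and-compare procedure can be sketched as follows. The scoring rule ("the more and the louder the new peaks, the higher the possibility") is our own simple stand-in, since the paper does not give a formula:

```python
import numpy as np

def chord_change_possibility(spectrogram: np.ndarray, freqs: np.ndarray,
                             boundaries: list[int]) -> list[float]:
    """For each strip between consecutive boundary times (beat times, or
    interpolated eighth-note times), find dominant-pitch peaks below
    1 kHz in a time-summed histogram, then score how strongly the peaks
    change relative to the previous strip."""
    keep = freqs < 1000.0                       # only peaks under 1 kHz
    strips = []
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        hist = spectrogram[lo:hi, keep].sum(axis=0)   # sum along time axis
        peaks = {f: hist[f] for f in range(1, len(hist) - 1)
                 if hist[f] > hist[f - 1] and hist[f] >= hist[f + 1]}
        strips.append(peaks)
    scores = [0.0]                              # first strip has no predecessor
    for prev, cur in zip(strips[:-1], strips[1:]):
        new_power = sum(p for f, p in cur.items() if f not in prev)
        total = sum(cur.values()) or 1.0
        scores.append(new_power / total)        # assumed scoring rule
    return scores
```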
Hypotheses Manager

The manager classifies all agent-generated hypotheses into groups according to beat time and inter-beat interval. Each group has an overall reliability given by the sum of the reliabilities of the group's hypotheses. The manager then selects the dominant group that has the highest reliability. Since a wrong group could be selected if temporarily unstable beat times split the appropriate dominant group, the manager repeats grouping and selecting three times while narrowing the allowable margin of beat times for belonging to the same group. The most reliable hypothesis in the dominant group is thus selected as the output and sent to the BI Transmission stage.

The manager updates the beat type in the output using only the beat type that was labeled when the quarter-note chord change possibility was high compared with the recent maximum possibility. When the possibility was not high enough, the updated beat type is determined from the previous reliable beat type based on the alternation of strong and weak beats. This enables the system to disregard an incorrect beat type that is caused by a local irregularity of chord changes.
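The manager's repeated group-and-select step might look like the sketch below, reusing the BeatHypothesis record from earlier; the concrete margin schedule is an assumption, as the paper states only that the margin is narrowed over three rounds:

```python
def select_output(hyps, margins=(8, 4, 2)):
    """Group hypotheses whose beat times and inter-beat intervals agree
    within a margin (frame-times; schedule assumed), keep the group with
    the highest summed reliability, and repeat with narrower margins.
    Returns the most reliable hypothesis of the final dominant group."""
    candidates = list(hyps)
    for margin in margins:                      # three rounds, narrowing
        groups = []
        for h in candidates:
            for g in groups:
                ref = g[0]
                if (abs(h.next_beat_time - ref.next_beat_time) <= margin and
                        abs(h.inter_beat_interval - ref.inter_beat_interval)
                        <= margin):
                    g.append(h)
                    break
            else:
                groups.append([h])              # no matching group: start one
        candidates = max(groups, key=lambda g: sum(h.reliability for h in g))
    return max(candidates, key=lambda h: h.reliability)
```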
Implementation on a Parallel Computer

Parallel processing provides a practical and feasible solution to the problem of performing a computationally intensive task, such as processing and understanding complex audio signals, in real time. Our system has been implemented on a distributed-memory parallel computer, the Fujitsu AP1000, which consists of 64 processing elements (Ishihata et al. 1991). A different element or group of elements is assigned to each module, such as the FFT, the onset-time finder, the onset-time vectorizer, the agent, the higher-level checker, and the manager. These modules run concurrently and communicate with others by passing messages between processing elements. We use four kinds of parallelizing techniques in order to execute the heterogeneous processes simultaneously (Goto & Muraoka 1996). The processes are first pipelined, and then each stage of the pipeline is implemented with data/control parallel processing, pipeline processing, and distributed cooperative processing. This implementation makes it possible to analyze audio signals in various ways and to manage multiple agents in real time.

Experiments and Results

We tested the system for audio without drum sounds on 40 songs performed by 28 artists. The initial one or two minutes of those songs were used as the inputs. The inputs were monaural audio signals sampled from commercial compact discs of the popular music genre. Their tempi ranged from 62 M.M. to 116 M.M. and were roughly constant. It is usually more difficult to track beats in songs without drum sounds than in songs with drum sounds, because they tend to have fewer sounds which fall on the beat, and musical knowledge is difficult to apply in general.

In our experiment, the system correctly tracked beats (i.e., obtained the beat time and type) in 34 out of the 40 songs in real time.[5] In each song where the beat was eventually determined correctly, the system initially had trouble determining the beat type, even though the beat time was correct. Within at most fifteen measures of the beginning of the song, however, both the beat time and type had been determined correctly. In most of the mistaken songs, beat times were not obtained correctly since onset times were very few or the tempo fluctuated temporarily. In other songs, the beat type was not determined correctly because of irregularity of chord changes. These results show that the system is robust enough to deal with real-world musical signals.

We have also developed an application with the system that displays a computer graphics dancer whose motion changes with musical beats in real time (Goto & Muraoka 1994). This application has shown that our system is also useful in multimedia applications in which human-like hearing ability is desirable.

[5] Our other beat-tracking system for audio signals that include drum sounds, which is based on a similar multiple-agent architecture, correctly tracked beats in 42 out of the 44 songs that included drum sounds (Goto & Muraoka 1995b).

Discussion

Our goal is to build a system that can understand musical audio signals in a human-like fashion. We believe that an important initial step is to build a system which, even in its preliminary implementation, can deal with real-world acoustic signals like those sampled from compact discs. Most previous beat tracking systems had great difficulty working in real-world acoustic environments, however. Most of these systems (Dannenberg & Mont-Reynaud 1987; Desain & Honing 1989; Allen & Dannenberg 1990; Driesse 1991; Rosenthal 1992) have dealt with MIDI signals as their input. Since it is quite difficult to obtain complete MIDI representations from audio data, MIDI-based systems are limited in their application. Although some systems (Schloss 1985; Katayose et al. 1989) dealt with audio signals, they had difficulty processing music played on ensembles containing a variety of instruments and did not work in real time.

Our strategy of first building a real-time system that works in real-world complex environments and then upgrading the ability of the system is related to the scaling-up problem (Kitano 1993) in the domain of artificial intelligence (Figure 10). As Hiroaki Kitano stated:

  "experiences in expert systems, machine translation systems, and other knowledge-based systems indicate that scaling up is extremely difficult for many of the prototypes." (Kitano 1993)

Figure 10: Scaling-up problem (Kitano 1993). [figure omitted: task complexity plotted against domain size (closeness to the real world), contrasting toy systems, intelligent systems, useful systems, and systems that pay off, with scalability as the path between them.]
In other words, it is hard to scale up a system whose preliminary implementation works only in laboratory environments. We think that our strategy addresses this issue and that the application of multiple-agent architecture makes the system robust enough to work in real-world environments.

Some researchers might regard as agents several modules in our system, such as the onset-time finders, the onset-time vectorizers, the higher-level checkers, and the manager. In our terminology, however, we define the term agent as a software component of distributed artificial intelligence that satisfies the three requirements presented in Section Multiple-agent Architecture. We therefore do not consider those modules agents: they are simply concurrent objects.

Conclusion

We have presented a multiple-agent architecture for beat tracking and have described the configuration and implementation of our real-time beat tracking system. Our system tracks beats in audio signals containing sounds of various instruments and reports beat information in time to input music. The experimental results show that the system is robust enough to handle real-world audio signals sampled from compact discs of popular music.

The system manages multiple agents that track beats according to different strategies in order to examine multiple hypotheses in parallel. This enables the system to follow beats without losing track of them, even if some hypotheses become wrong. Each agent can interact with other agents to track beats cooperatively and can evaluate its own hypothesis according to musical knowledge. Each agent also can adapt to the current input by adjusting its own strategy. These abilities make it possible for the system to handle ambiguous situations by maintaining various hypotheses, and they make the system robust and stable.

We plan to upgrade the system to make use of other higher-level musical structure, and to generalize it to other musical genres. Future work will include application of the multiple-agent architecture to other perceptual problems and will also include a study of more sophisticated interaction among agents and a more dynamic multiple-agent architecture in which the total number of agents is not fixed.

Acknowledgments

We thank David Rosenthal for his helpful comments on earlier drafts of this paper. We also thank Fujitsu Laboratories Ltd. for use of the AP1000.

References

Allen, P. E., and Dannenberg, R. B. 1990. Tracking musical beats in real time. In Proc. of the 1990 Intl. Computer Music Conf., 140-143.

CACM. 1994. Special issue on intelligent agents. Communications of the ACM 37(7):18-147.

Dannenberg, R. B., and Mont-Reynaud, B. 1987. Following an improvisation in real time. In Proc. of the 1987 Intl. Computer Music Conf., 241-248.

Desain, P., and Honing, H. 1989. The quantization of musical time: A connectionist approach. Computer Music Journal 13(3):56-66.

Desain, P., and Honing, H. 1994. Advanced issues in beat induction modeling: syncopation, tempo and timing. In Proc. of the 1994 Intl. Computer Music Conf., 92-94.

Desain, P. 1992. Can computer music benefit from cognitive models of rhythm perception? In Proc. of the 1992 Intl. Computer Music Conf., 42-45.

Driesse, A. 1991. Real-time tempo tracking using rules to analyze rhythmic qualities. In Proc. of the 1991 Intl. Computer Music Conf., 578-581.

Goto, M., and Muraoka, Y. 1994. A beat tracking system for acoustic signals of music. In Proc. of the Second ACM Intl. Conf. on Multimedia, 365-372.

Goto, M., and Muraoka, Y. 1995a. Music understanding at the beat level: real-time beat tracking for audio signals. In Working Notes of the IJCAI-95 Workshop on Computational Auditory Scene Analysis, 68-75.

Goto, M., and Muraoka, Y. 1995b. A real-time beat tracking system for audio signals. In Proc. of the 1995 Intl. Computer Music Conf., 171-174.

Goto, M., and Muraoka, Y. 1996. Parallel implementation of beat tracking system: real-time musical information processing on AP1000 (in Japanese). Transactions of Information Processing Society of Japan 37(7):1460-1468.

ICMAS. 1995. Proc. of the First Intl. Conf. on Multi-Agent Systems. The AAAI Press / The MIT Press.

Ishihata, H.; Horie, T.; Inano, S.; Shimizu, T.; and Kato, S. 1991. An architecture of highly parallel computer AP1000. In IEEE Pacific Rim Conf. on Communications, Computers, Signal Processing, 13-16.

Katayose, H.; Kato, H.; Imai, M.; and Inokuchi, S. 1989. An approach to an artificial music expert. In Proc. of the 1989 Intl. Computer Music Conf., 139-146.

Kitano, H. 1993. Challenges of massive parallelism. In Proc. of IJCAI-93, 813-834.

Large, E. W. 1995. Beat tracking with a nonlinear oscillator. In Working Notes of the IJCAI-95 Workshop on Artificial Intelligence and Music, 24-31.

Maes, P., ed. 1990. Designing Autonomous Agents: Theory and Practice from Biology to Engineering and Back. The MIT Press.

Minsky, M. 1986. The Society of Mind. Simon & Schuster, Inc.
Nakatani, T.; Okuno, H. G.; and Kawabata, T. 1994. Auditory stream segregation in auditory scene analysis. In Proc. of AAAI-94, 100-107.

Rosenthal, D.; Goto, M.; and Muraoka, Y. 1994. Rhythm tracking using multiple hypotheses. In Proc. of the 1994 Intl. Computer Music Conf., 85-87.

Rosenthal, D. 1992. Machine Rhythm: Computer Emulation of Human Rhythm Perception. Ph.D. Dissertation, Massachusetts Institute of Technology.

Schloss, W. A. 1985. On the Automatic Transcription of Percussive Music: From Acoustic Signal to High-Level Analysis. Ph.D. Dissertation, CCRMA, Stanford University.

Shoham, Y. 1993. Agent-oriented programming. Artificial Intelligence 60(1):51-92.

Vercoe, B. 1994. Perceptually-based music pattern recognition and response. In Proc. of the Third Intl. Conf. for the Perception and Cognition of Music, 59-60.