Latin Rhythms Classification on MIDI Files

by Arturo Camacho
Abstract
This paper describes an algorithm that classifies songs (in MIDI format) into
different music styles. Although the principle we use is general enough to include
several music styles, we restrict our work to three of the most popular Latin
American styles: Salsa, Merengue, and Cumbia. The reason is that these styles
have well-defined patterns for some instruments, among them the bass and the
conga, which our algorithm exploits. We studied the patterns that those
instruments play and created templates for them. To classify a song, we
compare it against each of the templates and select the one that matches the
input song most strongly. We then declare that the song belongs to the style
represented by that template. One problem with this approach is that patterns
within a style are not unique: many variations of a style exist. Furthermore,
within a single song, an instrument does not usually play the same pattern
throughout. Here we describe a method to deal with such variations and classify
songs using degrees of membership in styles.
Introduction
The goal of this project is to classify music according to its style (rhythm). The rhythms
to be classified are Salsa, Merengue, and Cumbia, which are some of the most common
Latin rhythms. The songs to be analyzed come in the form of MIDI files. The reason
for choosing this format is that classification on raw audio is very difficult, at least
for the approach used in this project, which is based on the position of the notes
within a measure.
The first approach is based on a feature that is usually distinctive in each of the
rhythms: the bass line. To identify the bass line, we use the General MIDI
(GM) specification, in which the programs that correspond to bass sounds are predefined.
Then we analyze the conga, a percussion instrument common to the three rhythms.
It is also easy to extract because its sounds come in predefined keys and a predefined
channel. Finally, we use information about the tempo. However, tempo serves only to
distinguish Merengue from the others, because Salsa and Cumbia tempos typically fall
into the same range.
The classification techniques are based on template matching, a measure of
similarity between the input data and predefined templates for each of the styles. We
start with fixed templates and then move to more flexible structures that allow
variation within the styles. Finally, we discuss details of the implementation.
Feature Extraction
To classify a song into a category, we extract from the song two of the most
characteristic instruments in Latin music: bass and congas. The bass is easy to extract.
Every MIDI channel has a program assigned (from 0 to 127), and the General MIDI
standard establishes which instrument corresponds to each program, so we just have to
look for the programs corresponding to basses. These are programs 32 to
39. Therefore, to extract the bass line we extract the notes from the MIDI
channels that have one of those programs assigned. Something similar occurs with the
congas. They are located at keys 62, 63, and 64 in the standard percussion channel
(channel 10). Thus, we only have to extract the notes corresponding to these keys in that
channel.
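The extraction step can be sketched in Python as follows. This is only an illustration under stated assumptions: the MIDI file is taken to be already parsed into a list of note events plus a channel-to-program map, each channel is assumed to keep one program for the whole song, and the `Note` type, the function names, and the 0-based channel numbering are hypothetical choices, not part of the original implementation.

```python
from typing import NamedTuple

BASS_PROGRAMS = range(32, 40)   # programs 32..39, the bass range given in the text
CONGA_KEYS = {62, 63, 64}       # conga keys given in the text
PERCUSSION_CHANNEL = 9          # "channel 10" when channels are counted from 0


class Note(NamedTuple):
    channel: int    # 0-based MIDI channel (assumed convention)
    key: int        # MIDI key number
    onset: float    # onset time in beats (quarter notes)


def extract_bass(notes, programs):
    """Return the notes played on channels whose program is a bass.

    `programs` maps a 0-based channel number to its assigned program.
    """
    bass_channels = {
        channel
        for channel, program in programs.items()
        if program in BASS_PROGRAMS and channel != PERCUSSION_CHANNEL
    }
    return [note for note in notes if note.channel in bass_channels]


def extract_congas(notes):
    """Return the conga notes found on the percussion channel."""
    return [
        note
        for note in notes
        if note.channel == PERCUSSION_CHANNEL and note.key in CONGA_KEYS
    ]


# Tiny made-up example: one bass note, one conga hit, one snare, one piano note.
example_notes = [Note(1, 36, 0.0), Note(9, 63, 0.5), Note(9, 38, 1.0), Note(0, 60, 0.0)]
example_programs = {0: 0, 1: 33}
bass_line = extract_bass(example_notes, example_programs)
conga_line = extract_congas(example_notes)
```

In this example only the note on channel 1 (program 33) ends up in `bass_line`, and only the key-63 hit ends up in `conga_line`.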
Template Matching
Once we have obtained the notes to be used to recognize a song, we have to define templates
for the patterns that those instruments usually play in each of the categories to be
recognized. The templates for the bass are shown in figure 1. As can be seen, the template
for each of the categories consists of one two-beat measure. This is not always the case
for Salsa. However, in such cases the difference is so slight that we can still obtain good
results using this two-beat template.
Figure 1. Bass templates for Merengue, Cumbia, and Salsa. Bass templates for these
rhythms emphasize distinct parts of the measure, and most songs follow these templates
very closely. This is what makes the bass a good feature for classification.
Intuitively, if the bass plays the pattern given in one of the templates throughout the
whole song, it should be very likely that the song belongs to that category. However, this
is not always true. It could be that the other instruments are playing things that do not
belong to that category and that the overall result is a different category. In other words,
if we analyze only one instrument from a song that contains many of them, we cannot be sure
about the decision we make. An example is some types of Flamenco: many Flamenco
songs have a bass line similar to the bass line of Salsa. Nevertheless, this is the approach
we follow. We extract and analyze only a few instruments (the ones that
have the most well-defined patterns) and base our decision on them. The reason is that
for some instruments the variety of playing styles is so wide that it is difficult to
define a template.
If an instrument does not perfectly match any of the templates throughout the whole song
(which occurs most of the time), we would still like to know whether it belongs to one of
those categories. It may well belong, because a template is very unlikely to be perfectly
matched throughout a complete song, even when it is clear to a person who knows about
music that the instrument is playing that style.
Therefore, we have to come up with a way to measure how similar the pattern an
instrument is playing is to a template. After that, intuitively, we could establish a
threshold above which we can claim that an instrument is playing a particular style of music.
The measure we propose is the following. Start by analyzing the similarity of only one
measure and come up with a score. Then, to obtain the overall score of the whole song,
sum the scores obtained for each of the measures.
When the template and the input data contain the same number of notes in a given
measure, it seems natural to give a score based on the distances between the
notes in the input and those in the template. However, it is not clear what to do when
fewer or more notes are played in the input than in the template. Besides, this is not
the natural way we perceive music (at least not musically trained people). When we listen
to music we usually have in mind a “minimum common divisor” of the rhythm. Our mind
quantizes (relates) each sound to the closest of these units (see figure 2). In Latin
music, every beat is usually divided into four parts, and every note played between two of
these is perceived as if it was intended to be played at one of them but was slightly delayed
or ahead. We follow the way our brain simplifies music and assign each note
in the input and in the template to one of the four parts (from now on called bins) that
divide the beat.
Figure 2. Quantization. Notes that are in the middle of a minimal common division are
interpreted as equivalent to the bin center.
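The quantization into bins can be sketched as follows. The sketch assumes note onsets are available in beats (quarter notes) counted from the start of the song, and that measures are two beats long as in the templates; the function names and the 0-based bin indices (bin index `b` here is the text's bin number `b + 1`) are illustrative choices.

```python
BINS_PER_BEAT = 4
BEATS_PER_MEASURE = 2
BINS_PER_MEASURE = BINS_PER_BEAT * BEATS_PER_MEASURE  # eight bins per measure


def quantize(onset_beats):
    """Map a non-negative onset (in beats) to (measure index, bin index).

    The onset is snapped to the nearest quarter of a beat, so slightly
    early or late notes land in the bin they were presumably meant for.
    """
    slot = int(onset_beats * BINS_PER_BEAT + 0.5)
    return divmod(slot, BINS_PER_MEASURE)


def activation_grid(onsets):
    """Return one list of eight 0/1 flags per measure (1 = bin activated)."""
    grid = {}
    for onset in onsets:
        measure, bin_index = quantize(onset)
        grid.setdefault(measure, [0] * BINS_PER_MEASURE)[bin_index] = 1
    num_measures = max(grid) + 1 if grid else 0
    return [grid.get(m, [0] * BINS_PER_MEASURE) for m in range(num_measures)]


# Slightly imprecise onsets: 1.02 and 1.49 snap to bins 4 and 6 of measure 0.
example_grid = activation_grid([0.0, 1.02, 1.49, 2.0, 3.0])
```

A note at 1.9 beats is closer to the start of the next measure than to the last bin of the current one, so `quantize(1.9)` assigns it to bin 0 of measure 1.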
To compute the similarity of an input and a template in a given measure, we compare the
values of each of the bins in the input and the template. If a bin is activated in both the
input and the template, or deactivated in both, we increment the score for the
respective category by one. On the other hand, if a bin is activated in the input but not in
the template, or vice versa, we do not increment the score. As each of our templates consists
of a two-beat measure, we have a total of eight bins per measure, and therefore a
maximum score of eight can be obtained in each measure. The total score for the whole
song is the sum of the scores obtained in each of the measures.
Figure 3. Template Matching. The score is increased only if the input and the template
have the same state (activated or deactivated). If they differ, the score stays unchanged.
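A minimal sketch of this per-measure score, assuming each measure is represented as eight 0/1 flags. The fixed Cumbia bass template used in the example (notes in the 1st, 5th, and 7th bins) is read off the paper's later description of the Cumbia bass line and should be treated as an assumption, since figure 1 is not reproduced here; the function names are hypothetical.

```python
CUMBIA_BASS_TEMPLATE = (1, 0, 0, 0, 1, 0, 1, 0)  # assumed from the text, see above


def measure_score(input_bins, template_bins):
    """Count the bins whose state (activated or not) agrees with the template."""
    return sum(1 for played, expected in zip(input_bins, template_bins)
               if played == expected)


def song_score(measures, template_bins):
    """Sum the per-measure scores over the whole song."""
    return sum(measure_score(measure, template_bins) for measure in measures)


example_measures = [
    (1, 0, 0, 0, 1, 0, 1, 0),  # identical to the template: 8 points
    (0, 0, 0, 0, 1, 0, 1, 0),  # first note missing: 7 points
    (1, 0, 0, 1, 1, 0, 0, 0),  # one extra note, one missing note: 6 points
]
example_total = song_score(example_measures, CUMBIA_BASS_TEMPLATE)
```

For these three measures the total is 8 + 7 + 6 = 21 out of a possible 24.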
It is important to realize that this is a linear system. Instead of comparing the bins of
every measure against the template and accumulating the score, we could make a
histogram of the bins over the whole song and compare it against the template only at the
end. In that case, for each of the eight bins we would only need to add its frequency f to
the score if the bin is activated in the template, or otherwise add m - f, its complement
with respect to the number of measures m of the song.
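The equivalence between the two ways of scoring can be checked with a short sketch. It assumes measures are eight 0/1 flags; the template and the example data are made up for illustration, and the function names are hypothetical.

```python
def histogram(measures):
    """Frequency f of each bin: in how many measures the bin was activated."""
    return [sum(column) for column in zip(*measures)]


def histogram_score(freqs, template_bins, num_measures):
    """Add f for bins activated in the template and m - f for the others."""
    return sum(f if expected else num_measures - f
               for f, expected in zip(freqs, template_bins))


def per_measure_score(measures, template_bins):
    """Reference computation: compare every measure against the template."""
    return sum(
        sum(1 for played, expected in zip(measure, template_bins)
            if played == expected)
        for measure in measures
    )


template = (1, 0, 0, 0, 1, 0, 1, 0)  # illustrative template
measures = [
    (1, 0, 0, 0, 1, 0, 1, 0),
    (0, 0, 0, 0, 1, 0, 1, 0),
    (1, 0, 0, 1, 1, 0, 0, 0),
]
freqs = histogram(measures)
score_from_histogram = histogram_score(freqs, template, len(measures))
score_from_measures = per_measure_score(measures, template)
```

Both computations give the same total here, and the histogram version only needs one pass over the eight bins once the frequencies are known.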
Clearly, with this template approach, songs in which the instruments play patterns
very similar to the template most of the time will have a very high score for that style. In
the results we obtained, the templates we used were different enough
to give a significant difference in score, so that a song could be classified into a category
just by taking the one with the highest score. Unfortunately, templates for styles are not
unique. There can be slight differences between templates for a style that, if not taken
into consideration, could make our classifier perform very badly.
Variations to Templates
As we mentioned when discussing template matching, there are often
slight variations within styles that make it necessary (1) to define more than one template
for a style, or (2) to define a more flexible structure that accepts variations within styles.
The first option is ideal because we could make as many templates as we want to match
all the variations we know. However, that would be computationally too expensive.
We came up with a way to carry out the second option that produces almost as good results
as the first one but requires only a few extra computations. There is no general “rule of
thumb”; the rule depends on the case. For example, in Salsa and Cumbia it is not always true
that the first note of the measure is not played. Actually, in most Cumbias the first note is
played, but in some of them it is not. Such a note therefore does not
constitute a criterion for deciding whether or not a song is a Cumbia, so we can just
ignore it in our analysis. This is not the case for the note in the fifth bin (the
first note in the template). This note is almost always played.
This idea suggests the following method. Define a vector v = [v_1, v_2, …, v_8] such that
v_i ∈ {-1, 0, 1} for i ∈ {1, 2, …, 8}. If there is a note in the i-th bin of the template and
that note is strictly required (as in the 5th and 7th bins of the bass line for Cumbia), set
v_i = 1. If it is just optional (i.e., it cannot be used as a criterion, as in the 1st bin of the
bass line for Cumbia), set v_i = 0. If there is no note in the i-th bin of the template (as in
the 2nd, 3rd, 4th, 6th, and 8th bins of the bass line for Cumbia), set v_i = -1. This means
that if a note is played in that bin, the score decreases by one. In the Cumbia bass line case
we would get a score of 2 in the best case and a score of –5 in the worst case. We can add a
bias of 6 and therefore keep the score in the
range [0,12] (actually [1,12], with the free gift of the 1st bin). This criterion was used in all
the templates and helped to expand the variations recognized in each of the styles.
Figure 4. Templates with optional notes. Required notes have a value of positive one,
so that if they are present in the input they contribute positively to the score. Optional
notes have a value of zero: whether or not they are present makes no difference. The rest of
the bins have a value of –1, decreasing the score if notes are present in the input.
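The weighted template can be sketched as below for a single measure. The Cumbia bass weights follow the bin assignments given in the text (required 5th and 7th bins, optional 1st bin, all other bins penalized); the function name and the 0/1 representation of a measure are assumptions made for illustration.

```python
# Weights for bins 1..8 of the Cumbia bass line, as described in the text:
# bin 1 optional (0), bins 5 and 7 required (+1), the rest penalized (-1).
CUMBIA_BASS_WEIGHTS = (0, -1, -1, -1, 1, -1, 1, -1)


def weighted_score(input_bins, weights):
    """Sum the weights of the bins in which a note was actually played."""
    return sum(weight for played, weight in zip(input_bins, weights) if played)


best_case = weighted_score((0, 0, 0, 0, 1, 0, 1, 0), CUMBIA_BASS_WEIGHTS)
with_optional_note = weighted_score((1, 0, 0, 0, 1, 0, 1, 0), CUMBIA_BASS_WEIGHTS)
worst_case = weighted_score((0, 1, 1, 1, 0, 1, 0, 1), CUMBIA_BASS_WEIGHTS)
```

Playing or omitting the optional first note leaves the score unchanged, while the best and worst single-measure scores are the 2 and –5 quoted in the text.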
Another difficulty we met was related to the simplification we made by using only two-beat
measures. Latin music measures are intrinsically four-beat based most of the time.
We made the simplification for two reasons. First, the first and second halves of
the four-beat measure are usually so similar that it made some sense to treat them as equal.
Second, the first half of the four-beat measure template and the second half can appear
permuted (i.e., in some songs the first one can come second and vice versa).
Furthermore, this order can change within a song itself. A typical case is the common
change of “clave” in Salsa. Some Salsas start with the 3-2 clave and change to the 2-3
clave in more rhythmic parts.
To solve this problem of mapping a four-beat measure onto a two-beat measure, we
observed the following. If a part of the pattern is played in only one half of the pattern
but not in the other, then after collapsing the four-beat measures into two-beat measures
that part of the pattern should appear in only half of the two-beat measures of the song
(see figure 5). Therefore, as the number of times that part of the pattern is played moves
away from the mean (half the number of two-beat measures), the similarity with the
template decreases, and so should the score. This can be done by replacing
the frequency of that bin with its distance from the mean and setting its value in the
vector to –1. Thus, as the distance from the mean increases, the score decreases.
Figure 5. Collapsing 4-beat measures into 2-beat measures. Putting the notes of
bins 9 to 16 into bins 1 to 8 makes the two lower notes (bins 12 and 13) appear in
bins 4 and 5 with half the frequency of the rest of the notes.
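One possible reading of this rule, sketched on the histogram of a song. It assumes the score is the sum of frequency times weight over the bins, and that for bins expected in only half of the measures the frequency is replaced by its distance from m / 2 and weighted by –1. The weights, frequencies, and names below are made-up illustrations, not the templates actually used.

```python
def score_with_half_bins(freqs, weights, half_bins, num_measures):
    """Histogram score in which some bins are expected in half of the measures.

    `half_bins` holds the 0-based indices of bins that should be activated in
    only half of the two-beat measures; they are scored as minus their
    distance from the mean num_measures / 2.
    """
    total = 0.0
    for index, (freq, weight) in enumerate(zip(freqs, weights)):
        if index in half_bins:
            total -= abs(freq - num_measures / 2)
        else:
            total += freq * weight
    return total


# Hypothetical template over 8 two-beat measures; bins 4 and 5 of the text
# (indices 3 and 4 here) are the ones expected half of the time.
weights = (1, -1, -1, -1, -1, -1, 1, -1)
half_bins = {3, 4}
ideal = score_with_half_bins([8, 0, 0, 4, 4, 0, 8, 0], weights, half_bins, 8)
too_often = score_with_half_bins([8, 0, 0, 8, 4, 0, 8, 0], weights, half_bins, 8)
never = score_with_half_bins([8, 0, 0, 0, 4, 0, 8, 0], weights, half_bins, 8)
```

Under this reading, a bin that is played in every measure and a bin that is never played are penalized equally, because both are four measures away from the expected half.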
The last problem we have to solve is intrinsically related to the indefinite pitch
of percussion sounds. For example, in the conga line of the Cumbia the
tradition is to play a high-pitched sound on the first upbeat (3rd bin) and a lower-pitched
sound on the second upbeat (7th bin). However, as we have three conga sounds (mute, open hi,
and open low), each with a different pitch, we have three options to meet this
requirement. One way to solve the problem would be to define three templates, but as we
have discussed before, this is computationally expensive. As the template for the conga has
only one sound at a time, playing only one note at a time subject to the defined pitch
constraints should produce a perfect score. The problem is that, as we have two admissible
conga sounds for each of the bins (e.g., mute and open hi for the 3rd bin, and open hi and
low for the 7th bin), we do not know which to choose. We simply bias the system in favor of
the style (Cumbia) by selecting as the correct sound the one that was played the
most. If only that one was played and not the other, the score will be maximum. If
the other sound was also played, the score should decrease, because according to the
template only one of them should be played.
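A rough sketch of this selection rule for a single bin. It assumes we already have, for that bin, a count of how often each admissible conga sound was played over the song; scoring the bin as the count of the chosen sound minus the count of the other sounds is one possible interpretation, and the function name and sound labels are hypothetical.

```python
def choose_sound_and_score(counts):
    """Pick the most played admissible sound for a bin and score the bin.

    `counts` maps a sound label to the number of times it was played in the
    bin. The chosen sound counts in favor; any other admissible sound played
    in the same bin counts against. Ties go to the first label in `counts`.
    """
    if not counts:
        return None, 0
    chosen = max(counts, key=counts.get)
    penalty = sum(count for sound, count in counts.items() if sound != chosen)
    return chosen, counts[chosen] - penalty


# 3rd bin of a made-up Cumbia: mostly mute, with a few open-hi hits.
mixed = choose_sound_and_score({"mute": 10, "open_hi": 2})
# Only one sound ever played in the bin: maximum score for that count.
clean = choose_sound_and_score({"mute": 0, "open_hi": 12})
```

With only one sound played the bin reaches its maximum score, and every occurrence of the other sound lowers it.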
Tempo Criteria
The last feature that we consider is one of the simplest but most important: the tempo.
The tempo (i.e., speed) of a specific style falls within an approximate range.
For example, Salsas and Cumbias usually have a tempo between 70 and 105 quarter notes
per minute. Merengues, however, range from 120 to 160 quarter notes per minute.
Therefore, this criterion can serve to distinguish Merengue from
the rest, but it cannot distinguish Salsa from Cumbia.
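The tempo criterion amounts to a range check. The sketch below takes both ranges from the text, read in quarter notes per minute; treating the bounds as inclusive and the names used are assumptions.

```python
# Approximate tempo ranges from the text, in quarter notes per minute.
TEMPO_RANGES = {
    "Salsa": (70, 105),
    "Cumbia": (70, 105),
    "Merengue": (120, 160),
}


def tempo_ok(style, tempo):
    """True if the tempo lies within the range given for the style."""
    low, high = TEMPO_RANGES[style]
    return low <= tempo <= high


def styles_for_tempo(tempo):
    """All styles whose tempo range contains the given tempo."""
    return [style for style in TEMPO_RANGES if tempo_ok(style, tempo)]


slow = styles_for_tempo(95)       # cannot tell Salsa from Cumbia
fast = styles_for_tempo(140)      # only Merengue fits
between = styles_for_tempo(112)   # falls in the gap between the ranges
```

A tempo of 95 is compatible with both Salsa and Cumbia, which is why tempo alone cannot separate those two styles.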
Implementation
In our implementation, we used the ideas described here for the conga but not for the
bass. The reason is that we initially started classifying the songs based only on the bass,
and a simpler approach gave good enough results, so we did not implement the more complex
approach for the bass. We had to use it for the conga, though, because the simple
approach would not work for it.
For the classification based on the bass, we just took the histogram and selected the two
bins with the highest counts. If they matched the bins of one of the templates, we classified
the song as belonging to that style; if not, the song was left unclassified. Figures 6, 7,
and 8 show that most Merengues have maximums in the 1st and 5th bins, most Salsas have
maximums in the 4th and 7th bins, and most Cumbias have maximums in the 5th and 7th
bins. Recall that for Salsa and Cumbia we did not consider the 1st bin as a criterion for
classification.
Figure 6. Examples of histograms for the bass line for Merengues. Most Merengues have
maximums in the 1st and 5th bins.
Figure 7. Examples of histograms for the bass line for Salsas. Most Salsas have
maximums in the 4th and 7th bins.
Figure 8. Examples of histograms for the bass line for Cumbias. Most Cumbias have
maximums in the 5th and 7th bins.
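The simple bass classifier can be sketched as follows. The peak bins per style come from the text; how the 1st bin is ignored for Salsa and Cumbia (here: dropped before the two largest bins are picked) and the decision to return every matching style are assumptions about details the text does not spell out. Bin indices are 0-based, and all names are hypothetical.

```python
# For each style: the two peak bins (0-based) and whether bin 1 is ignored.
BASS_PEAKS = {
    "Merengue": ({0, 4}, False),  # 1st and 5th bins
    "Salsa": ({3, 6}, True),      # 4th and 7th bins, 1st bin ignored
    "Cumbia": ({4, 6}, True),     # 5th and 7th bins, 1st bin ignored
}


def top_two(freqs, ignore_first):
    """Indices of the two most frequent bins, optionally skipping bin 1."""
    candidates = [i for i in range(len(freqs)) if not (ignore_first and i == 0)]
    ranked = sorted(candidates, key=lambda i: freqs[i], reverse=True)
    return set(ranked[:2])


def classify_bass(freqs):
    """Return every style whose peak bins match; empty means unclassified."""
    return [style for style, (peaks, ignore_first) in BASS_PEAKS.items()
            if top_two(freqs, ignore_first) == peaks]


# Made-up histograms for illustration.
merengue_like = classify_bass([40, 2, 1, 3, 38, 6, 5, 1])
salsa_like = classify_bass([5, 0, 1, 30, 2, 0, 33, 1])
cumbia_like = classify_bass([20, 1, 0, 2, 39, 1, 37, 0])
unmatched = classify_bass([1, 30, 28, 0, 2, 1, 0, 3])
```

Because each style is tested separately, a histogram can in principle match more than one style or none at all under this sketch.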
To merge the results of the three analyses, we take the logical AND of the three results.
In other words, we recognize a song as belonging to a style if and only if the bass and
conga methods return that same style and the tempo is within the boundaries for that
style.
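The combination rule can be sketched as a single function. It assumes each instrument classifier returns one style name or `None` when nothing was recognized, and it reuses the tempo ranges quoted in the paper (in quarter notes per minute); the names are illustrative.

```python
TEMPO_RANGES = {
    "Salsa": (70, 105),
    "Cumbia": (70, 105),
    "Merengue": (120, 160),
}


def combine(bass_style, conga_style, tempo):
    """Return the style only if bass, conga, and tempo all agree on it."""
    if bass_style is None or bass_style != conga_style:
        return None
    low, high = TEMPO_RANGES[bass_style]
    return bass_style if low <= tempo <= high else None


agreed = combine("Cumbia", "Cumbia", 90)
disagreement = combine("Salsa", "Merengue", 90)
wrong_tempo = combine("Merengue", "Merengue", 90)
bass_missing = combine(None, "Cumbia", 90)
```

Any single disagreement, including an unrecognized bass line, leaves the song unclassified.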
Results
Here are the results obtained independently by the classifier based on the bass, conga and
tempo:
CUMBIAS
SONG                 BASS             CONGA     TEMPO
Ano Viejo            Not recognized   Cumbia    ok
Amor Prohibido       Cumbia           Cumbia    ok
Mosaico de Billos    Cumbia           Cumbia    ok
La Canoa             Cumbia           Cumbia    ok
El Colesterol        Cumbia           Cumbia    ok
Colombia             Cumbia           Cumbia    ok
Mosaico de Cumbia    Cumbia           Cumbia    ok
Dos Mujeres          Cumbia           Cumbia    ok
Piel Morena          Cumbia           Cumbia    ok
MERENGUES
SONG                 BASS             CONGA     TEMPO
Cachamba             Merengue         Merengue  ok
Dame de eso          Merengue         Merengue  ok
Dejala               Merengue         Merengue  ok
Don Goyo             Merengue         Merengue  ok
El Venao             Merengue         Merengue  ok
Everybody            Merengue         Merengue  ok
Fiesta Caliente      Merengue         Merengue  ok
La Bomba             Merengue         Merengue  ok
Llorar               Merengue         Merengue  ok
Por un mani          Merengue         Merengue  ok
La Morena            Merengue         Merengue  ok
El Tiburón           Merengue         Merengue  ok
SALSA
SONG                 BASS              CONGA     TEMPO
Abriendo Puertas     Salsa             Merengue  ok
Amiga                Salsa & Merengue  Salsa     ok
Amores como el n...  Salsa             Salsa     ok
Ayer                 Salsa             Salsa     ok
Carnaval             Salsa             Merengue  ok
Conciencia           Salsa             Salsa     ok
Hasta el Sol de Hoy  Salsa             Salsa     ok
Juliana              Salsa             Salsa     ok
La Cita              Salsa             Salsa     ok
Lloraras             Salsa             Salsa     ok
No vale la pena      Salsa             Salsa     ok
Yamulemaue           Salsa             Salsa     ok
As can be seen, our program failed on only 4 of the 33 songs. For “Ano Viejo” the bass
classifier failed because for about half the song the bass plays a very unusual line. I have
never seen a bass line like that in Cumbia. It plays the same rhythmic pattern that the
piano usually plays, which is very uncommon.
In “Amiga” the bass classifier failed basically because of a resolution problem. It turns out
that some people record MIDI files at a non-traditional tempo. In this case, the song was
recorded at 160 quarter notes per minute, but the beat is carried by a different note value.
This is related to a very common interpretation problem in music notation, but it is outside
the scope of this paper. The problem can be solved by applying certain rules about the
agreement between the number of measures and the histogram. Actually, this solution is
implemented in the method used for the conga.
For “Carnaval” I expected problems because this song is not well defined. Half of the song
seems to be Cumbia and half seems to be Merengue. It was a surprise to see that
the conga classifier labeled it as Merengue. This was completely unexpected.
Further Work
The next step in this project will be to analyze other instruments. A traditional
instrument in all of these styles is the piano. However, templates for this instrument have
much more variance than those for bass and congas, so they are more difficult to define.
Another goal would be to expand the scope of this project to audio files. This seems
feasible because the lowest frequency components in an audio file usually come from the
bass, so it would be possible to extract the bass line and convert it into a MIDI file. As we
obtained relatively good results based only on the bass, we expect that if the audio-to-MIDI
conversion is done accurately enough, the system will succeed.