NOTE RECOGNITION IN INDIAN CLASSICAL MUSIC USING NEURAL NETWORKS

Authors: Amit Bose and Ashutosh Bapat
Affiliation: Government College of Engineering, Pune.

Abstract

Listening to and understanding classical music is one of the finest abilities gifted to human beings. The process of understanding starts with recognizing the notes in classical music. This paper proposes a model for extracting the notes present in a given presentation of music. The first layer extracts the frequency content present in the input, while the second matches it against already learnt patterns. The correctness of the model is supported by results from various tests conducted. Observations regarding the clustering of neurons and the effects of varying parameters are also discussed. At the end of the paper the stepping-stones towards an 'intelligent machine singer' are conjectured.

1. Introduction

The representation of a Raag in Indian classical music is a sequence of notes. The sequence in which notes are sung obeys certain rules associated with the Raag. To identify a Raag from a piece of music, we thus need to first identify the notes that are present in the piece. Then we must find which rule of Raag is satisfied by the particular sequence of notes so identified. Although this manner of Raag identification presents a very simplistic view of things, it outlines the essence of the identification process. Our paper concentrates on the first part of this task, namely identification of the notes of classical music present in a given piece of music. In particular, the paper concentrates on identifying the basic notes. We describe here the implementation of the neural network based model we have built for this task and also discuss the results obtained with it.

Neural networks as a tool for identification

The choice of neural networks as the method for identifying notes comes naturally if one considers the human analogy. Music, whether it be for listening, enjoying, composing or understanding, is considered to be one of those defining qualities that distinguish humans from machines. On the other hand, efforts in Artificial Intelligence seek to develop human-like abilities in machines, computer programs in particular. Artificial neural networks form an AI tool that is inspired by the biological neural network present in the human brain [1]. It is believed that human beings learn by training of the neurons present in the brain. The same applies to human recognition of music: humans train themselves to identify music by repeatedly exposing themselves to it and trying to extract patterns that represent the information contained therein. In fact, this is believed to be the technique used by the human brain for identifying visual images and sensations of touch as well [2]. It would thus seem that identification of musical notes, and eventually of a Raag, is a form of pattern recognition in music. A lot of work has already been done on the use of neural networks for pattern recognition, so we felt that neural networks presented the right kind of methodology for the task at hand.

The model of identification that we have built is based on unsupervised learning. This is necessary because the classification of the various patterns is inherently an unsupervised process. This is also how humans learn to identify music: a student of music is initially not sure how a piece of music will finally be classified as having a particular set of notes.
The process of differentiating between notes is guided mainly by the student's ability to extract and recognize patterns, and this ability is usually developed over time with practice. The identification process itself can thus be described as unaided or unsupervised, although supervision is necessary for naming the patterns as Sa, Re and so on.

2. Unsupervised or competitive learning

An artificial neural network is a system loosely modeled on the human brain. It is an attempt to simulate, within specialized hardware or sophisticated software, multiple layers of simple processing elements called neurons. Each neuron is linked to some of its neighbors with varying coefficients of connectivity that represent the strengths of these connections. Learning is accomplished by adjusting these strengths so that the overall network outputs appropriate results. The coefficients of connectivity are often referred to as weights in the literature; we shall conform to the same usage.

Basically, all artificial neural networks have a similar topology [1]. Some of the neurons interface with the real world to receive its inputs (the input layer) and other neurons provide the real world with the network's outputs (the output layer). All the remaining neurons are hidden from view (the hidden layer(s)). The learning ability of a neural network is determined by its architecture and by the algorithmic method chosen for training. There are three common kinds of learning [2]:

• Supervised learning, where the system is trained by giving it the correct answers to the questions; the system is taught the right way to do something.
• Reinforcement learning, a special kind of supervised learning where the system is only told how good its responses were, not what the correct solutions are.
• Unsupervised learning, where no such information is given to the system.

In unsupervised learning, the objective is not to learn the correct answer but to classify the inputs internally. For example, to cluster all the inputs to the system into 6 categories, the system has to figure out how to box the inputs into these 6 categories as best as possible. What those categories are is left unspecified.

A popular approach to unsupervised learning is competitive learning [2], where neurons fight amongst themselves to decide which category a particular input belongs to. The neurons closest to the input are the strongest, and they ultimately win the contest. The learning rule for such a network changes only the weights going from the input to the winner neuron. The weights are adjusted so that this input makes the winner neuron even stronger than it was: the weights are moved a little away from their previous settings and towards the input values. The degree to which this shift occurs is controlled by a learning constant. In order that neurons fight for supremacy, the neurons usually have inhibitory weights among themselves.

A kind of lateral inhibition, which applies only to the nearest neighbors of a neuron, results in a dynamic known as feature mapping. The basic idea here is to build an architecture in which a neuron helps its nearby friends more than neurons further away. The network then not only learns to cluster inputs into categories, but categories that are related to each other get clustered by neurons spatially close to one another. This spatial distribution is called a self-organizing map.
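As a rough illustration of the winner-take-all rule just described, the sketch below selects the neuron closest to an input vector and nudges only that neuron's weights towards it. This is a generic illustration, not part of the model described later; the name `competitive_step` and the vector-valued toy input are assumptions.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// One winner-take-all training step: the neuron whose weight vector is
// nearest to the input wins, and only its weights are pulled towards the
// input by the learning constant alpha.
void competitive_step(std::vector<std::vector<double>>& weights,
                      const std::vector<double>& input, double alpha) {
    std::size_t winner = 0;
    double best = std::numeric_limits<double>::max();
    for (std::size_t j = 0; j < weights.size(); ++j) {
        double dist = 0.0;                         // squared distance to input
        for (std::size_t i = 0; i < input.size(); ++i) {
            const double diff = input[i] - weights[j][i];
            dist += diff * diff;
        }
        if (dist < best) { best = dist; winner = j; }
    }
    for (std::size_t i = 0; i < input.size(); ++i)  // move winner towards input
        weights[winner][i] += alpha * (input[i] - weights[winner][i]);
}
```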
A procedure developed by Teuvo Kohonen [3] does a remarkable job of spatially dividing up the input data into clusters in the output. This procedure is called the Kohonen algorithm.

3. Indian Classical Music

The following points regarding Indian classical music are relevant to the model [4]:

• In Indian classical music, a note (swar) is defined as a melodious sound having a particular frequency. In a saptak there are 7 shuddha (pure) notes, named Sa, Re, Ga ... Nee. Sa to Nee forms a saptak, while Saa starts the next saptak.
• In all, 22 notes (including the 7 shuddha notes) are defined in a saptak. They are called shrutees. The shrutees are the transitional frequencies that can be sung between shuddha notes.
• Sa is used as the base or datum for singing. Saa has a frequency twice that of Sa.
• The shuddha notes roughly follow a geometric progression; thus the ratio of two successive notes is about 2^{1/7}.
• Since notes are frequencies, the Fourier Transform of the utterance of a note shows a definite pattern, as shown in Fig 1. This pattern can be used to recognize the note sung.

Fig 1: Amplitude spectrum of a note.

• Bhatkhande, assuming Sa to be 240 Hz, gives the actual frequencies of all the notes, including the shrutees [4].
• There are 3 saptaks: 1) the Mandra saptak, consisting of notes with lower frequencies; 2) the Madhya saptak, consisting of notes with medium frequencies; 3) the Taar saptak, consisting of notes with higher frequencies.
• Normally humans can sing from Ma in the Mandra saptak to Pa in the Taar saptak. This gives lower and upper bounds on the frequencies that can be sung, approximately 150 Hz and 500 Hz respectively.

4. Neural network model

The structure of the proposed model is shown in Fig 2.

Fig 2: Block diagram of the note recognition system.

4.1 Input
The input to the system is digitized voice in the form of a WAVE (PCM format) file (*.wav). The file is broken into small pieces such that each piece is likely to contain a single note. The size of a piece depends upon:
1. the sampling period of the wave file
2. the minimum period for which a note is sung.
To keep the model general, the smallest interval of time for which a singer sustains a frequency must be taken into consideration; we have taken this somewhat arbitrarily as 0.09 s, which has given good results. (A sketch of this segmentation is given at the end of this section.)

4.2 Layer 1
One piece of the input wave file is presented to the layer at a time. The layer consists of N neurons, each resonating with a single frequency. During training the neurons form clusters, each cluster resonating with a note or, more generally, with a frequency.

4.3 Output of Layer 1
The output of Layer 1 is the set of normalized amplitudes of the Fourier Transform of the piece, calculated at the resonant frequency of each neuron in Layer 1. Along with this, the layer also supplies the winning neuron's index and its resonant frequency.

4.4 Layer 2
Layer 2 is a pattern-matching layer in which each neuron tries to match itself to the input Fourier Transform pattern. The neuron with the closest match determines the note present in the piece.

4.5 Output
The output is the sequence of notes present in the wave file.
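As a concrete illustration of the segmentation described in Section 4.1, the following sketch splits a buffer of decoded PCM samples into pieces of roughly 0.09 s each. The identifiers and the handling of the trailing partial piece are assumptions for illustration, not taken from the authors' implementation.

```cpp
#include <cstddef>
#include <vector>

// Split decoded PCM samples into consecutive pieces of roughly 0.09 s each,
// so that each piece is likely to contain a single sustained note.
std::vector<std::vector<double>>
split_into_pieces(const std::vector<double>& samples,
                  double sample_rate_hz, double piece_seconds = 0.09) {
    const std::size_t piece_len =
        static_cast<std::size_t>(piece_seconds * sample_rate_hz);
    std::vector<std::vector<double>> pieces;
    if (piece_len == 0) return pieces;
    for (std::size_t start = 0; start + piece_len <= samples.size();
         start += piece_len) {
        pieces.emplace_back(samples.begin() + start,
                            samples.begin() + start + piece_len);
    }
    return pieces;   // a trailing partial piece is simply dropped in this sketch
}
```

For example, at a hypothetical sampling rate of 11025 Hz this would give 992 samples per piece.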
5. Layer 1: Pattern Generator

The structure of Layer 1 is based roughly on the Kohonen model. It consists of N neurons arranged in a linear topology. This layer acts as the input layer, so each neuron receives input from the external world; this input is nothing but the pieces of the input wave file, fed one at a time. Since there is just a single input, each neuron has only one weight associated with it. This weight is the frequency that has been assigned to the neuron, also called its resonant frequency.

When the Layer 1 neurons are presented with a particular piece of the wave file, each neuron calculates the Fourier Transform of the piece at its resonant frequency. The amplitude of the transform then becomes the activation level of the neuron. If x(n), 0 ≤ n < P, denotes the samples of the input piece and F_i denotes the resonant frequency of the i-th neuron in Layer 1, then its activation level is given as [5]

A_i = \left| \sum_{n=0}^{P-1} x(n) \, e^{-j 2\pi F_i T n} \right|

where T is the sampling period of the signal. The neuron with the largest activation level is declared the winner i*, that is, A_{i*} ≥ A_i for i = 0 to N−1.

Humans usually produce frequencies in the range of 150 Hz to 500 Hz while singing. Let F_min and F_max denote the minimum and maximum frequencies that demarcate this range. Initially the Layer 1 neurons are assigned frequencies within this range in ascending order, so the step in the assignment of frequencies is

ΔF = (F_max − F_min) / N

When the Fourier Transform of a piece is calculated at each neuron, the transform effectively gets evaluated at a discrete set of N frequencies, since each neuron has a different frequency. The neuron whose frequency is closest to the dominant frequency in the piece will have the maximum activation level and will thus be declared the winner. However, the winner may not always be sufficiently close to the actual dominant frequency. Hence the winner neuron is allowed to modify some of its neighboring neurons in the hope of reaching sufficiently close to the actual frequency [3]. The neurons that constitute the neighborhood of the winner satisfy a criterion based on their frequency: the difference between the frequency of the winner and that of each neighbor falls within a threshold. This threshold must depend on ΔF, and hence on N, if we want the neighborhood to contain at least a single neuron. With N = 100, ΔF = 3.5 Hz, so a threshold of 5 Hz may be deemed suitable.

The neighborhood neurons are modified so that their frequencies come closer to that of the winner. The change in frequency of a neighboring neuron is

δF_i = α (F_{i*} − F_i)

which is added to F_i. Here α is the learning constant for Layer 1; it takes values between 0 and 1 and controls the rate at which the neighboring neurons approach the winner. The activation levels of the modified neurons are recalculated and a winner is chosen again. This process of modifying the neighborhood is repeated until the frequencies of the two extreme neighbors of the winner differ only slightly.

The effect of this kind of modification is that the winner strengthens its neighborhood while strengthening itself in return. As a result, a set of Layer 1 neurons tunes itself to a particular range of frequencies corresponding to a note; minor variations in the frequency of the note being sung do not change the part of Layer 1 that wins for that note. This leads to a feature map in Layer 1, with each note causing only a particular set of neurons to win. The neurons of Layer 1 thus cluster around particular notes.
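A minimal sketch of these two steps is given below. It is only an illustration of the equations above, not the authors' implementation; the function names and the use of std::complex are assumptions.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Activation level of every Layer 1 neuron: the amplitude of the Fourier
// Transform of the piece x(n), evaluated at the neuron's resonant frequency.
std::vector<double> layer1_activations(const std::vector<double>& x,
                                       const std::vector<double>& freqs,
                                       double T /* sampling period, s */) {
    const double pi = std::acos(-1.0);
    std::vector<double> act(freqs.size(), 0.0);
    for (std::size_t i = 0; i < freqs.size(); ++i) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t n = 0; n < x.size(); ++n)
            sum += x[n] * std::exp(std::complex<double>(
                              0.0, -2.0 * pi * freqs[i] * T * n));
        act[i] = std::abs(sum);   // A_i = |sum_n x(n) e^{-j 2 pi F_i T n}|
    }
    return act;
}

// Pull the resonant frequencies of the winner's neighbours towards it:
// delta F_i = alpha * (F_winner - F_i) for neighbours within threshold_hz.
void pull_neighbourhood(std::vector<double>& freqs, std::size_t winner,
                        double alpha, double threshold_hz) {
    for (std::size_t i = 0; i < freqs.size(); ++i)
        if (i != winner &&
            std::abs(freqs[i] - freqs[winner]) < threshold_hz)
            freqs[i] += alpha * (freqs[winner] - freqs[i]);
}
```

The winner is simply the index of the largest activation; in the model, the activation computation and the neighborhood adjustment are iterated until the winner's extreme neighbors differ only slightly in frequency.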
When a human sings a particular note, the musical piece contains, in addition to the correct frequency for that note, overtones of that frequency. These overtones have Fourier Transform amplitudes that are sometimes comparable to the amplitude at the correct frequency, as shown in Fig 3. Thus in some cases the neurons resonating at overtones overwhelm the correct-frequency neurons and turn out to be the winners. This also happens when the correct frequency is not present at any neuron but the overtone is.

Fig 3: Overtone having larger amplitude than the actual frequency.

It is not possible to determine directly from a winner whether it is resonating at an overtone, so a heuristic is used to overcome this difficulty. If the winner has frequency F_{i*} and ½ F_{i*} is greater than F_min, it is assumed that the winner may be an overtone. In that case, the neuron whose frequency is closest to ½ F_{i*} is found and assigned the assumed correct frequency ½ F_{i*}. In addition, this neuron is given a chance to modify its neighbors, in the hope that the correct frequency will be assigned to some neuron and that neuron will come to prevail. Even if this does not happen and the overtone ultimately wins, the neighborhood of the assumed correct frequency will have been strengthened, which helps Layer 2 in note identification. The heuristic does not, however, hamper the determination of legitimate frequencies whose half is greater than F_min, because the half-frequencies in such cases usually have poor activation levels. Hence the frequency that is actually contained in the piece usually emerges as the winner.
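The heuristic can be sketched as follows. This is illustrative only; the function name and parameters are assumed, and the neighborhood strengthening reuses the same pull-towards rule described above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// If half of the winner's frequency is still above F_min, the winner may be
// an overtone: retune the neuron nearest to that half-frequency and
// strengthen its neighbourhood so the true note can still be recognized.
void apply_overtone_heuristic(std::vector<double>& freqs, std::size_t winner,
                              double f_min, double alpha, double threshold_hz) {
    const double half = 0.5 * freqs[winner];
    if (half <= f_min) return;                 // winner cannot be an overtone

    // Find the neuron whose resonant frequency is closest to the half-frequency.
    std::size_t nearest = 0;
    for (std::size_t i = 1; i < freqs.size(); ++i)
        if (std::abs(freqs[i] - half) < std::abs(freqs[nearest] - half))
            nearest = i;

    freqs[nearest] = half;                     // assumed correct frequency
    for (std::size_t i = 0; i < freqs.size(); ++i)  // strengthen its neighbourhood
        if (i != nearest && std::abs(freqs[i] - half) < threshold_hz)
            freqs[i] += alpha * (half - freqs[i]);
}
```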
Fig 4: Inter-connection between Layer 1 and Layer 2.

6. Layer 2: Pattern Matcher

Layer 2 is a simple competitive learning neural network based on Euclidean distance. It has L neurons, with the value of L depending on the number of notes that we wish to identify; roughly, it is twice the number of notes, in order to allow the transitions between notes to be identified as well. Each neuron of Layer 1 is connected to each neuron of Layer 2, as shown in Fig 4, with the activation levels of the Layer 1 neurons acting as inputs to the Layer 2 neurons. The activation levels are first normalized using the activation level of the Layer 1 winner before being fed to Layer 2. Each Layer 2 neuron j has a set of weights W_j such that one weight is associated with one Layer 1 neuron output. Since the overall purpose of Layer 2 is to assign a unique neuron to each type of input pattern of activation levels, the weight vector W_j associated with a neuron represents the pattern of activation levels that it recognizes; hence the weight vector is also referred to as a prototype in the literature.

In order to determine how similar a particular input pattern is to the weight vector of a neuron, a distance measure is used. Though such a neural network is sensitive to the choice of distance metric, we have used the Euclidean distance for Layer 2 because it is the one used most often [6]. The Euclidean distance between weight vector W_j and the input vector of activation levels A is

d(W_j, A) = \sum_{i=1}^{N} (A_i − W_{j,i})^2

where W_{j,i} is the weight associated with activation level A_i for Layer 2 neuron j. The winner neuron in Layer 2 must be the neuron whose weight vector is as near as possible to the input pattern. Hence the neuron j* is declared the winner if d(W_{j*}, A) is minimum, that is,

d(W_{j*}, A) ≤ d(W_j, A) for j = 1 to L

The connection weights of the winner neuron j* are modified so that the weight vector W_{j*} moves closer to A, while the other weights remain unchanged. This is the so-called "winner-takes-all" kind of modification. The weight update rule is thus

W_{j*} = W_{j*} + η (A − W_{j*})

where η is the learning constant. The learning constant takes values between 0 and 1 and determines how quickly the values of the weight vector converge, i.e. reach a steady value. Usually the learning constant is a non-increasing function of time; with this, later corrections in the weights are smaller, resulting in faster convergence of the network [6]. In our model, η is decreased gradually with each presentation of the training set. Each presentation of the activation levels for all pieces in the input wave file is called an epoch. In training mode, η has a large value (say 0.5) for the first epoch, and the value is reduced by 25% at the end of each epoch. The epochs are discontinued once η becomes very small (say, falls below 0.1). In validation or use mode, η starts with a slightly lower value (say 0.25).

The result of passing Layer 2 through a number of epochs is that particular cells in Layer 2 develop weight vectors that are very close to the activation level patterns for particular notes. Such cells turn out to be the winners when the corresponding notes are present, and hence such cells start "recognizing" particular notes of music. The note associated with the Layer 2 winner for a particular piece becomes the note identified by the model, and thus the model is able to recognize the notes of classical music present in a musical piece.

When operated in use mode, the weight vectors are initialized with the values obtained during training sessions. During training, however, the initial values of the weight vectors of the Layer 2 neurons are set to 0. Usually the initial values in an unsupervised network are set randomly [6], so that the network does not have any bias towards particular patterns; moreover, using neurons whose initial weight vectors are almost identical sometimes results in splitting a cluster into a number of parts, and an improper choice of initial values can also lead to "dead cells" [6]. In our case, however, initializing the neuron weight vectors to 0 gave better results than random initialization.

It has been mentioned earlier that Layer 1 often throws up a winner neuron that is actually resonating at an overtone of the correct frequency. To counter this, the neighborhood of the half-frequency in Layer 1 is strengthened. In many pieces even this is not sufficient to bring out the correct frequency, and the overtone remains the dominant frequency detected by Layer 1. But this anomaly does not create a problem for Layer 2. Due to the strengthening of the half-frequency neighborhood, such an activation level pattern lies closer to the weight vector of the neuron that recognizes the pattern for the correct frequency than to that of the neuron that recognizes the pattern for the overtone. That is, if W_j and W_k are the Layer 2 weight vectors that recognize the correct and the overtone frequencies respectively, then d(W_j, A) < d(W_k, A). As a result, even though identification of the dominant frequency fails, classification of the activation level pattern into the appropriate category is done correctly. The results confirm this surmise.
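Putting together the distance computation, the winner-takes-all update and the decaying learning constant described above, Layer 2 training can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: the identifiers `layer2_epoch` and `train_layer2` are assumed, and the squared distance is used since it yields the same winner.

```cpp
#include <cstddef>
#include <vector>

// One epoch of Layer 2 training: for every normalized activation pattern,
// the neuron with the nearest weight vector (prototype) wins and its
// weights are pulled towards the pattern ("winner takes all").
void layer2_epoch(std::vector<std::vector<double>>& W,           // L x N weights
                  const std::vector<std::vector<double>>& patterns,
                  double eta) {
    for (const std::vector<double>& a : patterns) {
        std::size_t winner = 0;
        double best = -1.0;
        for (std::size_t j = 0; j < W.size(); ++j) {
            double d = 0.0;                    // Euclidean distance (squared)
            for (std::size_t i = 0; i < a.size(); ++i) {
                const double diff = a[i] - W[j][i];
                d += diff * diff;
            }
            if (best < 0.0 || d < best) { best = d; winner = j; }
        }
        for (std::size_t i = 0; i < a.size(); ++i)   // W_j* += eta (A - W_j*)
            W[winner][i] += eta * (a[i] - W[winner][i]);
    }
}

// Training schedule from the text: eta starts at 0.5, is reduced by 25%
// after each epoch, and training stops once eta falls below 0.1.
void train_layer2(std::vector<std::vector<double>>& W,
                  const std::vector<std::vector<double>>& patterns) {
    for (double eta = 0.5; eta >= 0.1; eta *= 0.75)
        layer2_epoch(W, patterns, eta);
}
```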
The process of weight adjustment described above brings the winning neuron's weight vector closer to those training patterns that fall within its zone of attraction. In fact, when further changes in the weights with each training pattern presentation become insignificant, each weight vector converges to the average of all those input patterns for which that neuron was the winner [6]. This fact is used to assign notes to the neurons of Layer 2: a representative frequency is calculated for each neuron by taking the average of the dominant frequencies of those pieces for which the neuron was the winner. But because even overtone frequencies get correctly mapped to the proper neurons in Layer 2, a direct calculation of the average would not give the appropriate representative frequency. To overcome this difficulty, half of the dominant frequency must be used in the average whenever the dominant frequency is an overtone. However, distinguishing an overtone from a genuine frequency in the same range is not obvious. So whenever the dominant frequency F_{i*} of a piece satisfies the condition F_min < ½ F_{i*}, the weight vector of the winning neuron is examined: the weights of the connections to the Layer 1 neighborhood that resonates at F_{i*} are compared with the weights of the connections to the Layer 1 neighborhood that resonates at ½ F_{i*}. If the values are comparable, it is inferred that the dominant frequency is an overtone and its half-frequency is taken for the calculation; otherwise the dominant frequency is used directly.

Musical notes are assigned to Layer 2 neurons depending on their representative frequencies. Since the frequencies of the basic notes roughly follow a geometric progression starting from the base frequency (Sa), it is necessary to somehow obtain the base-frequency value. Presently this value needs to be supplied to the model externally, but provision can easily be made to determine the base frequency using Layer 1. Given the base frequency, the entire set of frequencies in the octave is calculated. For each note, the neuron whose representative frequency is closest to the calculated frequency is designated as recognizing that note. In this manner all notes are assigned to Layer 2 neurons.
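The note-assignment step can be sketched as follows. Again this is only an illustration: `assign_notes` and `representative_hz` are assumed names, the base frequency of Sa is taken as externally supplied, and the rough ratio 2^{1/7} between successive shuddha notes from Section 3 is used.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Compute the expected shuddha note frequencies from the supplied base
// frequency (Sa) with the rough ratio 2^(1/7) between successive notes,
// and map each note to the Layer 2 neuron whose representative frequency
// is closest to the expected value.
std::vector<std::size_t>
assign_notes(double sa_hz, const std::vector<double>& representative_hz) {
    const std::vector<std::string> notes =
        {"Sa", "Re", "Ga", "Ma", "Pa", "Dha", "Nee", "Saa"};
    std::vector<std::size_t> note_to_neuron(notes.size(), 0);
    if (representative_hz.empty()) return note_to_neuron;
    for (std::size_t k = 0; k < notes.size(); ++k) {
        const double expected = sa_hz * std::pow(2.0, k / 7.0);  // Saa = 2 * Sa
        std::size_t best = 0;
        for (std::size_t j = 1; j < representative_hz.size(); ++j)
            if (std::abs(representative_hz[j] - expected) <
                std::abs(representative_hz[best] - expected))
                best = j;
        note_to_neuron[k] = best;     // neuron `best` now recognizes notes[k]
    }
    return note_to_neuron;
}
```

With Sa taken as 205 Hz, the expected frequencies reproduce the "calculated frequency" column of Table 1 in the Results section below.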
7. Results

The implementation is a software program written in C++ that accepts a digital representation of vocal sound and outputs the sequence of notes contained in it. The testing was carried out on the Windows 2000 platform.

7.1 Mapping of notes
Fig 5 shows the following two plots:
a. frequency versus piece number
b. Layer 2 neuron index versus piece number
The following observations can be made from these plots:
1. Each note gets mapped to at least one neuron.
2. Each neuron represents one and only one note.
3. Transitions get mapped to the remaining neurons.

Table 1: Neuron indices and their representative frequencies.

Neuron index   Note   Average frequency (Hz)   Calculated frequency (Hz)
4              sa     205.29                   205.00
7              re     229.35                   226.34
1              ga     249.81                   249.90
13             ma     274.55                   275.91
10             pa     306.47                   304.63
2              dha    338.07                   336.34
5              nee    378.21                   371.35
12             saa    404.31                   410.00

The table shows close agreement between the calculated and the representative average frequencies.

7.2 Verification of the system
Fig 6 shows the sequence of notes extracted from a piece of music used for verification. The actual sequence of notes in this piece is as follows:

Sa, sa re ga, ga ma pa, pa ma ga, ga ma pa, pa ma re ma ga
Sa, sa re ga, ga ma pa, pa ma ga, ma pa dha dha, pa ma pa dha nee saa pa saa
Dha, ma, dha, pa, ga, pa, ma, re, ma, ga, re, sa

Layer 2 of the system was originally trained with the piece plotted in Fig 5, which contained all the shuddha notes sung in ascending and descending order. It can be seen from Fig 6 that the sequence of notes identified by the system is in excellent agreement with that in the actual piece. The system is also able to pick up the transitions that are inevitable while singing.

7.3 Variation of the number of samples in a piece
Fig 7 and Fig 8 were plotted by varying the number of samples of each piece fed to Layer 1. The input used is the same as that for Fig 5. We can see from these graphs that even though the dominant frequency given by Layer 1 differs quite a bit with the number of samples, the note identified by Layer 2 does not change much. The following points should, however, be kept in mind:
1. Taking fewer samples may hamper the accuracy of the Fourier Transform.
2. Taking a larger number of samples can lead to a piece containing more than one note.

7.4 Variation of the number of neurons in Layer 1
Fig 9 was plotted by varying the number of neurons N in Layer 1. The input is the same as that for Fig 5. It can be observed from these plots that for N = 75 two notes, Sa and Re, were assigned to a single neuron in the first part, and Re and Ga were assigned to a single neuron in the last part; the problem did not occur for N = 150. In general, as N becomes large, the clusters of neurons in Layer 1 become more and more sparse and thus mutually exclusive, but a large N implies more computational overhead. Thus a middle value such as 100 is preferred for a single saptak. This allows a cluster of about 4 to 5 neurons for each of the 22 shrutees.

7.5 Clustering of neurons in Layer 1
Fig 10 shows the clustering of neurons in Layer 1 as they create neighborhoods that resonate with particular frequency ranges. The input is the same as that for Fig 5.

8. Conclusion

It is clear from the results that the neural network based model can be used for identification of the notes in a piece of classical music. The close match between the expected frequencies of the notes and those associated with the Layer 2 neurons confirms that this model classifies the notes correctly. The degree to which the model is able to correctly identify notes depends on the ability of Layer 1 to form clusters that resonate with particular ranges of frequencies; this ability is in turn limited by the number of neurons in Layer 1. Each time a person sings, the base frequency may be different, and thereby the frequencies of the other notes also change. Although this change may not be substantial, proper clustering of Layer 1 happens when it is allowed to train itself, even during use of the neural network. In addition, this allows the system to be independent of the singer.

In order to test the system, the pieces were chosen to have the following characteristics:
• The least interval for which a note is sung is about 0.4 seconds.
• The pieces contain only shuddha notes from a single saptak.
• The pieces contain only pure human voice, without any accompanying instrument.

9. Future Scope

This system may be easily extended to a musical system covering 3 saptaks with 22 shrutees in each.
The next logical step, after thorough verification of this model, must be the identification of a complete Raag as outlined earlier. A grammar-based approach will probably be useful in that case; even then, we think that a neural network would assist in dealing with the subtleties. The model can also be extended to musical pieces that include instruments; in fact, the same technique could be used to discriminate between various instruments. The model proposed here for the analysis of classical music should be treated as only a beginning. Recognition systems eventually move from analysis to synthesis, and the same applies here. One can dream of a day when robots will sing in concerts such as Sawai Gandharva and perform as well as humans.

10. References

[1] Klerfors, D.: http://hem.hj.se/~de96klda/NeuralNetworks.htm (1998)
[2] Luke, S.: Neural Networks Lecture 3: Hopfield and Kohonen Networks. http://cs.gmu.edu/~sean/cs687/neuronlecture3/
[3] Kohonen, T.: Self-Organization and Associative Memory. Springer-Verlag, Berlin (1989)
[4] Godse, M.: Sangeetshastra Parichay, pp. 25-49. Madhuranjan Prakashan, Pune (2002)
[5] Proakis, J.G., Manolakis, D.G.: Digital Signal Processing: Principles, Algorithms and Applications, pp. 14-39, 230-294. Prentice-Hall of India, New Delhi (2000)
[6] Mehrotra, K., Mohan, C.K., Ranka, S.: Elements of Artificial Neural Networks, pp. 157-197. Penram International Publishing (India), Mumbai (1997)

Figures

Fig 5: Mapping between notes and Layer 2 neurons (frequency and Layer 2 neuron index × 50 versus piece number).
Fig 6: Note versus piece number for the verification piece (note indices 0-7 correspond to Sa, Re, Ga, Ma, Pa, Dha, Nee and Saa; indices -1 and 0.5 mark transitions).
Fig 7: Plot for 1024 samples per piece (frequency and neuron index × 50 versus piece number).
Fig 8: Plot for 4096 samples per piece (frequency and neuron index × 50 versus piece number).
Fig 9: Variation of the number of Layer 1 neurons (N = 75 and N = 150).
Fig 10: Clustering of Layer 1 neurons (frequency versus neuron index, unclustered and clustered).