Music Therapy on Arousal
Zerui Ma, Shihong Ling, Shijie Zhou

1. Introduction

Music can influence people's emotions in many ways. As music lovers, we find music therapy an interesting field that takes advantage of these influences, so we set out to explore how music therapy works. To narrow the scope, we focus on one of the goals of music therapy: reducing patients' stress.

1.1 Background

Our background research on stress reduction in music therapy showed that music preference contributes substantially to stress reduction. As one paper states, "Music preference plays a critical role when music is used for stress reduction … This may be due to the fact that music preference is positively correlated with the intensity of felt happiness (Hunter et al., 2010, Kreutz et al., 2008) and peace (Kreutz et al., 2008)" [1]. However, music preference is not effective in every scenario. As another paper states, "Subject preferred music is too distracting and therefore stimulates the subject rather than increasing relaxation" [2]. Both statements are convincing, which left us uncertain about the role music preference plays in stress reduction.

1.2 Problem and Hypothesis

Stress is often accompanied by high arousal. To resolve our confusion, we therefore ask: how important is music preference in arousal reduction? Since we acknowledge the critical role of music preference, we form the following hypothesis: subjects who listen to their preferred music will show a greater reduction in arousal than subjects who listen to the music with the lowest arousal value. In our experimental setting, the independent variable is the subject's music preference and the dependent variable is the subject's reduction in arousal. Since the result could be influenced by the subject's initial emotional state, we treat the initial emotional state as a control variable.

2. System Design

Our experiment system consists of three components:
1) A machine learning model that predicts arousal values of music.
2) A music library divided into categories, in which our model identifies the piece of music with the lowest arousal value within each category.
3) An evaluation of the subject's arousal through self-report surveys in a pre-test and a post-test around an intervention.

Each member of our team is responsible for one of these components:
1) Shihong Ling: machine learning model generation.
2) Shijie Zhou: music data collection and processing.
3) Zerui Ma: experiment design, analysis, and framework building.

3. Music Arousal Prediction Model

In this section, we first identify a music information dataset that contains detailed analyses of 1,000 songs. We then select the parts of the dataset we want to use as features and labels for model training. Next, we build a stacked machine learning model that predicts arousal values for music from several well-established regression algorithms. Finally, we evaluate our model using root mean squared error (RMSE) and use it in the later experiments.

3.1 Dataset Selection - The PMEmo Dataset

The dataset we chose comes from a music emotion recognition paper, "The PMEmo Dataset for Music Emotion Recognition" [3]. It provides detailed information for 1,000 pop songs and comprises the following parts (Figure 1):
1) metadata: basic information for each song, such as song ID and artist name;
2) chorus: the music clips for the pop songs;
3) annotations: the emotion of each song in terms of arousal and valence;
4) EDA: electrodermal activity data recorded while listening to each song;
5) comments: comments from Chinese and English users of NetEaseMusic and SoundCloud;
6) features: pre-computed acoustic features for each song;
7) lyrics: the lyrics files of the songs.

Figure 1: Overview of the PMEmo dataset
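To make the data handling concrete, the following is a minimal sketch of how the features and labels might be assembled. The file paths and column names (e.g. "musicId", "Arousal(mean)") are our assumptions about the release layout and should be checked against the actual PMEmo download; the resulting X and y are reused in the modeling sketch in Section 3.3.

```python
# Minimal sketch of assembling features and labels from PMEmo. File and
# column names are hypothetical; adjust them to the actual release.
import pandas as pd

features = pd.read_csv("PMEmo/features/static_features.csv")       # acoustic features per song
labels = pd.read_csv("PMEmo/annotations/static_annotations.csv")   # arousal/valence per song

# Join features and labels on the song identifier; arousal is our label.
data = features.merge(labels, on="musicId")
X = data.drop(columns=["musicId", "Arousal(mean)", "Valence(mean)"])
y = data["Arousal(mean)"]
print(X.shape, y.shape)
```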
PMEmo suits our research well not only because it provides rich information that can serve as training features, but also because it contains well-defined arousal values that can serve as labels. Label correctness is critical when training models: a model trained on wrong labels will inevitably perform poorly. Our ideal dataset must have reliable arousal values, however much other information it carries (even if it offered only a few features, we could use feature extraction to produce more). We believe this dataset's arousal values are reliable because its creators analyzed each song's emotion thoroughly and then translated that analysis into arousal and valence values. When we talk about the emotion of a song, we need to consider the melody (acoustic features, the most important component for any type of song), the listener's physiological reaction (which we consider the most accurate indicator of a song's emotion, since lyrics or comments may lie but the human body does not), the lyrics, and the comments. This dataset draws on all of these parts in producing its arousal values, so we are confident it is what we need for our research.

3.2 Feature Selection - Static Features

The label for our training data is clearly the arousal value; we also need to decide which features to use. After consideration, we decided to use acoustic features only and to give up lyrics, comments, and electrodermal data. We keep acoustic features because they are the most important and most widely available part of music: whether or not a song has lyrics, comments, or listeners, we can always extract information from its melody. In fact, we do not discard lyrics entirely. Instead, we include them as part of the acoustic features: for songs that have lyrics, we add extra features; otherwise those extra features take "None" values. In this way, we expect to reduce the impact of lyrics on prediction (to be fair to all songs in the experiments, since not all of them have lyrics) without ignoring lyrics completely. As for comments, as mentioned above, comments may lie, so unless we could carefully analyze enough comments from several reliable sources, we would not include them in our research; given the workload and time cost, we have to leave them out. Finally, since we have no device to measure listeners' physiological state during the experiments, we simply drop the electrodermal data.

Having decided which kind of features we want, we need a way to extract them. Here we use a software tool called openSMILE, an efficient tool for music analysis. It provides many extraction configurations that let users extract exactly the features they need, and based on one of these configurations we define the static features for our training. The static feature set has 6,373 dimensions and is defined by the 2013 Computational Paralinguistics Challenge (ComParE) [4]. Its key components are frame-wise acoustic low-level descriptors, including MFCCs, energy, and so on; these descriptors are highly relevant both for music information retrieval and for general sound analysis. Because openSMILE ships with a dedicated ComParE configuration, we can extract these static features conveniently (a command-line sketch is given below). After clarifying the features and labels we want, we split the PMEmo data to build a training set for our models.
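As a concrete illustration of the extraction step, here is a sketch of batch-extracting the ComParE static features by calling openSMILE's command-line tool. The config file name (IS13_ComParE.conf) and the -csvoutput and -instname options match common openSMILE releases, but exact paths and flags depend on the installed version, so treat this as a template rather than a definitive invocation; the clip folder name is hypothetical.

```python
# Sketch: run openSMILE's SMILExtract on each 30-second clip and collect the
# 6,373-dimensional ComParE functionals as CSV rows. Flags and config path
# may differ across openSMILE versions.
import subprocess
from pathlib import Path

OPENSMILE = "SMILExtract"                # openSMILE command-line binary
CONFIG = "config/IS13_ComParE.conf"      # ComParE 2013 static feature set

def extract_static_features(wav_path: Path, out_csv: Path) -> None:
    """Run openSMILE on one clip; its feature vector is appended to out_csv."""
    subprocess.run(
        [OPENSMILE, "-C", CONFIG,
         "-I", str(wav_path),            # input audio clip
         "-csvoutput", str(out_csv),     # functionals written as one CSV row
         "-instname", wav_path.stem],    # tag the row with the clip name
        check=True,
    )

# Hypothetical folder of 30-second clips.
for wav in sorted(Path("clips").glob("*.wav")):
    extract_static_features(wav, Path("compare_features.csv"))
```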
3.3 Model Design and Evaluation

To design our machine learning models, we first chose several regression models based on our knowledge and experience: lasso regression, elastic net regression, ridge regression, k-nearest-neighbor regression, support vector regression, decision tree regression, and random forest regression. These algorithms are well established and widely used in practice, but whether they suit our task has to be determined by further tuning and evaluation. We evaluate our models using root mean squared error (RMSE). We first evaluated the models with default parameters (those set automatically by each method) on a test set (or validation set) split from the training data, with the following results (Table 1):

Model   Lasso   Elastic   Ridge   kNN     SVR     DT      RF
RMSE    0.184   0.184     0.140   0.136   0.227   0.131   0.155

Table 1: RMSE of the models with default parameters

Since test error is generally worse than training error, we set 0.1 as the target RMSE for our models. None of the models met this target, so to improve their performance we used scikit-learn's GridSearchCV to tune the hyperparameters of each model. After many attempts, we obtained better models with appropriate parameters (Table 2):

Model (tuned parameter)          RMSE
Lasso (alpha)                    0.119
Elastic Net (alpha)              0.119
Ridge (alpha)                    0.098
kNN (number of neighbors)        0.121
SVR-rbf (kernel)                 0.110
Decision Tree (random state)     0.122
Random Forest (random state)     0.135

Table 2: RMSE of the models after hyperparameter tuning

After hyperparameter tuning, our models perform better, but we still wanted to improve. We therefore decided to use stacking to build a stronger regressor from the models we had; stacking can not only improve performance but also help guard against overfitting or underfitting during training. We stacked the four best tuned models and obtained a stacked model whose RMSE is only 0.072. This is our final model, used in the later experiments. A condensed sketch of this pipeline follows.
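The following scikit-learn sketch shows the shape of the pipeline: tune the four best models from Table 2 (Ridge, SVR, Lasso, Elastic Net) with GridSearchCV and then stack them. The parameter grids are illustrative placeholders, not the exact grids we searched, and X and y are the feature matrix and arousal labels assembled in the sketch in Section 3.1.

```python
# Condensed sketch: tune four regressors with GridSearchCV, report test RMSE,
# then combine them with StackingRegressor. Grids are illustrative only.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Lasso, ElasticNet, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import StackingRegressor
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

searches = {
    "ridge":   GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 2, 20)}),
    "svr":     GridSearchCV(SVR(kernel="rbf"), {"C": [0.1, 1, 10]}),
    "lasso":   GridSearchCV(Lasso(), {"alpha": np.logspace(-4, 0, 20)}),
    "elastic": GridSearchCV(ElasticNet(), {"alpha": np.logspace(-4, 0, 20)}),
}
for name, search in searches.items():
    search.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, search.predict(X_test)))
    print(f"{name}: RMSE = {rmse:.3f}")

# Stack the tuned models; StackingRegressor fits a final estimator
# (a ridge regressor by default) on their cross-validated predictions.
stack = StackingRegressor(
    estimators=[(name, s.best_estimator_) for name, s in searches.items()]
)
stack.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, stack.predict(X_test)))
print(f"stacked: RMSE = {rmse:.3f}")
```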
4. Experiment Construction

In this section we describe the experiment design in detail, including music selection, the experiment procedure, and our experiment system.

4.1 Music Selection

To test the effect of music preference, the first step is to define the categories of music the subject can choose from. We had two categorization options:
1) Use genres of music, such as jazz, rock, and pop.
2) Use adjectives that characterize music, such as epic, romantic, and funny.

Option 1 seems more intuitive, since this is how music is categorized in daily life. In the field of music therapy, however, option 2 is the better choice. First, music therapy focuses on the subject's emotional state, and adjectives are more closely associated with emotions than genres are: when subjects see the adjectives, they immediately know which emotions the music expresses. Moreover, arousal values within a single genre can vary greatly. For example, some pop music is loud and exciting while other pop music is soft and soothing, so the average arousal value of a genre is not very useful for our experiment. Adjectives, on the other hand, keep the variance of arousal values within each category relatively small. For these two reasons, we chose option 2 to categorize the music.

On FESLIYANSTUDIOS, a royalty-free music website, we found the categorization we needed. Since the music would be used for music therapy, we excluded inappropriate categories such as Scary and Suspenseful. The final categories were Epic, Funny, Happy, Motivating & Inspiring, Peaceful & Relaxing, and Romantic. For each category we chose 10 to 20 pieces of music. We then cut 30-second slices from each piece and used the model from Section 3 to predict their arousal values.

The category a website assigns to a piece might not match a given subject's opinion, so we made some effort to reduce the probability of disagreement about music type: during selection we preferred pieces belonging to only one category, and we manually checked the features of the pieces we picked. In any case, our main focus is the arousal value of each piece: even if a subject disagrees with the category tag, the experimental result is not affected much as long as the arousal value is low. The arousal values for the six categories are shown below.

Figure 2: Arousal values of Epic music
Figure 3: Arousal values of Funny music
Figure 4: Arousal values of Happy music
Figure 5: Arousal values of Motivating & Inspiring music
Figure 6: Arousal values of Peaceful & Relaxing music
Figure 7: Arousal values of Romantic music

In the six figures above, the x-axis is the ID of a music slice and the y-axis is the arousal value predicted by the model trained in Section 3; each red star is one slice of music used in the experiment. The IDs of the 30-second slices were generated randomly. Since the number of pieces varies across the six categories and pieces differ in length, the number of slices per category diverges considerably. We then calculated the average arousal value for each of the six categories (Table 3, Figure 8); a sketch of this computation is given at the end of this subsection.

Category                Epic       Funny      Happy      Motivating & Inspiring   Peaceful & Relaxing   Romantic
Number of slices        61         56         92         128                      104                   64
Average arousal value   0.484900   0.428562   0.406093   0.440364                 0.138315              0.243290
Standard deviation      0.116542   0.171064   0.150540   0.157373                 0.126787              0.126812

Table 3: Average arousal values for the six categories

Figure 8: Average arousal values for the six categories

From these results, Peaceful & Relaxing music has the lowest average arousal value and Epic music the highest. The standard deviations of Epic, Peaceful & Relaxing, and Romantic music are relatively low compared with Funny music. We therefore selected from each category the piece of music with the lowest arousal value and used these pieces in the intervention phase. More details are given in the next section.
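Here is a sketch of the per-category summary behind Table 3 and of one plausible way to pick the lowest-arousal piece per category (ranking pieces by the mean predicted arousal of their slices). The input file and the column names "category", "piece", and "arousal" are hypothetical placeholders for our slice-level predictions.

```python
# Sketch: summarize predicted arousal per category (Table 3) and pick the
# calmest piece per category. One row per 30-second slice; column names are
# hypothetical.
import pandas as pd

slices = pd.read_csv("slice_arousal_predictions.csv")

# Count, mean, and standard deviation of arousal per category (Table 3).
print(slices.groupby("category")["arousal"].agg(["count", "mean", "std"]))

# Mean predicted arousal per piece, then the calmest piece in each category,
# which becomes that category's intervention piece.
piece_means = slices.groupby(["category", "piece"])["arousal"].mean()
print(piece_means.groupby(level="category").idxmin())
```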
4.2 Detailed Experiment Design

We designed a between-subjects experiment and collected data from 18 subjects online, divided evenly into a control group (N = 9) and an experimental group (N = 9). In the introduction to the experiment, subjects were told only that we were studying the effect of music therapy, so they did not know the hypothesis. Subjects then went through a pre-test, an intervention, and a post-test, with the following steps in each phase.

Phase 1: Pre-test
a. We used the Affective Slider to evaluate subjects' arousal. In our system, their choices are mapped onto the scale [0, 1] (Figure 9).
b. We used the PANAS-SF to evaluate subjects' emotional states. Based on their answers to the 20 questions, the system calculates their Positive and Negative Affect Scores (Figure 10).
c. Subjects indicated their music preference from the set {Epic, Funny, Happy, Motivating & Inspiring, Peaceful & Relaxing, Romantic} (Figure 11).

Figure 9: The Affective Slider
Figure 10: The PANAS-SF questionnaire
Figure 11: Music preference selection

Phase 2: Intervention
Subjects in the experimental group listened to the single piece of music with the lowest arousal value in their preferred category. Subjects in the control group listened to the single piece with the lowest arousal value across all categories, so they always heard Peaceful & Relaxing music. All selected pieces are instrumental music without lyrics. After listening, subjects proceeded to the post-test; they did not need to rate the mood expressed by the music.

Phase 3: Post-test
a. We used the Affective Slider to evaluate subjects' arousal after listening to the music.
b. We used the PANAS-SF to evaluate subjects' emotional states after listening to the music.

5. Results and Analysis

Figure 12 shows the changes in valence and arousal for the 9 subjects in the control group, and Figure 13 shows the same for the 9 subjects in the experimental group (blue data points: pre-test; orange data points: post-test). Table 4 reports statistics on the increase in valence and the reduction in arousal for both groups, and Table 5 reports the t-values and p-values from t-tests on these two measures. As we can see, the control group shows a greater increase in valence and a greater reduction in arousal than the experimental group, but the differences are not significant, as indicated by the large p-values.

Initial valence and arousal may correlate with the increase in valence and the reduction in arousal, so we calculated Pearson's correlations for all conditions (Table 6). Among them, we observed a high positive correlation between initial arousal and reduction in arousal in the control group, and a high negative correlation between initial valence and increase in valence in the experimental group. These results suggest that the increase in valence is negatively correlated with initial valence when subjects listened to the music they preferred, and that the reduction in arousal is positively correlated with initial arousal when subjects did not listen to the music they preferred. Initial valence and arousal may thus also contribute to the change in valence and arousal, but the sample size may be too small to reach significance. We were also interested in whether subjects' preferred categories correlate with their initial valence and arousal, so we plotted subjects' initial valence and arousal along with their preferred categories on a scatter plot (Figure 14). As we can see, the distribution of preferred categories appears fairly random, suggesting that preferred categories are not correlated with initial valence and arousal. The sketch below shows how the statistics in Tables 5 and 6 can be computed.
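A minimal sketch of the analysis behind Tables 5 and 6, assuming the survey responses are stored in a CSV with the hypothetical columns "group", "pre_arousal", "post_arousal", "pre_valence", and "post_valence" (one row per subject):

```python
# Sketch: per-subject change scores, between-group t-tests (Table 5), and
# Pearson's correlations with the initial state (Table 6).
import pandas as pd
from scipy import stats

df = pd.read_csv("survey_responses.csv")
df["d_arousal"] = df["pre_arousal"] - df["post_arousal"]   # reduction in arousal
df["d_valence"] = df["post_valence"] - df["pre_valence"]   # increase in valence

ctrl = df[df["group"] == "control"]
exp = df[df["group"] == "experimental"]

# Independent two-sample t-tests between the groups (Table 5).
for col in ("d_valence", "d_arousal"):
    t, p = stats.ttest_ind(ctrl[col], exp[col])
    print(f"{col}: t = {t:.4f}, p = {p:.4f}")

# Pearson's correlations between initial state and change (Table 6).
for name, grp in (("control", ctrl), ("experimental", exp)):
    r_val, _ = stats.pearsonr(grp["pre_valence"], grp["d_valence"])
    r_aro, _ = stats.pearsonr(grp["pre_arousal"], grp["d_arousal"])
    print(f"{name}: valence r = {r_val:.4f}, arousal r = {r_aro:.4f}")
```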
Figure 12: Changes in valence and arousal for the control group
Figure 13: Changes in valence and arousal for the experimental group

                                              Average   Standard deviation
Control group: Increase in valence            0.0333    0.0205
Experimental group: Increase in valence       0.0189    0.0456
Control group: Reduction in arousal           0.0689    0.0300
Experimental group: Reduction in arousal      0.0544    0.0566

Table 4: Statistics on increase in valence and reduction in arousal

                        t-value   p-value
Increase in valence     0.8173    0.4258
Reduction in arousal    0.6380    0.5325

Table 5: t-test results for increase in valence and reduction in arousal

Condition                                                      Pearson's correlation
Control group: Initial valence vs. Increase in valence         -0.4202
Experimental group: Initial valence vs. Increase in valence    -0.7779
Control group: Initial arousal vs. Reduction in arousal         0.9428
Experimental group: Initial arousal vs. Reduction in arousal    0.3944

Table 6: Pearson's correlations between initial state and change

Figure 14: Subjects' initial valence and arousal, labeled by preferred category

6. Conclusion

According to our data and analysis, our hypothesis that subjects who listen to their preferred music will have a greater reduction in arousal than subjects who listen to music with the lowest arousal value is not supported; this could be due to the small sample size. Interestingly, when examining the data we found two subjects in the experimental group whose arousal increased, rather than decreased, after listening to their preferred music; one preferred Happy music and the other Funny music. This may provide evidence for the claim that "Subject preferred music is too distracting and therefore stimulates the subject rather than increasing relaxation" [2]. We also found that the increase in valence is negatively correlated with initial valence when subjects listened to the music they preferred, and that the reduction in arousal is positively correlated with initial arousal when subjects did not. We are not certain of the cause, but we conjecture that higher initial arousal leaves more room for arousal to drop, while higher initial valence leaves less room for valence to rise.

In future studies we hope to recruit more subjects so that we can draw conclusions with higher statistical power. We will also try option 1 from Section 4.1 to categorize the music and see whether different categorizations have different effects. Moreover, we plan to collect more pieces of music for our database and study more research in the field of music therapy. Once we have a sufficiently large music collection and a solid theoretical background, our experiment system could be turned into a practical application for conducting music therapy.

References

[1] Jiang, J., Rickson, D., & Jiang, C. (2016). The mechanism of music for reducing psychological stress: Music preference as a mediator. The Arts in Psychotherapy, 48, 62-68. https://doi.org/10.1016/j.aip.2016.02.002
[2] Pelletier, C. L. (2004). The effect of music on decreasing arousal due to stress: A meta-analysis. Journal of Music Therapy, 41(3), 192-214. https://doi.org/10.1093/jmt/41.3.192
[3] Zhang, K., Zhang, H., Li, S., Yang, C., & Sun, L. (2018). The PMEmo Dataset for Music Emotion Recognition. Proceedings of the 2018 ACM International Conference on Multimedia Retrieval. https://doi.org/10.1145/3206025.3206037
[4] Schuller, B., Steidl, S., Batliner, A., Vinciarelli, A., Scherer, K., Ringeval, F., Chetouani, M., Weninger, F., Eyben, F., Marchi, E., et al. (2013). The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism.
In Proceedings of INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France.