Music Therapy on Arousal
Zerui Ma, Shihong Ling, Shijie Zhou
1. Introduction
Music can influence people’s emotions in various ways. As music lovers, we found music
therapy an interesting realm that takes advantage of these influences. Therefore, we would like to
explore how music therapy works. To narrow our scope, and since one of the goals of music therapy is to decrease patients' stress, we decided to focus on stress reduction.
1.1 Background
As we did background research on stress reduction in music therapy, we found that music
preference contributes a lot to stress reduction. As one paper states, "Music preference plays a
critical role when music is used for stress reduction … This may be due to the fact that music
preference is positively correlated with the intensity of felt happiness (Hunter et al., 2010, Kreutz
et al., 2008) and peace (Kreutz et al., 2008)."[1] However, music preference is not always
effective in all scenarios. As another paper states, “Subject preferred music is too distracting and
therefore stimulates the subject rather than increasing relaxation."[2] Both statements are convincing, which left us confused about the role music preference plays in stress reduction.
1.2 Problem and Hypothesis
Stress is often accompanied by high arousal. Therefore, to resolve our confusion, we set out to answer the question "How important is music preference in arousal reduction?" Since we
acknowledge the critical role of music preference, we have the following hypothesis:
Subjects who listen to their preferred music will have greater reduction in arousal than subjects
who listen to music with the lowest arousal value.
In our experimental setting, the independent variable is the subject's music preference, and the dependent variable is the subject's reduction in arousal. Since the result could be influenced by the subject's initial emotional state, we treat the initial emotional state as a control variable.
2. System Design
Our experiment system consists of three components:
1) Generate a model to predict arousal values of music through machine learning.
2) Divide music into categories and use our model to compute the piece of music with the
lowest arousal value within each category.
3) Evaluate the subject's arousal through self-report surveys in a pre-test and a post-test surrounding an intervention.
Each member in our team is responsible for one of the components above, so here is our task
division:
1) Shihong Ling: Machine learning model generation.
2) Shijie Zhou: Music data collection and processing.
3) Zerui Ma: Experiment design, analysis and framework building.
3. Music Arousal Predicting Model
In this section, we first identify a rich music information dataset that contains detailed analyses of 1,000 songs. We then select the parts of the dataset we want to use as features and labels for model training. Next, we build a stacked machine learning model to predict arousal values for music using several well-established regression algorithms. Finally, we evaluate our model using root mean squared error (RMSE) and use it in the later experiments.
3.1 Dataset Selection - The PMEmo Dataset
The dataset we chose comes from a music emotion recognition paper, "The PMEmo Dataset for Music Emotion Recognition" [3]. It provides detailed information for 1,000 pop songs, divided into the following parts: metadata, the basic information for each song such as song id and artist name; chorus, the music clips for the pop songs; annotations, the emotion of each song in terms of arousal and valence; EDA, the electrodermal activity data recorded while listening to each song; comments, collected from both Chinese and English users on NetEaseMusic and SoundCloud; features, pre-computed acoustic features for each song; and lyrics, the lyrics files of the songs (Figure 1).
Figure 1
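To make the dataset structure concrete, here is a minimal sketch of joining the song-level annotations with the pre-computed features using pandas. The file and column names are assumptions about the distribution layout, not confirmed details of the PMEmo release:

    import pandas as pd

    # Hypothetical file names; the actual PMEmo distribution layout may differ.
    annotations = pd.read_csv("PMEmo/static_annotations.csv")  # song-level arousal/valence
    features = pd.read_csv("PMEmo/static_features.csv")        # pre-computed acoustic features

    # Join the features with the arousal labels on the song id column.
    data = features.merge(annotations[["musicId", "Arousal(mean)"]], on="musicId")
    print(data.shape)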
We consider the PMEmo dataset well suited to our research not only because it provides so much information that can serve as training features, but also because it provides well-defined arousal values as labels. The correctness of labels is very important in model training: if the labels are wrong, the trained model will inevitably perform badly. Our ideal dataset must have reliable arousal values regardless of how much other information it carries (even if it had only a few features, we could apply feature extraction to produce more).
We believe this dataset's arousal values are reliable because its producers performed a complete analysis of each song's emotion before converting it into arousal and valence values. When we talk about the emotion of a song, we need to consider the melody (the acoustic features, the most important part of any type of song), the listener's physical reaction while listening (which we consider the most accurate sign of a song's emotion, since lyrics or comments may lie but the human body does not), the lyrics, and the comments. This dataset accounts for all of these parts when producing its arousal values, so we are confident it is what we want for our research.
3.2 Feature Selection - Static Features
The label of our training dataset is clearly the arousal value, but we also need to decide what kind of features to use. After consideration, we decided to use acoustic features only and to set aside lyrics, comments, and electrodermal data.
We keep the acoustic features because they are the most important and most consistently available part of music. Whether or not a song has lyrics, comments, or listeners, we can always extract information from its melody. In fact, we do not throw the lyrics out entirely; instead, we fold them into the acoustic features: for songs that have lyrics, we add extra features, and for songs without lyrics those extra features take "None" values. In this way, we expect to reduce the impact of lyrics on prediction (to be fair to all songs in the experiments, since not all of them have lyrics) without ignoring lyrics completely.
For comments, as mentioned before, comments may lie. So unless we can carefully analyze a large number of comments from different reliable sources, we will not include them in our research; considering the workload and time cost, we had to leave them out. As for the electrodermal data, since we have no device to measure listeners' physical state during the experiments, we simply omit it.
After deciding what kind of features we want, we need a way to extract them. Here, we use a software tool called Opensmile, an efficient tool for audio analysis that provides many extraction configurations to help users extract the needed features precisely. Based on one of its configurations, we define the static features for our training.
The static feature set has 6,373 dimensions and is defined by the 2013 Computational Paralinguistics Challenge (ComParE) [4]. Its key components are frame-wise acoustic low-level descriptors, including MFCCs, energy, and others, which are highly relevant both to Music Information Retrieval and to general sound analysis. Because Opensmile ships a dedicated configuration for ComParE, we can conveniently extract these static features.
After settling on the features and labels, we split the PMEmo data to build a training set for our models.
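As an illustration of the extraction step, the opensmile Python package ships a closely related configuration, ComParE_2016, which also produces 6,373 functionals per file; a minimal sketch, with a hypothetical clip path:

    import opensmile

    # ComParE_2016 functionals: 6,373 static features per audio file,
    # closely related to the 2013 ComParE set described above.
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.ComParE_2016,
        feature_level=opensmile.FeatureLevel.Functionals,
    )

    # One row of static features for a single clip (path is hypothetical).
    features = smile.process_file("clips/song_0001.wav")
    print(features.shape)  # (1, 6373)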
3.3 Model Design and Evaluation
When designing our machine learning models, we first chose several regression models based on our knowledge and experience: lasso regression, elastic net regression, ridge regression, k-nearest neighbor regression, support vector regression, decision tree regression, and random forest regression. These algorithms are well established and widely applied in practice, but whether they suit our task must be determined through tuning and evaluation.
We evaluate our models using root mean squared error (RMSE). We first evaluate the models with default parameters (the values each method sets automatically) on a test set (or validation set) split out from the training set, and obtain the following results (Table 1):
Model   Lasso   Elastic   Ridge   kNN     SVR     DT      RF
RMSE    0.184   0.184     0.140   0.136   0.227   0.131   0.155

Table 1
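A minimal sketch of how such a baseline comparison can be produced with scikit-learn, assuming the feature matrix X and arousal labels y have been prepared as above (the split ratio is our choice):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import Lasso, ElasticNet, Ridge
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.svm import SVR
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    # X: 6,373-dimensional static features; y: arousal labels.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    models = {
        "Lasso": Lasso(), "Elastic": ElasticNet(), "Ridge": Ridge(),
        "kNN": KNeighborsRegressor(), "SVR": SVR(),
        "DT": DecisionTreeRegressor(), "RF": RandomForestRegressor(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
        print(f"{name}: RMSE = {rmse:.3f}")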
Since test error is generally worse than training error, we set a target of 0.1 for the models' RMSE. However, none of the models meets this target. To improve their performance, we use the GridSearchCV algorithm to tune the hyperparameters of each model. After many trials, we finally obtain better models with appropriate parameters (Table 2):
Model                          RMSE
Lasso (optimized alpha)        0.119
Elastic (optimized alpha)      0.119
Ridge (optimized alpha)        0.098
kNN (optimized k neighbors)    0.121
SVR-rbf (optimized kernel)     0.110
DT (optimized random state)    0.122
RF (optimized random state)    0.135

Table 2
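As an example of the tuning step, here is a sketch of GridSearchCV applied to ridge regression; the parameter grid shown is an illustrative choice, not our exact search space:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    # Tune Ridge's alpha with 5-fold cross-validation, scored by (negative) RMSE.
    search = GridSearchCV(
        Ridge(),
        param_grid={"alpha": np.logspace(-3, 3, 13)},  # illustrative grid
        scoring="neg_root_mean_squared_error",
        cv=5,
    )
    search.fit(X_train, y_train)
    print(search.best_params_, -search.best_score_)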
After hyperparameter tuning, our models perform better, but we still want to improve. We finally decide to use stacking to build a stronger regressor from the models we have. Stacking can not only improve performance but also mitigate potential overfitting or underfitting during training. We stack the top four models with their tuned parameters and obtain a stacked model whose RMSE is only 0.072. This is our final model, used in the later experiments.
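A sketch of such a stacked regressor with scikit-learn's StackingRegressor, combining the four best models from Table 2 (Ridge, SVR, Lasso, and Elastic Net); the hyperparameter values shown are placeholders, not our actual tuned values:

    from sklearn.ensemble import StackingRegressor
    from sklearn.linear_model import ElasticNet, Lasso, Ridge
    from sklearn.svm import SVR

    # The base models' predictions are fed to a final estimator that
    # learns how to combine them.
    stack = StackingRegressor(
        estimators=[
            ("ridge", Ridge(alpha=10.0)),          # placeholder alpha
            ("svr", SVR(kernel="rbf")),
            ("lasso", Lasso(alpha=0.001)),         # placeholder alpha
            ("elastic", ElasticNet(alpha=0.001)),  # placeholder alpha
        ],
        final_estimator=Ridge(),
    )
    stack.fit(X_train, y_train)
    predictions = stack.predict(X_val)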
4. Experiment Construction
In this section, we will introduce detailed experiment designs, including music selection,
experiment procedure, and our experiment system.
4.1 Music Selection
To test the effect of music preference, the first step is to define the categories of music that the
subject could choose from. Here, we had two options of categorization:
1) Use genres of music like jazz, rock, and pop.
2) Use the adjectives that could characterize music like epic, romantic, and funny.
Option 1 seems more intuitive since this is how music is categorized in our daily life. However,
in the field of music therapy, option 2 seems to be a better choice. First of all, in music therapy,
we are focused on the subject’s emotional state. Compared with genres of music, those adjectives
are more closely associated with emotions. When subjects see the adjectives, they can immediately tell the emotions expressed by the music. Moreover, the arousal values within each genre of
music could vary greatly. For example, some pop music is loud and exciting while some other
pop music is soft and soothing. In this case, the average arousal values for each genre are not
very useful for our experiment. Using the adjectives, on the other hand, could ensure that the
variance of arousal values within each category is relatively small. For these two reasons, we
chose option 2 to categorize the music.
On FESLIYANSTUDIOS, a royalty-free music website, we found the categorization of music
that we need. Since the music would be used for music therapy, we excluded some inappropriate
categories like Scary and Suspenseful. The final categories that we picked were Epic, Funny,
Happy, Motivating & Inspiring, Peaceful & Relaxing, and Romantic. For each category, 10 to 20
pieces of music were chosen. Then, we separated 30-second slices from each piece of music and
used the model in section 3 to predict their arousal values.
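The writeup does not specify our slicing tooling; as one possible implementation, here is a sketch using pydub to cut a piece into non-overlapping 30-second slices (paths are hypothetical):

    from pydub import AudioSegment

    def slice_track(path, out_prefix, slice_ms=30_000):
        """Cut a track into non-overlapping 30-second slices (pydub works in ms)."""
        audio = AudioSegment.from_file(path)
        out_paths = []
        for i, start in enumerate(range(0, len(audio) - slice_ms + 1, slice_ms)):
            out = f"{out_prefix}_{i:03d}.wav"
            audio[start:start + slice_ms].export(out, format="wav")
            out_paths.append(out)
        return out_paths

    slices = slice_track("music/epic_01.mp3", "slices/epic_01")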
The music category assigned by the website might not perfectly match a subject's opinion, so we made some effort to reduce the probability of disagreement on music type. During selection, we preferred pieces belonging to only one category, and we manually checked the features of the music we picked. Also, our main focus is the arousal value of each piece: even if a subject disagrees with the category tag, the experimental result will not be affected much as long as the arousal value is low.
Here are the arousal values for six different categories of music.
Figure 2: Arousal values of Epic music
Figure 3: Arousal values of Funny music
Figure 4: Arousal values of Happy music
Figure 5: Arousal values of Motivating & Inspiring music
Figure 6: Arousal values of Peaceful & Relaxing music
Figure 7: Arousal values of Romantic music
In the six figures above, the x-axis is the id of each music slice and the y-axis is the arousal value predicted by the model trained in section 3. Each red star is one slice of music we would use in the experiment. The id numbers of the 30-second slices were generated randomly.
Since the number of pieces varies across the six categories and different pieces have different lengths, the sample sizes of the categories can diverge considerably.
We then calculated the average arousal value for each of the six categories.
Category                 Number of slices   Average arousal value   Standard deviation
Epic                     61                 0.484900                0.116542
Funny                    56                 0.428562                0.171064
Happy                    92                 0.406093                0.150540
Motivating & Inspiring   128                0.440364                0.157373
Peaceful & Relaxing      104                0.138315                0.126787
Romantic                 64                 0.243290                0.126812

Table 3: Average arousal values for six different categories
Figure 8: Averaged arousal values for six different categories
From the results above, we can see that Peaceful & Relaxing music has the lowest average arousal value and Epic music has the highest. The standard deviations of Epic, Peaceful & Relaxing, and Romantic music are relatively low compared with Funny music. We therefore selected, from each category, the piece of music with the lowest arousal value and used these pieces in the intervention phase (a sketch of this computation follows); more details are covered in the following section.
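A sketch of both steps, the per-category summary behind Table 3 and the selection of the lowest-arousal slice per category, using pandas (variable names are placeholders for our predictions):

    import pandas as pd

    # One row per 30-second slice, with arousal predicted by the section 3 model.
    df = pd.DataFrame({
        "category": categories,    # e.g. "Epic", "Funny", ...
        "slice_id": slice_ids,
        "arousal": predicted_arousal,
    })

    # Per-category count, mean, and standard deviation (cf. Table 3).
    summary = df.groupby("category")["arousal"].agg(["count", "mean", "std"])

    # The single lowest-arousal slice within each category, for the intervention.
    lowest = df.loc[df.groupby("category")["arousal"].idxmin()]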
4.2 Detailed Experiment Design
We designed a between-subjects experiment and collected data from 18 subjects online. The subjects were divided evenly into a control group (N = 9) and an experimental group (N = 9). In the introduction to our experiment, subjects were only informed that we were studying the effect of music therapy, so they did not know the hypothesis. Subjects then went through the pre-test, intervention, and post-test. The detailed steps in each phase follow.
Phase 1: Pre-test
a. We would use Affective Slider to evaluate subjects’ arousal. In our system, their choices
would be mapped on the scale [0, 1] (Figure 9).
b. We would use PANAS-SF to evaluate subjects' emotional states. Based on their answers to the 20 questions, the system would calculate their Positive and Negative Affect scores (Figure 10); a scoring sketch follows the figures below.
c. Subjects would indicate their music preferences from the set {Epic, Funny, Happy,
Motivating & Inspiring, Peaceful & Relaxing, Romantic} (Figure 11).
Figure 9
Figure 10
Figure 11
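A minimal scoring sketch for the pre-test measures, assuming the standard 20-item PANAS assignment of items to the Positive and Negative Affect scales (each score is the sum of its ten items, range 10 to 50); the slider mapping is a simplified placeholder for our system's implementation:

    # Standard PANAS item numbers loading on each scale.
    POSITIVE_ITEMS = [1, 3, 5, 9, 10, 12, 14, 16, 17, 19]
    NEGATIVE_ITEMS = [2, 4, 6, 7, 8, 11, 13, 15, 18, 20]

    def panas_scores(answers):
        """answers: dict mapping item number (1-20) to a rating in 1..5."""
        pa = sum(answers[i] for i in POSITIVE_ITEMS)  # Positive Affect, 10-50
        na = sum(answers[i] for i in NEGATIVE_ITEMS)  # Negative Affect, 10-50
        return pa, na

    def slider_to_unit(position, slider_length):
        """Map an Affective Slider position onto the [0, 1] scale (Figure 9)."""
        return position / slider_length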
Phase 2: Intervention
Subjects in the experimental group would listen to a single piece of music with the lowest arousal value within their preferred category. Subjects in the control group would listen to the single piece of music with the lowest arousal value across all categories, so they would always listen to Peaceful & Relaxing music. All selected pieces are instrumental and have no lyrics. After listening to the music, subjects would proceed to the post-test; they did not need to rate the mood expressed by the music.
Phase 3: Post-test
a. We would use Affective Slider to evaluate subjects’ arousal after listening to music.
b. We would use PANAS-SF to evaluate subjects’ emotional states after listening to music.
5. Results and Analysis
Figure 12 shows the changes in valence and arousal for the 9 subjects in the control group (blue data points: pre-test; orange data points: post-test).
Figure 13 shows the changes in valence and arousal for the 9 subjects in the experimental group (blue data points: pre-test; orange data points: post-test).
Table 4 shows the statistics on increase in valence and reduction in arousal for both groups.
Table 5 shows the t-values and p-values after running t-tests on increase in valence and reduction
in arousal.
As we can see, compared with the experimental group, the control group shows a greater increase in valence and a greater reduction in arousal, but the differences are not significant, as indicated by the large p-values.
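For reference, here is a minimal sketch of these tests with SciPy; the array names are placeholders for our nine per-group measurements:

    from scipy import stats

    # Two-sample t-test: control vs. experimental reductions in arousal (cf. Table 5).
    t, p = stats.ttest_ind(control_reduction, experimental_reduction)
    print(f"t = {t:.4f}, p = {p:.4f}")

    # Pearson's correlation: initial arousal vs. reduction in arousal (cf. Table 6).
    r, p_r = stats.pearsonr(initial_arousal, reduction_in_arousal)
    print(f"r = {r:.4f}")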
Initial valence and arousal may correlate with increase in valence and reduction in arousal, so we
calculated Pearson's correlations for all conditions (Table 6). Among them, we observed a strong positive correlation between initial arousal and reduction in arousal in the control group, as well as a strong negative correlation between initial valence and increase in valence in the experimental
group. These results suggest that increase in valence is negatively correlated with initial valence
if subjects listened to the music they preferred, and reduction in arousal is positively correlated
with initial arousal if subjects did not listen to the music they preferred. This means that initial
valence and arousal may also contribute to change in valence and arousal, but the sample size
may be too small to reach high significance.
We are also interested in whether the subjects’ preferred categories are correlated with their
initial valence and arousal, so we showed subjects’ initial valence and arousal along with their
preferred categories on a scatter plot (Figure 14). As we can see, the preferred categories are distributed fairly randomly, suggesting that subjects' preferred categories are not correlated with their initial valence and arousal.
Figure 12
Figure 13
                                           Average   Standard deviation
Control group: Increase in valence         0.0333    0.0205
Experimental group: Increase in valence    0.0189    0.0456
Control group: Reduction in arousal        0.0689    0.0300
Experimental group: Reduction in arousal   0.0544    0.0566

Table 4
                       t-value   p-value
Increase in valence    0.8173    0.4258
Reduction in arousal   0.6380    0.5325

Table 5
Condition                                                      Pearson's correlation
Control group: Initial valence vs. Increase in valence         -0.4202
Experimental group: Initial valence vs. Increase in valence    -0.7779
Control group: Initial arousal vs. Reduction in arousal         0.9428
Experimental group: Initial arousal vs. Reduction in arousal    0.3944

Table 6
Figure 14
6. Conclusion
According to our data and analysis, our hypothesis that “Subjects who listen to their preferred
music will have greater reduction in arousal than subjects who listen to music with the lowest
arousal value” may not be valid. This could be due to the small sample size. Interestingly, as we examined the data, we observed two subjects in the experimental group whose arousal increased, rather than decreased, after listening to their preferred music. One of them preferred Happy music, the other preferred Funny music. This may provide evidence for the claim that
“Subject preferred music is too distracting and therefore stimulates the subject rather than
increasing relaxation."
We also discovered that increase in valence is negatively correlated with initial valence if
subjects listened to the music they preferred, and reduction in arousal is positively correlated
with initial arousal if subjects did not listen to the music they preferred. We are not certain of the cause, but we conjecture that higher initial arousal leaves more room for a reduction in arousal; similarly, higher initial valence limits the room for an increase in valence.
In future studies, we hope to recruit more subjects so that we can draw conclusions with greater statistical significance. We will also try Option 1 from section 4.1 to categorize our music and see whether different categorizations have different effects. Moreover, we will collect more pieces of music for our database and study more research in the field of music therapy. Once we have a considerable amount of music and a solid theoretical background, our experiment system could readily be turned into a practical application for conducting music therapy.
References
[1] Jiang, J., Rickson, D., & Jiang, C. (2016). The mechanism of music for reducing psychological stress: Music preference as a mediator. The Arts in Psychotherapy, 48, 62-68. https://doi.org/10.1016/j.aip.2016.02.002
[2] Pelletier, C. L. (2004). The effect of music on decreasing arousal due to stress: A meta-analysis. Journal of Music Therapy, 41(3), 192-214. https://doi.org/10.1093/jmt/41.3.192
[3] Zhang, K., Zhang, H., Li, S., Yang, C., & Sun, L. (2018). The PMEmo Dataset for Music Emotion Recognition. In Proceedings of the 2018 ACM International Conference on Multimedia Retrieval. https://doi.org/10.1145/3206025.3206037
[4] Schuller, B., Steidl, S., Batliner, A., Vinciarelli, A., Scherer, K., Ringeval, F., Chetouani, M., Weninger, F., Eyben, F., Marchi, E., et al. (2013). The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings of INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France.