Analysis of Music Popularity on Spotify
ID: 34716203

1 Introduction

This report uses a range of statistical techniques, including descriptive statistics, hypothesis testing, ANOVA and regression, to give insight into factors that may affect track popularity on Spotify. These factors include audio features, factual attributes (e.g. the number of artists contributing to a track) and relatively arbitrary, subjective attributes (i.e. the genre of a track).

2 Dataset Description

The report uses the dataset "30000 Spotify Songs" downloaded from Kaggle. The dataset was originally collected from genre-based playlists on Spotify, so a few duplicate tracks were found, as one track can be included in multiple playlists. As the main subject of this analysis is the audio features of tracks, duplicate records that only represent different playlists were discarded. That said, the report also analyses other factual features of tracks, for example year of release and genre. However, as mentioned, duplicates arise from the same song appearing in potentially different genre-based playlists, so genre can be a rather controversial attribute. It is indeed often hard to determine the genre of a track, given that classification standards can be arbitrary and related genres overlap a lot. After pre-processing, 18 columns in total are kept for the following analysis. Pre-processing was done entirely in Python, as the Spotify API had to be used to pull artist information for each track (the original dataset lists only one artist per track). All other analysis was done in R. Table 1 gives the data dictionary.

3 Exploratory Data Analysis

3.1 Top 20 Artists

Figure 1 shows the average track popularity of the artists with the most tracks in the dataset. We can see artists of both the current and the previous generation, with average popularity ranging from around 20 to 60.
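The rankings behind Figure 1 and Figure 2 boil down to a group-by on artist with a count and a mean. A minimal pandas sketch on toy data; the column names follow Table 1, but the values are invented:

```python
import pandas as pd

# Toy stand-in for the dataset (values invented for illustration).
df = pd.DataFrame({
    "artists":          ["Queen", "Queen", "Drake", "Drake", "Drake", "ABBA"],
    "track_popularity": [55, 45, 80, 70, 75, 60],
})

# Per-artist track count and average popularity.
stats = (df.groupby("artists")["track_popularity"]
           .agg(n_tracks="count", avg_popularity="mean"))

# Figure 1: artists with the most tracks, showing their average popularity.
by_count = stats.sort_values("n_tracks", ascending=False).head(20)
# Figure 2: artists with the highest average popularity, showing track counts.
by_popularity = stats.sort_values("avg_popularity", ascending=False).head(20)
print(by_count)
```

On the real data the same two `sort_values` calls, over roughly 30000 rows, produce the two rankings plotted in the figures.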
It's not surprising to see that 130 tracks come from Queen; the relatively low average popularity, however, might be attributed to the years those tracks were released. Figure 2 shows the ranking in a different way, picking out the artists with the highest average track popularity. Though these artists have high scores, the number of tracks included for each is relatively small, so a high average might reflect a single hit rather than being a representative indicator of the artist's popularity.

Table 1: Data Dictionary

track popularity: The popularity of a track is a value between 0 and 100, with 100 being the most popular. Popularity is calculated by algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently.
artists: Name of the artist(s) of the song.
artists count: Number of artists contributing to the song.
released year: Year when the song was released.
released month: Month when the song was released. 0 represents unknown.
playlist genre: Genre of the playlist from which the track is found.
key: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation.
mode: The modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
tempo: Beats per minute.
danceability: How suitable a track is for dancing. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: A measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity.
loudness: The overall loudness of a track in decibels (dB). Values typically range between -60 and 0 dB.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence that the track is acoustic.
instrumentalness: Predicts whether a track contains no vocals. The closer the value is to 1.0, the greater the likelihood the track contains no vocal content.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer the value is to 1.0.
duration: Duration of the song in seconds.

Figure 1: Average Track Popularity of Artists with the Top 20 Numbers of Tracks
Figure 2: Number of Tracks of Artists with the Top 20 Average Track Popularity

3.2 Track Popularity Distributions and Average Track Popularity

Figure 3 shows the distribution of track popularity. A large number of tracks have popularity close to 0, which is predictable: the dataset holds nearly 30000 tracks, but songs that are popular with the mainstream are normally not that numerous. Leaving aside the extreme left part, the distribution shows a bell shape similar to a normal distribution. Figure 4 shows average track popularity and number of tracks by year, month, genre and key respectively. Leaving aside 2020 (the dataset was collected in the middle of 2020), the number of tracks roughly always increases as time moves forward. This might be caused by selection bias when Spotify or its users created these playlists, as people tend to listen to more recent music.
Following a similar rationale, the popularity of tracks released last century does not seem to be rivalled by that of recent ones; this could be because if those old songs are still selected into today's playlists, they have most likely stood the test of time and are genuinely popular songs of all time. In contrast, there might be many recent songs that are selected yet not very popular. Popularity does not seem to differ by the month in which a track was released, which is not surprising, as common sense gives no straightforward reason why popularity would be linked to release month. However, it is quite interesting that January has the most tracks released within the scope of this dataset. Across all time, pop seems to be the most popular genre. However, it does not stand out much from the others, except from EDM (Electronic Dance Music). The report digs deeper into the interaction between year and genre later. The distribution of average track popularity by key shows only minor differences. However, it is worth noting that tracks with key 3 appear distinctly the least often in the dataset.

Figure 3: Track Popularity Distribution
Figure 4: Average Track Popularity by Different Dimensions

Another reason for plotting the number of tracks in each group is that the following hypothesis testing needs a sufficient sample size in each group for the results to be valid. Thus released year is grouped into "before 2000", "2000-2010" and "2011-2020" for later analysis.

3.3 Audio Features Distribution

Figure 5 shows the distribution of the audio features. The valence distribution resembles a fat bell curve, indicating that most music lies between very positive and very negative, and that tracks in general are similar in terms of valence. Tempo and duration are also quite evenly distributed on both sides. The distributions of danceability, energy and loudness all lean to the right, which might reflect mainstream taste.
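The year grouping just described ("before 2000", "2000-2010", "2011-2020") can be sketched with `pd.cut`; the column name `released_year` follows Table 1, the sample years are invented:

```python
import pandas as pd

# Toy released_year values spanning all three periods.
released_year = pd.Series([1978, 1999, 2000, 2005, 2010, 2011, 2019, 2020])

# Right-closed bins: (-inf, 1999] -> before 2000, (1999, 2010] -> 2000-2010,
# (2010, 2020] -> 2011-2020.
period = pd.cut(
    released_year,
    bins=[-float("inf"), 1999, 2010, 2020],
    labels=["before 2000", "2000-2010", "2011-2020"],
)
print(period.value_counts())
```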
The leftward lean of acousticness and speechiness mirrors the rightward lean of the three features just mentioned. The tall spike in the instrumentalness distribution indicates there is almost no light (instrumental-only) music in the dataset. The leftward lean of liveness shows that most tracks in the dataset were produced in a studio rather than recorded at live performances.

Figure 5: Features Distribution

4 Hypothesis Testing

4.1 H0: the number of artists doesn't affect track popularity

4.1.1 Methodology

Table 2 shows the number of tracks in each number-of-artists group. As population mean comparisons will be carried out and the populations cannot be assumed normal, the central limit theorem is needed to ensure the validity of the results, which requires a sufficient sample size in each group. Considering the total volume and the sizes of the other groups, the threshold is set to 100. Thus only tracks made by at most 5 artists are considered.

Number of artists   1      2     3     4    5    6   7   8   9  10  11
Number of tracks    18105  7080  2280  584  189  50  37  14  9  3   1

Table 2: Number of Tracks in Each Number of Artists Group

As there are multiple group means to be compared, one-way ANOVA is used first to test whether all population means are the same. If the result shows that the difference between the population means is significant, pair-wise tests are then performed to locate the differences between groups. Specifically, Tukey's HSD test and Fisher's LSD test are run.

4.1.2 Result

The F value is 32.23 and the corresponding p value is less than 2e-16, so H0 is rejected: there are significant mean popularity differences between tracks with different numbers of contributing artists. The HSD and LSD tests give the same conclusion, as shown in the output of the LSD test (Table 3). Tracks with 4 and 5 artists are in group a (their means are not significantly different). Tracks with 3 and 2 artists are in the same group b. Tracks with 1 artist form their own group c. The result also gives the order of the groups in terms of popularity.
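The report's ANOVA and post hoc tests were run in R; an equivalent sketch in Python on synthetic data (group means are invented and exaggerated for illustration, not the reported values):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Toy popularity samples for tracks with 1, 2 and 3 artists.
pop = {k: rng.normal(loc=m, scale=10, size=200)
       for k, m in {1: 35, 2: 41, 3: 45}.items()}

# Global one-way ANOVA F test.
f_stat, p_val = stats.f_oneway(pop[1], pop[2], pop[3])
print(f"F = {f_stat:.2f}, p = {p_val:.3g}")

# Post hoc pairwise comparison (Tukey's HSD).
values = np.concatenate(list(pop.values()))
groups = np.repeat(list(pop.keys()), 200)
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```

`pairwise_tukeyhsd` prints which pairs of groups differ significantly, the analogue of the letter groupings in Table 3.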
Number of Artists   Track Popularity   Group
4                   45.41952           a
5                   45.12698           a
3                   41.22632           b
2                   40.70113           b
1                   38.21298           c

Table 3: LSD Test Result for 4.1.2

One interpretation is that the more artists are involved in a track's creation, the more popular the track is. This fits common sense: a famous artist is likely to be invited to collaborate on a track, or is able to find collaborators, while tracks with a single artist, as the counts also suggest, may be easier to produce and are thus less popular.

However, it is worth noting that the LSD test inflates the Type I error rate: it is quite likely to reject at least one pairwise hypothesis even when the null hypothesis is true. Thus for multiple comparisons (usually more than 3 groups) the HSD test is a more reliable method.

4.1.3 Assumption Testing

There are 3 main assumptions for a one-way ANOVA test: the data points are independent, and the responses for each factor level have a normal distribution with the same variance. The first holds, as the tracks are picked from random playlists. The second can be checked by looking at the model's residual histogram: if the residuals are roughly normally distributed, the assumption should be met. As can be seen from Figure 6, the left side of the graph slightly violates this assumption. The third can be checked by conducting Levene's test for homogeneity of variance. The result shows a p value far smaller than 0.05, so the third assumption fails. However, the result is generally robust as the sample size increases, and more so when the sample sizes for all factor levels are equal. Since our sample sizes are fairly large, the result can still be considered valid.

Figure 6: Residual Histogram for 4.1.3

4.2 H0: no mean popularity difference between different keys

The methodology is similar to 4.1. All key groups are kept because all groups have a fairly large number of sample points (Figure 4).
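Levene's test used in 4.1.3 is available in scipy; a small sketch on synthetic groups with deliberately unequal spread, mimicking the failed homogeneity check:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Three illustrative groups with the same mean but clearly unequal variance.
g1 = rng.normal(40, 5, 500)
g2 = rng.normal(40, 15, 500)
g3 = rng.normal(40, 25, 500)

# H0 of Levene's test: all groups have equal variance.
stat, p = stats.levene(g1, g2, g3)
print(f"Levene W = {stat:.2f}, p = {p:.3g}")
```

A p value below 0.05 here, as in the report, means the equal-variance assumption is rejected.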
The F value of 2.796 corresponds to a p value of 0.00121, which is significant, so we reject the null hypothesis and conclude that the population means of the different key groups differ. The LSD and HSD tests give different results this time: the LSD test divides the keys into 6 groups, while the HSD test rejects only 3 out of 66 pairwise hypotheses at the 0.05 significance level. For the assumption testing, the residual graph shows a similar pattern to 4.1.3, with the left side slightly violating the assumption. This time, however, the p value of Levene's test is 0.26, so the third assumption holds. With a sufficient sample size, a slight violation of normality is fine.

4.3 H0: no mean popularity difference between different genres

The methodology is similar to 4.1. All genre groups are kept because all groups have a fairly large number of sample points (Figure 4). The F value of 257.7 corresponds to a p value of less than 0.01, which is significant, so we reject the null hypothesis and conclude that the population means of the different genre groups differ. The LSD test (Table 4) and HSD test give similar results this time, concluding that rap and latin show no difference while all the rest are significantly different from each other.

Genre   Track Popularity   Group
pop     45.90530           a
rap     41.84605           b
latin   41.44971           b
rock    39.69431           c
r&b     35.92940           d
edm     30.67829           e

Table 4: LSD Test Result for 4.3

Assumption testing gives a similar result to 4.1.3.

4.4 H0: no interaction effect between track released year and genre on popularity

4.4.1 Methodology

Common sense suggests that people's music taste may change over time. To study the potential interaction effect between released year and genre on popularity, two-way ANOVA should be used. It is usually helpful to explore interaction effects with some graphs first, and with 3 variables one can easily find ways to plot them. As with one-way ANOVA, if the global F test is significant, post hoc tests can be done to determine which groups differ from the others.
Finally, assumption testing is conducted to justify the validity of the results.

4.4.2 Interaction Effects Exploration

Figure 7 shows the change of popularity over time by genre; the number of observations in each interaction group is also presented. Interestingly, in all 6 genres, tracks from 2000-2010 have on average the lowest popularity, while tracks from 2011-2020 come top in 4 out of 6 genres. For edm, the reason tracks from before 2000 come out on top could follow the same logic as mentioned earlier about the small amount of data before 2000. As for rock, it seems solid that tracks from before 2000 are much more popular than recent ones. Figure 8 gives the popularity distribution of all 18 interaction groups. It is quite obvious that all groups in 2000-2010 have a wider IQR than their counterparts before 2000 or in 2011-2020. Another observation is that the relative positions of the medians and the lengths of the IQRs within each year period are quite similar across the three year periods.

Figure 7: Changes of Popularity Over Time by Genre
Figure 8: Popularity Distribution by Released Year Period * Genre Interaction Group

4.4.3 ANOVA Result and Post Hoc Analysis

The F value for the interaction term is 22.45 and the corresponding p value is less than 2e-16. Thus we conclude that the released year period, the genre, and the interaction between the two are all statistically significant at the 0.05 significance level. Figure 9 presents the 95% confidence intervals for the mean differences between interaction groups. The graph can be hard to read due to the large number of comparisons between interaction groups. If we fix one variable at a time, we get the following results. Before 2000, edm shows no difference from any of the other genres; rock is more popular than rap, rap is more popular than r&b, and pop is more popular than r&b; however, no evidence shows a difference between pop and the other genres.
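Interaction-cell comparisons like these can be reproduced by fusing the two factors into a single label and running Tukey's HSD over the cells, which is essentially what Figure 9 plots; a toy sketch with invented cell means:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(3)
# Toy period x genre cells (means invented to mimic rock's decline).
cells = {"pre2000:rock": 50, "2000-2010:rock": 42, "2011-2020:rock": 36}
values = np.concatenate([rng.normal(m, 8, 150) for m in cells.values()])
labels = np.repeat(list(cells.keys()), 150)

# Pairwise comparison of all interaction cells with family-wise control.
res = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(res)
```

With 18 real cells this yields the 153 pairwise confidence intervals behind the hard-to-read Figure 9.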
In 2000-2010, pop is the most popular; rap, latin and rock show no popularity difference and are all more popular than r&b; edm is the least popular. In 2011-2020, the popularity can be put in ordinal form: pop > rap > latin > r&b > rock > edm. Over time, r&b experienced a decrease in popularity from before 2000 to 2000-2010, then a rise in 2011-2020; rap follows the same pattern; the popularity of rock keeps decreasing; the popularity of pop does not change from before 2000 to 2000-2010, then increases in 2011-2020; edm and latin follow the same pattern as pop.

Figure 9: 95% Confidence Intervals for Mean Difference Between Groups

4.4.4 Assumption Testing

The assumption testing is similar to that of the one-way ANOVA, and the result is similar to 4.2. As we have a fairly big sample size, the results can be considered valid.

5 Regression

5.1 Multiple Regression: Finding the features that make a track popular

5.1.1 Pairwise Relationships Exploration

Figure 10 shows the pairwise correlation coefficients between all numeric variables. None of these relationships show strong correlation. A few pairs are worth noting: valence and danceability, acousticness and energy, loudness and energy, acousticness and loudness. The signs of the correlation coefficients are as expected. Figure 11 gives the scatter plots between all potential regressors and track popularity. No clear relationship is visible, which matches the low correlation coefficients between track popularity and all other variables shown in Figure 10. Figure 12 shows boxplots for the 3 categorical variables. There is no obvious distribution difference between the key groups or the mode groups, while the 3 released year groups do seem to have different popularity distributions. This exploration of the relationships between the response and the potential contributing features helps direct the regression and predicts the result to some extent.

Figure 10: Heatmap of Numeric Variables
Figure 11: Numeric Variables vs. Track Popularity
Figure 12: Categorical Variables vs. Track Popularity

5.1.2 Model Specification

The response for the model is track popularity, and the candidate features are danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration, key, mode, released year period and artists count. As said before, genre classification can be subjective and controversial, and genre is partly determined by the audio features mentioned before, which are relatively more objective and describe factual aspects of the track; genre is therefore excluded. Among the candidate features, 4 are categorical, so one-hot encoding is used.

5.1.3 Result

As the exploratory analysis above shows, one would not expect to find a strong linear relationship between these features and the response, and the model is not expected to perform well. For that reason, backward selection was used, as forward selection might lead to not including any features at all. Figure 13 shows the multiple regression result. The total explanatory power of the model (adjusted R squared) is 0.1117. Some of the results match expectations while others do not. For example, the coefficients of the artists count dummy variables increase as the number of artists increases, which is aligned with the result of the hypothesis testing, while energy has a negative coefficient of relatively large magnitude. Other observations include that a release in 2000-2010 contributes negatively to popularity, and that pop tracks tend to be popular.

5.2 Simple Regression: Checking the significance of the relationships found in the pairwise correlations

Four relatively strong correlations were spotted earlier, and simple regression is used to check whether each coefficient is significant at the 5% level. The results show that all four relationships (valence and danceability, acousticness and energy, loudness and energy, acousticness and loudness) are significant even at the 1% level.

Figure 13: Multiple Regression Result