Similarity Algorithms for Music Plagiarism Daniel Müllensiefen Department of Psychology Goldsmiths University of London Structure 1 2 3 4 5 Introduction Method and Examples Algorithms Empirical Study Summary – Future Research Introduction: The Background Huge Public Interest: Importance for the pop industry, interesting emphasis on tunes (melodic plagiarism) Ambigity surrounding UK law in defining a substantail part Lack of Research, Exceptions: Timothy English: Sounds Like Teen Spirit, (2008) Stan Soocher: They Fought The Law, (1999) Charles Cronin: Concepts of Melodic Similarity in Music-Copyright, (1998) Introduction: No hard and fast rules! The 2-second rule The 1-bar rule The 5-notes rule It is more complicated than that! What has been taken needs to be at least a substantial part of the copyright work (Lord Millet, Designers Guild v. Williams [2000] 1 WLR 2426) Introduction: Müllensiefen & Pendzich (2009) The first quantitative study on music plagiarism Questions: Can computer algorithms predict plagiarism incidents from the melodic similarity of tunes? What is the predictive power of these algorithms when superimposed on documented case law? What is the frame of reference (directionality of comparisons)? How can prior musical knowledge be taken into account? Method: Outline Selection of 20 cases* from 1970 to 2005 – focus on melodic aspects of copyright infringement Transcription to monophonic MIDI files Analysis of judges' written opinions Reduction of court decisions to two categories “pro plaintiff” = melodic plagiarism “contra plaintiff” = no infringement Use of context: Corpus of 14,063 pop songs representing pop music history from 1950-2006 *from UCLA/Columbia copyright infringement database: http://cip.law.ucla.edu/ Example 1: Bright Tunes v. Harrisongs (1976, 420 F. Supp) The Chiffons He‘s So Fine, 1963 No. 1 in US, UK highest position 11 George Harrison, My Sweet Lord Published in 1971 No.-1-Hit in US, UK & (West-)Germany Example 2: Selle v. Gibb (1983, 567 F.Supp 1173) Ronald Bee Selle, “Let It End” Unreleased (1975) Gees, “How Deep Is Your Love” (1977) Algorithms Statistically Informed Similarity Algorithms: Importance of frequency of melodic elements for similarity assessment Inspired from computational linguistics (Baayen, 2001) and text processing (Manning & Schütze, 1999; Jurafsky & Martin, 2000) Conceptual Components: m-types (aka n-grams) as melodic elements Frequency counts: Type frequency (TF) and Inverted Document Frequency (IDF) Melodic elements: m-types Word Type t Frequency f(t), Melodic Type τ (pitch interval, length 2) Frequency f(τ), Twinkle 2 0, +7 1 little 1 +7, 0 1 star 1 0, +2 1 How 1 +2, 0 1 I 1 0, -2 3 wonder 1 -2, -2 1 what 1 -2, 0 2 you 1 0, -1 1 are 1 -1, 0 1 Type- and Inverted Document Frequencies C Corpus of melodies m melody τ TF(m, " ) = Melodic type T # different melodic types f m (" ) # $ f (" ) m i=1 |m:τ ∈ m| # melodies containing τ Melodic Type τ (pitch interval, length 2) Frequency f(τ) ! 0, +7 1 +7, 0 1 0, +2 1 +2, 0 1 0, -2 3 -2, -2 1 -2, 0 2 0, -1 1 -1, 0 1 ! i ( IDFC (" ) = log C m:" # m ) Type- and Inverted Document Frequencies C Corpus of melodies m melody TF(m, " ) = τ Melodic type τ T # different melodic types f m (" ) # $ f (" ) m i=1 |m:τ ∈ m| # melodies containing τ Melodic Type τ (pitch interval, length 2) Frequency f(τ) TF(m, ! τ) 0, +7 1 0.083 +7, 0 1 0.083 0, +2 1 0.083 +2, 0 1 0.083 0, -2 3 0.25 -2, -2 1 0.083 -2, 0 2 0.167 0, -1 1 0.083 -1, 0 1 0.083 ! i ( IDFC (" ) = log C m:" # m ) Type- and Inverted Document Frequencies C Corpus of melodies m melody TF(m, " ) = τ Melodic type τ T # different melodic types f m (" ) # $ f (" ) m ( IDFC (" ) = log i i=1 |m:τ ∈ m| # melodies containing τ Melodic Type τ (pitch interval, length 2) Frequency f(τ) TF(m, ! τ) IDFC(τ) 0, +7 1 0.083 1.57 +7, 0 1 0.083 1.36 0, +2 1 0.083 0.23 +2, 0 1 0.083 0.28 0, -2 3 0.25 0.16 -2, -2 1 0.083 0.19 -2, 0 2 0.167 0.22 0, -1 1 0.083 0.51 -1, 0 1 0.083 0.74 ! C m:" # m ) Type- and Inverted Document Frequencies C Corpus of melodies m melody TF(m, " ) = τ Melodic type τ T # different melodic types f m (" ) # $ f (" ) m ( IDFC (" ) = log C m:" # m i i=1 |m:τ ∈ m| # melodies containing τ Melodic Type τ (pitch interval, length 2) Frequency f(τ) TF(m, ! τ) IDFC(τ) TFIDFm,C(τ) 0, +7 1 0.083 1.57 0.13031 +7, 0 1 0.083 1.36 0.11288 0, +2 1 0.083 0.23 0.01909 +2, 0 1 0.083 0.28 0.02324 0, -2 3 0.25 0.16 0.04 -2, -2 1 0.083 0.19 0.01577 -2, 0 2 0.167 0.22 0.03674 0, -1 1 0.083 0.51 0.04233 -1, 0 1 0.083 0.74 0.06142 ! ) TF-IDF Correlation " C (s,t) = ' TFIDF (# )$TFIDF (# ) ' (TFIDF (# )) $ ' (TFIDF t, C 2 # % sn &t n ! s, C # % sn &t n s, C # % sn &t n t, C (# )) 2 Feature-based Similarity Ratio Model (Tversky, 1977): Similarity σ(s,t) related to # features in s and t have common salience of features f() " (s,t) = ! f (sn # tn ) ,$ ,% & 0 f (sn # tn )+ $f (sn \ tn )+ %f (tn \ sn ) features => m-types salience => IDF and TF different values of α, β to change frame of reference Variable m-type lengths (n=1,…,4), entropy-weighted average Feature-based Similarity Tversky.equal measure (with α = β = 1) " (s,t) = & IDF (# ) + & # $ sn %t n & # $ sn %t n C # $ sn \ t n IDFC (# ) IDFC (# ) + &# $ t n \ sn IDFC (# ) Tversky.plaintiff.only measure (with α = 1, β = 0) ! " plaintiff.only (s,t) = & IDFC (# ) # $ sn %t n & # $ sn %tn IDFC (# ) + &# $ s n \ tn IDFC (# ) Tversky.defendant.only measure (with α = 0, β = 1) ! " defendant .only (t,s) = & # $ sn %t n & # $ sn %t n ! Tversky.weighted measure with IDFC (# ) IDFC (# ) + &# $ t n \ sn IDFC (# ) & "= & # $ sn %t n # $ sn ! TFs(# ) and TFs(# ) & "= & # $ sn %t n # $ tn ! TFt (# ) TFt (# ) Evaluation Ground Truth: 20 cases with pro/contra decision (7/13) Evaluation metrics Accuracy (% correct at optimal cut-off on similarity scale) AUC (Area Under receiver operating characteristic Curve) Evaluation 1.0 0.9 0.8 0.7 0.6 0.5 0.4 Accuracy AUC 0.3 0.2 0.1 0.0 t. . f m nen rel. ual ntif ant ed t o i r q h C d it ko ig co v.e .pla fen k d m e T E Su U F Tv .de v.w I- D T Tv TF s Di Evaluation: ROC Curves A Qualitative Look Ronald Selle, “Let It End” Bee Gees, “How Deep Is Your Love” Observations: Decision sometimes based on ‘characteristic motives’ (incl. rhythm) High-level form can be important (e.g. call-and-response structure) Frame of reference can be different Summary Court decisions can be related closely to melodic similarity Plaintiff’s song is often the frame of reference when determining the question of substantial part (“It depends upon its importance to the copyright work. It does not depend upon its importance to the defendants” per Lord Millet) Statistical information about commonness of melodic elements is important Future Research Conduct analogous studies with cases from the UK and Germany and utilize additional US cases Include rhythm in m-types Use higher level features of melodies in addition to m-types Conduct listening experiment to determine cognitive validity Similarity algorithms for music plagiarism Daniel Müllensiefen Department of Psychology Goldsmiths, University of London