Chord Recognition and Segmentation using EM-trained Hidden Markov Models

Alex Sheh & Dan Ellis
{asheh79,dpwe}@ee.columbia.edu
LabROSA, Columbia University

Outline: • Chord Recognition • EM-trained HMMs • Experiments & Results

Chord Recognition - Sheh & Ellis 2003-10-29 - p. 1/19

The Chord Transcription Problem
• Basic problem: recover chord-sequence labels from audio
- easier than note transcription?
- more relevant to listener perception?

Difficulties with Chord Transcription
• Enharmonicity: chord labels can be ambiguous (C# vs. Db)
• Many different chord classes: major, minor, 6th, 9th, ...
- fold into 7 'main' classes: maj, min, maj7, min7, dom7, aug, dim
• Acoustic variability: chords are the same regardless of instrumentation

Approaches to Chord Transcription
• Note transcription, then note→chord rules
- like labeling chords in MIDI transcripts
• Spectrum→chord rules
- i.e. find harmonic peaks, use knowledge of the likely notes in each chord
• Trained classifier
- don't use any "expert knowledge"; instead, learn patterns from labeled examples

Statistical-Pattern-Recognition Chord Recognizer
• Use labeled training examples {xn, ωxn} to estimate p(x|ωi) for features x and chord class ωi
- sort the examples according to class, then estimate the conditional pdf p(x|ωi) for each class
• Use Bayes' rule to get posterior probabilities for each class given the features:

  p(ωi|x) = p(x|ωi)·p(ωi) / Σj p(x|ωj)·p(ωj)

Training Data Sources
• We need (lots of) examples of audio segments and the appropriate chord labels
- not widely available! Even when you can get it, there is very little
• We could hand-mark a training set
- painfully time-consuming!
• Can we generate it automatically?
- we could, if we already had the chord transcription system working...
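The Bayes-rule recognizer described above can be sketched in a few lines. This is an illustrative toy, not the authors' code: the two class names, the 4-dimensional features, and the random training data are all made up, and diagonal-covariance Gaussians stand in for the class-conditional pdfs (the deck later describes its models as means and variances of feature vectors).

```python
import numpy as np

# Toy sketch of the statistical-pattern-recognition recognizer:
# estimate per-class diagonal Gaussians p(x|w_i) from labeled examples,
# then get posteriors p(w_i|x) via Bayes' rule.

rng = np.random.default_rng(0)
classes = ["maj", "min"]
# Fake labeled training features (think: low-dimensional stand-ins for PCP vectors)
train = {
    "maj": rng.normal(loc=0.0, scale=1.0, size=(50, 4)),
    "min": rng.normal(loc=2.0, scale=1.0, size=(50, 4)),
}

# "Sort according to class" and estimate mean/variance -> p(x|w_i)
params = {c: (x.mean(axis=0), x.var(axis=0) + 1e-6) for c, x in train.items()}
priors = {c: len(train[c]) / sum(len(v) for v in train.values()) for c in classes}

def log_gauss(x, mean, var):
    """Log of a diagonal-covariance Gaussian density."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def posterior(x):
    """p(w_i|x) = p(x|w_i) p(w_i) / sum_j p(x|w_j) p(w_j)."""
    log_joint = np.array([log_gauss(x, *params[c]) + np.log(priors[c])
                          for c in classes])
    log_joint -= log_joint.max()          # for numerical stability
    p = np.exp(log_joint)
    return p / p.sum()

x = np.zeros(4)                            # lies near the "maj" class mean
post = posterior(x)
print(dict(zip(classes, post.round(3))))
```

Classification then just picks the class with the largest posterior; the denominator is the same for every class, so only the numerators p(x|ωi)·p(ωi) need comparing.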
Speech Recognition Analogy
• Chord recognition is like word recognition
- we are trying to recover a sequence of 'exclusive' labels associated with an audio stream
- we have a lot of potential training audio, but no time labels
• Can we do what they do in ASR?
- i.e. iterative re-estimation of models and labels using Expectation-Maximization
- needs only the label sequence, not timings (i.e. words or chords in order, no times)

EM HMM Re-Estimation
• Estimate 'soft' labels using the current models
• Update the model parameters from the new labels
• Repeat until convergence to a local maximum
- E-step: estimate the probabilities p(qn = i | X1..XN, Θold) of the unknowns
- M-step: maximize over the parameters Θ: max E[log p(X, Q | Θ)]
[Figure: a model inventory (e.g. phone states ae1 ae2 ae3, dh1 dh2), labeled training data ("dh ax k ae t s ae t aa n ..."), uniform initialization of the alignments with parameters Θinit, then alternating E- and M-steps until convergence]

Chord Sequence Data Sources
• All we need are the chord sequences for our training examples
• OLGA tab archives (http://www.olga.net/)?
- multiple authors, unreliable quality
• Hal Leonard "Paperback Song Series"
- many Beatles songs, consistent detail
- manually retyped for 20 songs: "Beatles for Sale", "Help", "Hard Day's Night"
• Issues: repeats, intros, weird bits in the middle

Experiments
• Preliminary investigations to see if this works
- small database to get started
- compare different feature sets
- different ways to evaluate results
- can we reintroduce a little high-level knowledge?
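The E-step/M-step cycle on the "EM HMM Re-Estimation" slide can be illustrated with a deliberately stripped-down sketch. Real HMM training uses forward-backward over the state sequence; here the transition structure is dropped entirely, which leaves the same estimate-soft-labels-then-update loop. All data and values below are synthetic.

```python
import numpy as np

# Simplified EM sketch (NOT Baum-Welch): two 1-D Gaussian "chord" models,
# frames with unknown labels, fixed unit variance. Each iteration:
#   E-step: soft label probabilities p(q_n = i | x_n, theta_old)
#   M-step: re-estimate the class means from those soft labels.

rng = np.random.default_rng(1)
# Fake features drawn from two classes centered at -2 and +2
frames = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])

means = np.array([-0.5, 0.5])   # near-uniform initialization
var = 1.0

for _ in range(20):
    # E-step: responsibilities of each class for each frame
    log_lik = -0.5 * (frames[:, None] - means[None, :]) ** 2 / var
    resp = np.exp(log_lik)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted means maximize the expected log-likelihood
    means = (resp * frames[:, None]).sum(axis=0) / resp.sum(axis=0)

print(np.sort(means).round(1))   # converges near the true centers [-2, 2]
```

The point of the analogy survives the simplification: starting from rough, uniform-ish assignments, alternating soft labeling and parameter updates pulls the models toward the structure in the unlabeled frames, with no hand-marked timings required.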
Training/Test Conditions
• Two training sets:
- train on 18 songs, test on the 2 held-out songs (fair test)
- train on all 20 songs, test on 2 songs from the training set (cheating; rewards over-fitting; an upper bound)
• Two evaluation measures:
- recognition: transcribe unknown chords
- forced alignment: find the time boundaries given the correct chord sequence
- score frame-level accuracy against hand-markings

Features
• Can try any features with EM training...
• Use MFCCs as a baseline
- "the" feature in ASR, also useful in music IR
- also try deltas and double-deltas, as in ASR
- capture 'formants', not pitch - a straw man
• "Pitch Class Profile" (PCP) features (Fujishima'99)
- collapse FFT bin energies (up to 2000 Hz) into 24 quarter-tone chroma bins
- a/k/a chroma spectrum (Bartsch'01)
[Figure: spectrum-to-chroma mapping weights, FFT bin vs. chroma bin]

Averaging Rotated PCP Models
• The statistical system learns separate models for each of (7 chord types) x (21 roots)
- models are the means and variances of the feature vectors
- only 32 combinations actually appear in our training set; even so, many have few training instances
• Expect similarity between e.g. Amaj & Bmaj
- same chord, just shifted in frequency
- shift = rotation of the 24-bin chroma space
• Can align & average all transpositions of the same chord after each training iteration
- then rotate back to the starting chroma and continue...

Results: Recognition
• Models are not adequately discriminant

Percent frame accuracy, recognition (random ~3%):

          Eight Days a Week    Every Little Thing
Feature   train18   train20    train18   train20
MFCC         5.9      16.7        7.7      19.6
MFCC_D      15.8       7.6        1.5       6.9
PCP         10.0      23.6       18.2      26.4
PCP_ROT     23.3      23.1       20.1      13.1

- MFCCs are poor (can overtrain)
- PCPs are a little better (ROT helps generalization)

Results: Alignment
• Some glimmers of hope!
Percent frame accuracy, forced alignment:

          Eight Days a Week    Every Little Thing
Feature   train18   train20    train18   train20
MFCC        27.0      20.9       14.5      23.0
MFCC_D      24.1      13.1       19.9      19.7
PCP         26.3      41.0       46.2      53.7
PCP_ROT     68.8      68.3       83.3      83.8

- PCP_ROT gives the best accuracy & the best generalization

Chord Confusions
• "E major" confusion matrix for Eight Days a Week (frames whose true label is E major, by recognized root and chord type):

Root     Maj   Min   Maj7  Min7  Dom7  Aug   Dim
E/Fb     158   115    .     9     .     .     .
E#/F       9    .     .     .     .     .     .
F#/Gb      .    .     .    11     .     .     .
G          3    .     .     .     .     .     .
G#/Ab      .    .     .     .     .     .     .
A          9    .     .     .     1     .     .
A#/Bb      .    .     .     .     .     .     .
B/Cb       8    .     .     .     .     .     .
B#/C       .    .     .     .     .     .     .
C#/Db      .   20     .     .     .     .     .
D          .   14     .     .     .     .     .
D#/Eb      .    .     .     .     .     .     .

• Major/minor confusions
• C# is the relative minor of E major (shared notes)

What did the models learn?
• The chord model centers (means) indicate chord 'templates'
[Figure: PCP_ROT family model means (train18) for C-root chords - dim, dom7, maj, min, and min7 profiles over the chroma bins C through B]

Recognition/Alignment Example
• A flavor of the features & results
[Figure: chromagram of the Beatles' "Eight Days a Week" (Beatles for Sale, 4096-point FFT), 16.27-24.84 s, with the true, force-aligned, and recognized chord sequences (E, G, D, Bm, Em7, Am, ...)]

Conclusions & Future Work
• ASR-style EM learning is a viable approach for chord models
- more training data needed
• Better features
- capture 'global' properties of the chords
- robust to fine-tuning issues?
• A better representation for the training labels
- i.e. allow for extra repeats?
• More ways to reintroduce music knowledge
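The two chroma ideas the deck rests on, "transposition = rotation of chroma space" (Averaging Rotated PCP Models) and "model means as chord templates" (What did the models learn?), can be made concrete with a small sketch. It is illustrative only: binary 24-bin templates with made-up noise offsets stand in for learned means, and 12 semitone roots are used rather than the deck's 21 enharmonic spellings.

```python
import numpy as np

# Sketch: (1) align per-root major-chord models by rotating the 24-bin
# chroma vector to a common root, average them, and rotate back out;
# (2) match a frame against the rotated template for each candidate root.

BINS = 24  # quarter-tone chroma bins, 2 per semitone

def chroma_template(semitones):
    """Binary 24-bin template with energy at the given chord degrees."""
    t = np.zeros(BINS)
    t[[2 * s for s in semitones]] = 1.0
    return t

# Hypothetical per-root major models (root, major 3rd, 5th), as if estimated
# separately for C and D roots, with different made-up noise floors:
c_maj = chroma_template([0, 4, 7]) + 0.1
d_maj = np.roll(chroma_template([0, 4, 7]), 2 * 2) + 0.2  # D = C up 2 semitones

# Align by rotating each model back to a common (C) root, then average:
aligned = np.stack([c_maj, np.roll(d_maj, -2 * 2)])
shared_maj = aligned.mean(axis=0)
# ...and each root's model is re-created by rotating the average back out:
d_maj_smoothed = np.roll(shared_maj, 2 * 2)

# Template matching: pick the root whose rotated template best fits a frame
frame = np.roll(chroma_template([0, 4, 7]), 2 * 7)  # looks like G major
scores = [np.dot(frame, np.roll(shared_maj, 2 * r)) for r in range(12)]
print("best root (semitones above C):", int(np.argmax(scores)))  # -> 7 (G)
```

Pooling all transpositions this way is what lets rarely seen roots borrow statistics from common ones, which is the generalization benefit the PCP_ROT rows in the results tables point to.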