The State of Music at LabROSA
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu · http://labrosa.ee.columbia.edu/
2014-01-25

1. The State of LabROSA
2. Music Projects
3. The Big Picture

BP-NMF (Liang, Hoffman)
• Beta-process NMF: automatically chooses how many components to use
• Notation: lower-case bold letters (e.g. x, d, s, and z) denote vectors; f ∈ {1, 2, ..., F} indexes frequency, t ∈ {1, 2, ..., T} indexes time, and k ∈ {1, 2, ..., K} indexes dictionary components
• BP-NMF is formulated as

    X = D(S ⊙ Z) + E        (1)

[Figure (a): the selected components learned from single-track instrument recordings. For each instrument, the components are sorted by approximate fundamental frequency; the dictionary is cut off above 5512.5 Hz for visualization purposes.]

Structure Similarity (Silva, Papadopoulos)
• CK-1 image similarity uses MPEG video compression, which can exploit shifted parts of an image
• Match pieces based on structure recurrence plots (Bello '11)

[Figure: recurrence plots (a)-(d) for example pieces.]

Block-Structure RPCA (Papadopoulos)
• RPCA separates vocals and background based on low-rank optimization
• A single trade-off parameter λ; adjust it based on higher-level musical features?
• One can hear some high-frequency isolated coefficients superimposed on the separated voice. This drawback could be reduced by including harmonicity priors in the sparse component of RPCA, as proposed in [20].
• Ground-truth versus estimated voice-activity location: imperfect voice-location information still allows an improvement, although to a lesser extent than with ground-truth voice-activity information. The decrease in the results mainly comes from background segments classified as vocal segments.

[Fig. 4: separated voice for various values of λ for the "Pink Noise Party" song.]

Ordinal LDA Segmentation (McFee)
• Low-rank decomposition of the skewed self-similarity matrix to identify repeats
• Learned weighting of multiple factors to segment: Linear Discriminant Analysis between adjacent segments

2.1. Latent structural repetition
Figure 1 outlines our approach for computing structural repetition features, which is adapted from Serrà et al. [2]. First, we extract beat-synchronous features (e.g., MFCCs or chroma) from the signal, and build a binary self-similarity matrix by linking each beat to its nearest neighbors in feature space (fig. 1, top-left). With beat-synchronous features, repeated sections appear as diagonals in the self-similarity matrix. To make repeated sections easier to detect, the self-similarity matrix is skewed by shifting the ith column down by i rows (fig. 1, top-right), thus converting diagonals into horizontals.
Using nearest-neighbor linkage to compute the self-similarity matrix results in spurious links and skipped connections. Serrà et al. resolve this by convolving with a Gaussian filter, which effectively suppresses noise but also blurs segment boundaries. Instead, we use a horizontal median filter, which (for odd window lengths) produces a binary matrix, suppresses links that do not belong to long sequences of repetitions, and fills in skipped connections (fig. 1, bottom-left). Because median filtering preserves edges, we may expect more precise boundary detection. Let R ∈ ℝ^(2t×t) denote the median-filtered, skewed self-similarity matrix.

[Fig. 1: repetition features derived from Tupac Shakur, "Trapped". Panels: binary k-nearest-neighbor self-similarity (Beat × Beat), skewed self-similarity (Lag × Beat), filtered self-similarity, and latent repetition (Factor × Beat).]
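Below is a minimal Python sketch of the repetition-feature computation just described: build a k-nearest-neighbor self-similarity matrix, skew it, and apply a horizontal median filter. This is an illustration under assumptions, not McFee's code; the function name and the defaults for k and win are invented for the example.

import numpy as np
import scipy.ndimage

def skewed_repetition_matrix(features, k=5, win=17):
    """features: (d, t) beat-synchronous feature matrix (e.g. MFCCs or chroma).
    Returns R, the median-filtered, skewed binary self-similarity (2t x t)."""
    d, t = features.shape
    # Binary self-similarity: link each beat to its k nearest neighbors
    # in feature space (the k closest include the trivial self-link).
    dists = np.linalg.norm(features[:, :, None] - features[:, None, :], axis=0)
    S = np.zeros((t, t))
    for j in range(t):
        S[np.argsort(dists[:, j])[:k], j] = 1.0
    # Skew: shift the i-th column down by i rows, turning the diagonals
    # of repeated sections into horizontals; pad to 2t rows.
    R = np.zeros((2 * t, t))
    for i in range(t):
        R[i:i + t, i] = S[:, i]
    # Horizontal median filter with an odd window: output stays binary,
    # isolated spurious links are dropped, skipped connections filled in.
    return scipy.ndimage.median_filter(R, size=(1, win))

The low-rank decomposition of R and the LDA-based factor weighting from the slide would operate on this matrix; they are omitted here.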
"Remixavier" (Raffel)
• Optimal align-and-cancel of a full mix against its a cappella; timing and channel may differ (a toy sketch of the idea appears after the Summary)

Singing ASR (McVicar)
• Speech recognition adapted to singing needs aligned data
• Extensive work to match up scraped "acapellas" with the full mix, including jumps!

Million Song Dataset (Bertin-Mahieux, McFee)
• Many facets:
    Echo Nest audio features + metadata
    Echo Nest "taste profile" user-song listen counts
    Second Hand Songs covers
    musiXmatch lyric bag-of-words
    last.fm tags
• Now with audio? Resolving artist / album / track / duration against what.cd

MIDI-to-MSD (Raffel, Shi)
• MIDI aligned to audio amounts to a nice transcription
• Can we find matches in large databases?

Summary
• Basic techniques: beat tracking, segmentation, chord recognition, transcription
• More data: audio alignments, aligned transcriptions
• Sharing code and data
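To make the align-and-cancel idea from the "Remixavier" slide concrete, here is the toy Python sketch referenced there. It assumes a single global delay and a scalar channel gain, estimated by cross-correlation and least squares, then subtracts the a cappella from the mix; the real problem (time-varying drift, channel filtering) is harder, and all names and parameters are illustrative, not Raffel's implementation.

import numpy as np

def align_and_cancel(mix, acapella):
    """Estimate the backing track: mix minus the gain-scaled, delay-aligned acapella."""
    n = len(mix) + len(acapella) - 1
    # Linear cross-correlation via zero-padded FFTs; the peak gives the lag.
    X = np.fft.rfft(mix, n)
    Y = np.fft.rfft(acapella, n)
    xcorr = np.fft.irfft(X * np.conj(Y), n)
    lag = int(np.argmax(np.abs(xcorr)))
    if lag >= len(mix):          # indices past len(mix)-1 encode negative lags
        lag -= n
    # Shift the acapella by the estimated lag and match lengths.
    if lag >= 0:
        a = np.pad(acapella, (lag, 0))[:len(mix)]
    else:
        a = acapella[-lag:][:len(mix)]
    a = np.pad(a, (0, len(mix) - len(a)))
    # Least-squares scalar gain for the channel difference, then cancel.
    gain = np.dot(a, mix) / (np.dot(a, a) + 1e-12)
    return mix - gain * a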