The State of Music
at LabROSA
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. The State of LabROSA
2. Music Projects
3. The Big Picture
Music at LabROSA - Dan Ellis
2014-01-25 - 1 /10
Beta Process NMF
Liang
Hoffman
• Automatically choose how many components to use
Lower-case bold letters (e.g. x, d, s, and z) denote vectors. $f \in \{1, 2, \dots, F\}$ is used to index frequency. $t \in \{1, 2, \dots, T\}$ is used to index time. $k \in \{1, 2, \dots, K\}$ is used to index dictionary components. BP-NMF is formulated as:
$X = D(S \odot Z) + E$    (1)
(a) The selected components learned from single-track instrument. For each instrument, the components are sorted by approximated fundamental frequency. The dictionary is cut off above
5512.5 Hz for visualization purposes.
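The BP-NMF model $X = D(S \odot Z) + E$ can be illustrated with a small generative sketch. This is only a toy sampler under stated assumptions, not the inference procedure from the paper: the gamma draws for D and S, the Beta hyperparameters `a0` and `b0`, the Gaussian noise level, and the function name `sample_bp_nmf` are all hypothetical choices made for illustration. It shows how the binary mask Z switches dictionary components off, which is what lets the model decide how many components are actually used.

```python
import numpy as np

def sample_bp_nmf(F=64, T=100, K=20, a0=2.0, b0=1.0, seed=0):
    """Draw X = D (S * Z) + E for illustrative (assumed) priors."""
    rng = np.random.default_rng(seed)
    D = rng.gamma(1.0, 1.0, size=(F, K))        # dictionary, F x K, nonnegative
    S = rng.gamma(1.0, 1.0, size=(K, T))        # activations, K x T, nonnegative
    # Assumed sparse prior on per-component usage probabilities.
    pi = rng.beta(a0 / K, b0 * (K - 1) / K, size=K)
    Z = (rng.random((K, T)) < pi[:, None]).astype(float)  # binary mask, K x T
    E = 0.01 * rng.standard_normal((F, T))      # assumed small Gaussian noise
    X = D @ (S * Z) + E                         # elementwise product S * Z
    return X, D, S, Z, E

X, D, S, Z, E = sample_bp_nmf()
active = Z.any(axis=1)                          # components that are ever switched on
print("components in use:", int(active.sum()), "of", Z.shape[0])
```

Rows of Z that are entirely zero contribute nothing to X, so the effective dictionary size is the number of active rows rather than the truncation level K.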
Structure Similarity
Silva
Papadopoulos
• CK-1 image similarity uses MPEG Video compression
can exploit shifted parts of image
• Match pieces based on structure recurrence plots (Bello’11)
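The compression-based distance in the first bullet can be sketched as follows. CK-1 is usually stated as $d(x, y) = \frac{C(x|y) + C(y|x)}{C(x|x) + C(y|y)} - 1$, where $C(a|b)$ is the size of an MPEG-1 video whose frames are images a and b. The sketch below swaps the MPEG encoder for `zlib` on concatenated byte strings, purely so that it runs without a video codec; the name `ck1_like` and the use of `zlib` are assumptions for illustration, and the numbers it produces are not comparable to real CK-1 values.

```python
import random
import zlib

def csize(a: bytes, b: bytes) -> int:
    """Compressed size of a followed by b (stand-in for an MPEG-1 encode of two frames)."""
    return len(zlib.compress(a + b, 9))

def ck1_like(x: bytes, y: bytes) -> float:
    """CK-1-style distance with zlib swapped in for the video compressor."""
    return (csize(x, y) + csize(y, x)) / (csize(x, x) + csize(y, y)) - 1.0

rng = random.Random(0)
a = bytes(rng.getrandbits(8) for _ in range(2000))
b = bytes(rng.getrandbits(8) for _ in range(2000))
a_shifted = a[100:] + a[:100]                  # same content, circularly shifted

print(ck1_like(a, a), ck1_like(a, a_shifted), ck1_like(a, b))
```

Because the compressor can reuse material it has already seen at a different offset, a shifted copy stays close to the original, which is the property the slide points to for recurrence plots.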
[Figure: four panels, (a) to (d); axis ticks run from 0 to 500, value scale from 0 to 1]
Block Structure RPCA
Papadopoulos
• RPCA separates vocals and background based on low-rank optimization
single trade-off parameter
adjust based on higher-level musical features?
[...] 4, one can hear some high-frequency isolated coefficients superimposed on the separated voice. This drawback could be reduced by including harmonicity priors in the sparse component of RPCA, as proposed in [20].
• Ground truth versus estimated voice activity location. Imperfect voice activity location information still allows an improvement, although to a lesser extent than with ground-truth voice activity information. The decrease in the results mainly comes from background segments classified as vocal segments.
Fig. 4. Separated voice for various values of λ for the Pink Noise Party song
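The low-rank plus sparse split behind this slide can be sketched with plain RPCA (principal component pursuit), which minimizes $\|L\|_* + \lambda \|S\|_1$ subject to $M = L + S$. The code below is a minimal sketch of the standard fixed-penalty augmented Lagrangian iteration with the common defaults $\lambda = 1/\sqrt{\max(m, n)}$ and $\mu = mn / (4\|M\|_1)$. It is not the block-structured variant the slide describes, and the function names are hypothetical. In the separation setting, M would be a magnitude spectrogram, L the repetitive background and S the vocals; here a synthetic matrix stands in for the spectrogram.

```python
import numpy as np

def shrink(A, tau):
    """Elementwise soft threshold."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svt(A, tau):
    """Singular value thresholding: soft-threshold the singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def rpca(M, lam=None, max_iter=1000, tol=1e-7):
    """Split M into low-rank L and sparse S; lam is the single trade-off parameter."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    mu = m * n / (4.0 * np.abs(M).sum())
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                       # Lagrange multipliers
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        R = M - L - S                          # constraint residual
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S

# Synthetic check: rank-2 matrix plus 5% large sparse corruptions.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 60))
mask = rng.random((60, 60)) < 0.05
S0 = np.zeros((60, 60))
S0[mask] = rng.choice([-5.0, 5.0], size=int(mask.sum()))
L_hat, S_hat = rpca(L0 + S0)
print("relative error on L:", np.linalg.norm(L_hat - L0) / np.linalg.norm(L0))
```

Raising `lam` pushes more energy into the low-rank part and lowering it pushes more into the sparse part, which is the trade-off the slide asks about adapting from higher-level musical features.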
Ordinal LDA Segmentation
McFee
• Low-rank decomposition of skewed self-similarity to identify repeats
• Learned weighting of multiple factors to segment
Linear Discriminant Analysis between adjacent segments
[Fig. 1 panels: Self-similarity, Skewed self-sim., Filtered self-sim., Latent repetition; axes: Beat, Lag, Factor]
2.1. Latent structural repetition
Figure 1 outlines our approach for computing structural repetition features, which is adapted from Serrà et al. [2]. First, we
extract beat-synchronous features (e.g., MFCCs or chroma)
from the signal, and build a binary self-similarity matrix by
linking each beat to its nearest neighbors in feature space
(fig. 1, top-left). With beat-synchronous features, repeated
sections appear as diagonals in the self-similarity matrix. To
easily detect repeated sections, the self-similarity matrix is
skewed by shifting the ith column down by i rows (fig. 1,
top-right), thus converting diagonals into horizontals.
Using nearest-neighbor linkage to compute the self-similarity matrix results in spurious links and skipped connections. Serrà et al. resolve this by convolving with a
Gaussian filter, which effectively suppresses noise, but also
blurs segment boundaries. Instead, we use a horizontal median filter, which (for odd window length) produces a binary
matrix, suppresses links which do not belong to long sequences of repetitions, and fills in skipped connections (fig. 1,
bottom-left). Because median filtering preserves edges, we
may expect more precise boundary detection.
Let $R \in \mathbb{R}^{2t \times t}$ denote the median-filtered, skewed self-similarity [...]
Fig. 1. Repetition features derived from Tupac Shakur - Trapped. Top-left: a binary self-similarity (k-nearest [...]
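The repetition pipeline described in section 2.1 (nearest-neighbor linkage, skewing, horizontal median filtering) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function names, the Euclidean distance, and the default `k` and `width` values are hypothetical. The skew is implemented by indexing each link by its lag (row minus column, offset by t), which is one sign convention that turns diagonals into horizontals in a 2t x t matrix; the paper's plotted orientation may differ.

```python
import numpy as np
from scipy.ndimage import median_filter

def knn_self_similarity(X, k=2):
    """X: (t, d) beat-synchronous features. S[j, i] = 1 if beat j is among
    the k nearest neighbors of beat i (self-matches excluded)."""
    t = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]        # row i holds the neighbors of beat i
    S = np.zeros((t, t), dtype=np.uint8)
    S[nn.ravel(), np.repeat(np.arange(t), k)] = 1
    return S

def skew(S):
    """Re-index links by lag so that diagonals of S become rows of a (2t, t) matrix."""
    t = S.shape[0]
    R = np.zeros((2 * t, t), dtype=S.dtype)
    rows, cols = np.nonzero(S)
    R[rows - cols + t, cols] = 1                # row index = lag + t
    return R

def filter_repetitions(R, width=7):
    """Horizontal median filter; an odd width keeps the matrix binary."""
    return median_filter(R, size=(1, width))

# Toy input: a 20-beat section played twice.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 12))
X = np.vstack([A, A])
R = skew(knn_self_similarity(X, k=1))
R_filt = filter_repetitions(R, width=5)
print("rows with repetition:", np.nonzero(R_filt.any(axis=1))[0])
```

In the toy input the repeat shows up as two horizontal runs at lags of plus and minus 20 beats, and an isolated spurious link would be removed by the median filter while the run boundaries stay sharp.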
“Remixavier”
Raffel
• Optimal align-and-cancel of mix and acapella
timing and channel may differ
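The align-and-cancel idea can be sketched in its simplest form: find the time offset of the acapella inside the mix by cross-correlation, fit a gain by least squares, and subtract. This is only an illustration under strong assumptions (a single constant lag and a scalar gain standing in for the channel) and is not the Remixavier implementation, which has to cope with timing and channel differences that this sketch ignores. All function names are hypothetical.

```python
import numpy as np

def estimate_lag(mix, acap):
    """Lag (in samples) at which acap best lines up inside mix."""
    xc = np.correlate(mix, acap, mode="full")
    return int(np.argmax(np.abs(xc))) - (len(acap) - 1)

def shift_to(x, lag, length):
    """Place x at offset lag in a zero signal of the given length."""
    out = np.zeros(length)
    src = max(0, -lag)
    dst = max(0, lag)
    n = min(len(x) - src, length - dst)
    if n > 0:
        out[dst:dst + n] = x[src:src + n]
    return out

def align_and_cancel(mix, acap):
    """Return (residual, lag, gain) after removing the aligned, scaled acapella."""
    lag = estimate_lag(mix, acap)
    aligned = shift_to(acap, lag, len(mix))
    gain = float(aligned @ mix) / float(aligned @ aligned)   # least-squares gain
    return mix - gain * aligned, lag, gain

# Synthetic check: noise stands in for the backing track and the vocal.
rng = np.random.default_rng(0)
backing = rng.standard_normal(4000)
vocal = rng.standard_normal(3000)
mix = backing.copy()
mix[37:37 + len(vocal)] += 0.5 * vocal
residual, lag, gain = align_and_cancel(mix, vocal)
print(lag, round(gain, 3))
```

Replacing the scalar gain with a short least-squares filter would be the natural next step toward handling a differing channel.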
Singing ASR
McVicar
• Speech recognition adapted to singing
needs aligned data
• Extensive work to match up scraped
“acapellas” and full mix
including jumps!
Million Song Dataset
Bertin-Mahieux
McFee
• Many Facets
Echo Nest audio features + metadata
Echo Nest “taste profile” user-song-listen count
Second Hand Song covers
musiXmatch lyric BoW
last.fm tags
• Now with audio?
resolving artist / album / track / duration against what.cd
MIDI-to-MSD
Raffel
Shi
• Aligned MIDI to Audio is a nice transcription
• Can we find matches in large databases?
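The slide does not say how the MIDI is aligned to the audio. One common approach is dynamic time warping (DTW) over per-frame features of the audio and of a synthesized MIDI rendition, and the sketch below assumes that approach. The function name `dtw` and the one-dimensional toy features are hypothetical; in practice the cost matrix would come from something like chroma distances.

```python
import numpy as np

def dtw(C):
    """Dynamic time warping over cost matrix C (n x m).
    Returns (total cost, path as a list of (i, j) index pairs)."""
    n, m = C.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((D[i - 1, j - 1], i - 1, j - 1),
                      (D[i - 1, j], i - 1, j),
                      (D[i, j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    return float(D[n, m]), path[::-1]

# Toy example: the second sequence lingers on some of the first sequence's values.
midi_feat = np.array([0.0, 1.0, 2.0, 3.0])
audio_feat = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 3.0])
C = np.abs(midi_feat[:, None] - audio_feat[None, :])
cost, path = dtw(C)
print(cost, path)
```

The total cost also gives a crude match score, which is one way to approach the database search question in the second bullet, although the quadratic cost per pair makes it expensive at scale.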
Summary
• Basic techniques
beat tracking, segmentation, chord recognition, transcription
• More data
audio
alignments
aligned transcriptions
• Sharing code and data