Sound Analysis Research at LabROSA Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/ 1. Speech 2. Music 3. Environmental Sound LabROSA Overview - Dan Ellis 2005-04-01 - 1 /17 LabROSA Overview Information Extraction Music Eigenrhythms Machine Meeting Learning turns Environment Personal audio FDLP Signal Processing Speech LabROSA Overview - Dan Ellis 2005-04-01 - 2 /17 1. Speech Analysis / Recognition with Sambarta Bhattacharjee recognizers work for read speech • Speech poorly for spontaneous e.g. 5% errors ➝ 30% • Transform spontaneous speech to read? Spont speech pole freq Read 20 15 10 5 50 100 150 200 250 300 350 400 450 500 550 Spontaneous 20 15 10 5 50 100 150 200 250 300 350 400 450 LabROSA Overview - Dan Ellis 500 550 slope < 1 ➝ reduction Read speech pole freq 2005-04-01 - 3 /17 Meeting Recordings with Jerry Liu and ICSI • Multi-mic recordings for speaker turns every voice reaches every mic... (?) ... but with differing coupling filters (delays, gains) • Find turns with minimal assumptions e.g. ad-hoc sensor setups (multiple PDAs) differences to remove effect of source signal - no spectral models, < 1xRT LabROSA Overview - Dan Ellis 2005-04-01 - 4 /17 Speaker Turns from Timing Diffs • Find best timing skew between mic pairs • Find clusters in high-confidence points Gaussians to each cluster, • Fit assign that class to all frames within radius ICSI0: good points All pts: nearest class All pts: closest dimension 0 0 0 -20 -20 -20 -40 -40 -40 -60 -60 -60 -80 -80 -80 -100 -100 -50 0 -100 -100 LabROSA Overview - Dan Ellis -50 0 -100 -100 -50 0 2005-04-01 - 5 /17 2. Music Signal Analysis • A lot of music data available e.g. 60G of MP3 ≈ 1000 hr of audio/15k tracks • What can we do with it? implicit definition of ‘music’ • Quality vs. quantity Speech recognition lesson: 10x data, 1/10th annotation, twice as useful • Motivating Applications music similarity / classification computer (assisted) music generation insight into music LabROSA Overview - Dan Ellis 2005-04-01 - 6 /17 Transcription as Classification with Graham Poliner • Signal models typically used for transcription harmonic spectrum, superposition • But ... trade domain knowledge for data transcription as pure classification problem: Audio Trained classifier p("C0"|Audio) p("C#0"|Audio) p("D0"|Audio) p("D#0"|Audio) p("E0"|Audio) p("F0"|Audio) single N-way discrimination for “melody” per-note classifiers for polyphonic transcription LabROSA Overview - Dan Ellis 2005-04-01 - 7 /17 Classifier Transcription Results • Trained on MIDI syntheses (32 songs) SMO SVM (Weka) • Tested on ISMIR MIREX 2003 set foreground/background separation Frame-level pitch concordance system “jazz3” overall fg+bg 71.5% 44.3% just fg 56.1% 45.4% LabROSA Overview - Dan Ellis 2005-04-01 - 8 /17 Eigenrhythms: Drum Pattern Space with John Arroyo • Pop songs built on repeating “drum loop” bass drum, snare, hi-hat small variations on a few basic patterns • Eigen-analysis (PCA) to capture variations? by analyzing lots of (MIDI) data • Applications music categorization “beat box” synthesis LabROSA Overview - Dan Ellis 2005-04-01 - 9 /17 Eigenrhythms 20+ Eigenvectors for good coverage • Need of 100 training patterns (1200 dims) • Top patterns: LabROSA Overview - Dan Ellis 2005-04-01 - 10/17 Eigenrhythms for Classification • Projections in Eigenspace / LDA space PCA(1,2) projection (16% corr) LDA(1,2) projection (33% corr) 10 6 blues 4 country disco hiphop 2 house newwave rock 0 pop punk -2 rnb 5 0 -5 -10 -20 -10 0 10 -4 -8 -6 -4 -2 • 10-way Genre classification (nearest nbr): PCA3: 20% correct LDA4: 36% correct LabROSA Overview - Dan Ellis 2005-04-01 - 11/17 0 2 3. Other Sounds: Clap Detection clapping may • Rhythmic help neural development with Nathan Lesser sensori-motor planning focus and attention metronome” • “Interactive devices give feedback on synchrony sensor-based • Classroom deployment? acoustic-based? for multiple simultaneous users?? LabROSA Overview - Dan Ellis from interactivemetronome.com 2005-04-01 - 12/17 burst for • Initial near-field “direct sound” Far-field (327MUDD ff50:4) 0.25 0 -0.25 -0.5 energy (4ms) / dB reverberation (RT60 ~ 900ms) Near-field (327MUDD nf50:4) 0.5 0 -20 -40 -60 -80 freq / kHx level • Absolute varies slopes • Decay ~ same amplitude Clap Range Discrimination 10 8 6 4 2 0 0 LabROSA Overview - Dan Ellis 0.1 0.2 0.3 0.4 0.5 0.6 0.7 time / s 0 0.1 0.2 0.3 0.4 2005-04-01 - 13/17 0.5 0.6 0.7 time / s “Personal Audio” • Easy to record everything you hear ~100GB / year @ 64 kbps • Very hard to find anything with Keansub Lee how to scan? how to visualize? how to index? • Starting point: Collect data ~ 60 hours (8 days, ~7.5 hr/day) hand-mark 139 segments (26 min/seg avg.) assign to 16 classes (8 have multiple instances) LabROSA Overview - Dan Ellis 2005-04-01 - 14/17 Features for Long Recordings • Feature frames = 1 min (not 25 ms!) • Characterize variation within each frame... Normalized Energy Deviation Average Linear Energy 120 15 100 10 80 15 40 10 20 5 5 dB Average Log Energy 60 dB Log Energy Deviation 120 15 100 10 80 20 freq / bark 20 freq / bark 60 20 freq / bark freq / bark 20 5 15 15 10 10 5 5 60 dB dB Spectral Entropy Deviation Average Spectral Entropy 0.9 0.8 15 0.7 10 5 • 0.6 0.5 bits 20 freq / bark freq / bark 20 0.5 15 0.4 10 0.3 0.2 5 0.1 50 100 150 200 250 300 350 400 450 time / min and structure within coarse auditory bands LabROSA Overview - Dan Ellis 2005-04-01 - 15/17 bits Personal Audio Applications / • Visualization browsing / diary inference link in other information sources - diary - email • NoteTaker interface: “what was I hearing?” LabROSA Overview - Dan Ellis 2005-04-01 - 16/17 LabROSA Summary • LabROSA signal processing + machine learning + information extraction • Applications Speech: Recognition, Organization Music: Transcription, Recommendation Environment: Detection, Description • Also... signal separation, compression, dolphins... LabROSA Overview - Dan Ellis 2005-04-01 - 17/17