Feature Extraction in Content-Based Musical Instrument Recognition

Music Information Retrieval
Social and Commercial Implication

• Enormous amount of recorded music throughout the world
• Over 10,000 new CDs are released every year
• The search for music, specifically in the MP3 format, is the most popular retrieval request
• In the U.S. alone, 1.08 billion units of recorded music (e.g., CDs, cassettes, music videos), valued at $14.3 billion, were shipped to retailers in the year 2000
Traditional Music IR mechanism

Text-based examples
• Peer-to-peer file sharing software
• FTP
• Streaming audio
• Websites
• Online network drives
• Clip Art
text
mp3
Content-based MIR

Music Information Retrieval (MIR)

Traditional MIR systems
• Based on symbolic representation: title, lyrics, composer, performer, singer
• Use techniques similar to those of text retrieval
Disadvantages:
• The information is not enough to represent the content of the music
• There are enormous amounts of music without any textual information

Content-Based MIR systems
• Based on high-level music content: timbre, melody, rhythm, genre, mood
• Content-based recognition techniques: feature extraction and pattern recognition
Scenario 1
[Figure: a user submits a query (“Chopin’s Scherzo – a delightful piano melody”) together with previous preferences. The MIR system (timbre, singer, melody, rhythm, genre, and mood recognition, plus a series of other processes) searches music on the Internet and music in databases and returns summaries of the ten best candidates. The query is repeated until the user places an order for the final selection and gets a full piece.]
Scenario 2
• Listen to the audio program “Music of the World” and cut the jazz pieces from the audio stream
• Search a jazz bibliography
• Maintain a categorical archive
[Figure: archive tree. Music: Jazz, Classical (Bach, Beethoven, Mozart), Pop, R&B, Rock]
Content-based MIR

Advantages
• Quick and easy searching
• Fewer problems with text labeling

Disadvantages
• Difficulties in sound recognition
Philips Audio Fingerprinting

Audio Fingerprinting
• A microphone device captures your voice
• “Audio fingerprints” are determined
• The melody query is then sent to a database
• Results are returned
• Good for finding a song you don’t know by name but know by tune

Jaap Haitsma and Ton Kalker. “A Highly Robust Audio Fingerprinting System”. Proceedings of ISMIR 2002, Paris, France, October 2002.
Philips Audio Fingerprinting

Applications
• Music recognition over mobile phones
• Building broadcast monitoring systems for
  - copyright verification
  - commercial verification
  - royalty metering
• Maintaining personal music archives
• Allows the creation of music-aware networks, allowing reliable control over the flow of copyrighted music
• An essential tool for building more secure digital audio watermarks
Philips Audio Fingerprinting

Performance
• Efficient
  - Typically 3 seconds of music is enough to identify the music
  - Fast
• Robust
  - Is not affected by compression or environmental noise
• Highly accurate
  - Can discriminate between different versions of a song, even if performed by the same artist
Philips Audio Fingerprinting

Main Challenges
• Unique features
  - These features are loosely analogous to normal fingerprints for human beings
• Clever searching
  - Algorithms to compare these audio fingerprints to large databases of previously extracted audio fingerprints

Other Challenges
• Must be able to work with short segments of music
• The music that is offered to the fingerprint extractor is of poor quality
• The easiest way to lose customers is to provide poor performance
• The cell-phone recognition service must make economical sense
http://www.research.philips.com/InformationCenter/Global/FArticleSummary.asp?lNodeId=927
SoundFisher
Keislar, D., T. Blum, J. Wheaton, & E. Wold. “A content-aware sound browser”. Proc. of the International Computer Music Conference, ICMA, 1999.
http://www.soundfisher.com
Muscle Fish
SoundFisher

Supports Traditional Text and Numeric Queries
• Performs text searches on keywords, comments, composer or performer names, and user-defined fields.
• Searches and sorts by file attributes, such as sample rate, number of channels, etc.

Powerful "Sounds-like" Queries
• Finds sounds by example using selected sounds or user-defined sound categories.
• Searches and sorts sounds by similarity, based on pitch, loudness, brightness and/or overall timbre.
• However, SoundFisher is not intended as a search mechanism for music catalogs: it does not address sound at the level of the musical phrase, melody, rhythm or tempo.

Permits Flexible Categories
• Supports sound categories that contain nested categories.
• No need to conform to predefined categories.
IRCAM Studio-On-Line


A content-based search & classification interface

The "Sound Palette" provides access to an instrumental sound database, as well as a Web site on instruments and their playing modes.
IRCAM Studio-On-Line

• Offers a primitive search-by-perceptual-similarity function
  - Search available sounds through high-level criteria (sound categories, dynamic profiles, timbre similarity, etc.)
• Offers manual and automatic sound labeling and classification
  - Learns classification criteria from user-provided sample training sets, and then performs automatic classification of newly entered samples among the learned classes
• Access to audio material in various formats
  - mp3 streaming, audio files, and compressed archives download
Music Recognition
[Figure: existing systems and the recognition tasks they relate to. Audio Fingerprinting (unique features, clever searching), SoundFisher ("sounds-like" queries and categories), and Studio-On-Line (search by perceptual similarity) surround the music recognition tasks: timbre, singer, melody, rhythm, genre, and mood recognition.]
Challenges:
• Multirepresentational Challenge
• Multicultural Challenge
• Multiexperiential Challenge
Music Recognition

• Timbre recognition
  - Find solo violin pieces
  - Which works use the following combination of instruments?
• Singer recognition
  - Find songs by Sting
  - Who sings the song I just played?
  - I want to find some songs for karaoke, where the singer’s voice is similar to my own
Music Recognition

• Melody recognition
  - Find songs with a similar melody to what I am listening to right now
• Rhythm recognition
  - Find songs with a dance rhythm
• Genre recognition
  - Find songs whose style is similar to the one I am listening to now
• Mood recognition
  - Find a soft song to calm myself after a hard day at UST (Univ. of Stress & Tension)
Music Recognition

• Automatic Music Transcription Systems
  - Transcribe sound files to high-level musical notation (MIDI files or sheet scores)
• Music Annotation and Indexing
  - Segment audio streams and assign symbols to indicate the content of the segment.
• MPEG-7 descriptors
  - The MPEG description consists of semantic descriptors (e.g., type of music) and perceptual features describing the audio content.
• Video Indexing and Retrieval
  - Audio is an important component in video indexing and retrieval.
Timbre Recognition
Timbre

The four basic perceptual attributes of sound
• Pitch / Fundamental frequency
• Loudness / Amplitude
• Duration
• Timbre

Definition:
• Timbre is the quality of a sound by which a listener can judge that two sounds of the same loudness and pitch are dissimilar.

Instrument Recognition

Recognize instrument families and individual instruments

Five classes of instruments
• The strings
  - Instruments with strings that are played by touching them with a bow or plectrum.
  - Violin, viola, cello, double bass, guitar
• The brass
  - Wind instruments made of brass
  - Trumpet, trombone, French horn, tuba
• The woodwinds
  - Wind instruments often made of wood (flutes and reeds)
  - Double reeds, clarinets, saxophones, flutes, piccolo, bassoon
• The percussion
  - Timpani, marimba, drums, cymbal, gong
• The keyboard
  - Harpsichord, piano, organ
What to Recognize
Instrument family AND individual instrument
Timbre Recognition

Instrument families


Strings, brass, woodwinds, percussion, keyboard
A taxonomic hierarchy
All Instruments
  Released
    Piano
      Piano
    Pizzicato strings
      Guitar, Violin, Viola, Cello, Double Bass
  Sustained
    Bowed strings
      Violin, Viola, Cello, Double Bass
    Flute or Piccolo
      Flute, Alto flute, Bass flute, Piccolo
    Brass or Reeds
      Reeds
        Oboe, English horn, Bassoon, Clarinet, Saxophone
      Brass
        Trumpet, French horn, Trombone, Tuba
Recognition Systems
Human Recognition System

Little is known about the human sound source recognition system
[Figure: McAdams's model of human auditory processing, with the stages sensory transduction, auditory grouping, analysis of features, matching with lexicon, meaning & significance, lexicon of names, and recognition]
Recognition Systems
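[Figure: recognition system block diagram. The input signal S(n) passes through preprocessing and a feature extractor, which produces a multi-level representation (temporal, spectral, cepstral, and other features). In training, model training builds the instrument models; in classification, the classifier uses the instrument models to output the timbre.]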
Recognition Systems
Monophonic recognition
• Single note, professionally recorded or synthesized with high fidelity
• Simple, but includes most of the fundamental techniques
• Can be used to evaluate timbre features, because timbre is especially obvious when there is only one note
• Many evaluation sample collections exist, but they are still incomplete

Polyphonic recognition
• Overlapped sounds of different instruments played together: a duet, a trio, or an orchestral piece
• Difficult; more complex techniques such as pitch tracking and source separation are needed
• More practical and useful, since most music recordings are polyphonic
• No good sample collections; usually evaluated with a very small dataset
Recognition Systems

Evaluation Criteria
• Accuracy
  - The system should be able to recognize different kinds of instruments with high accuracy.
• Generality
  - The recognition should not depend on a particular performer and the particular acoustic environment.
• Robustness
  - The system should ideally be able to handle real world sounds with noise, reverberation, and competing sound sources.
• Scalability
  - The system should be able to accept a new sound source and learn to recognize it without decreasing the system performance.
  - When new sound sources are continually introduced to the system, the performance should decrease gradually.
• Realtime
  - The system should be able to recognize a source in real time.
Recognition Systems

Evaluation data collections
• Monophonic collections
  - McGill University Master Samples (MUMS)
  - University of Iowa Musical Instrument Samples
  - IRCAM Studio-On-Line Samples (IRCAM SOL)
  - RWC Music Database
  - Evaluations use either one sample collection or several sample collections
• No good sample collections for polyphonic music
  - Researchers have used their own data collections in the evaluation.
Use single data collection in the evaluation

System | Features | Accuracy | Evaluation data
Kaminskyj & Voumard 96 | 7 | 98% | 19 instruments: the instruments are very different and note range is small
Martin and Kim 98 | 31 | 70% / 90% | 1023 sounds of 14 instruments in McGill collection
Fujinaga 98 | 7 | 50.3% |
Fujinaga 99 | 20 | 64% | Over 1300 sounds of 39 timbres from 23 instruments in McGill collection
Fujinaga 00 | 22 | 68% |
Eronen & Klapuri 00 | 43 | 80% / 94% | 1498 sounds of 30 instruments from McGill collection
Peeters & Rodet | 81 | 86% / 89% | 1400 sounds of 14 instruments from IRCAM SOL
Peeters & Rodet | 27 | 81% / 87% |

Use multiple data collections in the evaluation

System | Features | Accuracy | Evaluation data
Martin 99 | 31 | 39% / 76% | 1500 sounds of 27 instruments from three sources: McGill, MIT Music Library’s compact disc collection, and recordings made especially for this project
Eronen 01 | 38 | 35% / 77% | 5286 sounds of 29 instruments from five sources: McGill, Tampere guitar collection, UIowa, IRCAM SOL, and Roland XP-30 synthesizer
Livshin, Peeters & Rodet | N/A | N/A | 1325 sounds of 16 instruments from five sources: IRCAM SOL, UIowa, McGill, Prosonus and Vitus collections
Feature Extraction
Feature Extraction
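[Figure: feature extraction pipeline. An audio clip (a violin note) is split into frames s1, s2, s3, ..., sn, ..., sM, from which temporal features are computed. The DFT of each frame gives the spectra S1, S2, S3, ..., Sn, ..., SM, from which spectral features and cepstral features are computed.]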
Temporal Features

Frame features
• Amplitude & Loudness
  - Root Mean Square
    $$\mathrm{RMS}(n) = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1} S_n^2(i)}$$
  - Short time Energy
    $$\mathrm{STE}(n) = \frac{1}{N}\sum_{i=0}^{N-1}\left[S_n(i)\,w(N-1-i)\right]^2$$

Clip Features
• Combination of frame features
  - Mean and standard deviation of RMS
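The frame-level temporal features (RMS and short-time energy) and the clip-level mean and standard deviation of RMS can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, frames are plain lists of samples, and the window defaults to a rectangular one because the slides do not say which window w is used.

```python
import math

def frame_rms(frame):
    """Root mean square of one frame: sqrt((1/N) * sum of squared samples)."""
    n = len(frame)
    return math.sqrt(sum(x * x for x in frame) / n)

def frame_ste(frame, window=None):
    """Short-time energy: (1/N) * sum over i of (frame[i] * w[N-1-i])^2."""
    n = len(frame)
    if window is None:
        window = [1.0] * n  # assumption: rectangular window
    return sum((frame[i] * window[n - 1 - i]) ** 2 for i in range(n)) / n

def clip_rms_stats(frames):
    """Clip features: mean and (population) standard deviation of frame RMS."""
    values = [frame_rms(f) for f in frames]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

# Toy clip made of a quiet frame and a loud frame.
frames = [[1.0, -1.0, 1.0, -1.0], [3.0, -3.0, 3.0, -3.0]]
mean_rms, std_rms = clip_rms_stats(frames)
print(mean_rms, std_rms)
```

With a rectangular window the short-time energy is simply the square of the RMS, so the two features differ only when a non-trivial window is supplied.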
Spectral Envelope


• The shape of the spectral envelope is closely related to timbre
• Spectral features can describe some of the spectral envelope
[Figure: spectrum Sn(ω) plotted against frequency ω]
Spectral Features
Spectral Moments
• First Order Moment / Frequency Centroid
  - Center frequency weighted by squared amplitude

$$M_k(n) = \frac{\sum_{i=0}^{N} i^{k}\, S_n^2(i)}{\sum_{i=0}^{N} S_n^2(i)}$$
Spectral Features
Spectral Centroid Moments
• Weighted average difference between spectral components and frequency centroid
• Band-width: square root of the second order centroid moment
• Skewness: third order centroid moment

$$C_k(n) = \frac{\sum_{i=0}^{N} \left(i - M_1(n)\right)^{k} S_n^2(i)}{\sum_{i=0}^{N} S_n^2(i)}$$
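As a rough illustration of the spectral moments and centroid moments, the sketch below computes the frequency centroid, band-width, and skewness of one magnitude spectrum. The function names are hypothetical, frequency is measured in bin indices (as the index i is in the formulas), and returning 0.0 for an all-zero spectrum is an assumption made here to avoid division by zero.

```python
import math

def spectral_moment(mag, k):
    """k-th order spectral moment: sum(i^k * S(i)^2) / sum(S(i)^2)."""
    power = [m * m for m in mag]
    total = sum(power)
    if total == 0:
        return 0.0  # assumption: silent frame maps to 0
    return sum((i ** k) * p for i, p in enumerate(power)) / total

def centroid_moment(mag, k):
    """k-th order centroid moment: sum((i - M_1)^k * S(i)^2) / sum(S(i)^2)."""
    power = [m * m for m in mag]
    total = sum(power)
    if total == 0:
        return 0.0  # assumption: silent frame maps to 0
    centroid = spectral_moment(mag, 1)
    return sum(((i - centroid) ** k) * p for i, p in enumerate(power)) / total

# Toy magnitude spectrum with equal peaks at bins 1 and 3.
mag = [0.0, 1.0, 0.0, 1.0]
centroid = spectral_moment(mag, 1)            # frequency centroid, in bins
bandwidth = math.sqrt(centroid_moment(mag, 2))
skewness = centroid_moment(mag, 3)
print(centroid, bandwidth, skewness)
```

To express the centroid in Hz rather than bins, each index would be multiplied by the bin spacing (sample rate divided by DFT length).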
Spectral Features
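[Figure: spectrum Sn(ω) divided into subbands covering 1/8, 1/8, 1/4, and 1/2 of the frequency range]

Subband Energy and Subband Energy Ratio
• Analogous to frequency bands in human ears
• Represent the energy distribution of the spectrum

$$E_j(n) = \log\left(\sum_{i=L_j}^{H_j} S_n(i)\right) \quad\text{and}\quad ER_j(n) = \frac{E_j(n)}{\sum_{j} E_j(n)}$$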
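The subband energy and energy ratio can be sketched as below, following the formulas as written (log of the summed spectrum values between the band limits L_j and H_j). The function names are hypothetical, and the default band layout of 1/8, 1/8, 1/4, and 1/2 of the bins is an assumption read off the figure labels. The code also assumes every band has a positive sum, since the logarithm is undefined otherwise.

```python
import math

def subband_edges(n_bins, fractions=(1 / 8, 1 / 8, 1 / 4, 1 / 2)):
    """Split bins 0..n_bins-1 into contiguous bands (L_j, H_j), inclusive."""
    edges = []
    start = 0
    cumulative = 0.0
    for fraction in fractions:
        cumulative += fraction
        end = round(cumulative * n_bins)
        edges.append((start, end - 1))
        start = end
    return edges

def subband_energies(spectrum, edges):
    """E_j = log(sum of spectrum values from bin L_j to bin H_j)."""
    return [math.log(sum(spectrum[lo:hi + 1])) for lo, hi in edges]

def subband_energy_ratios(energies):
    """ER_j = E_j divided by the sum of all subband energies."""
    total = sum(energies)
    return [e / total for e in energies]

# Toy flat spectrum with 16 bins.
spectrum = [1.0] * 16
edges = subband_edges(len(spectrum))
ratios = subband_energy_ratios(subband_energies(spectrum, edges))
print(edges)
print(ratios)
```

Because E_j is a logarithm, the ratio can behave oddly when band sums are near or below 1 (the log energies turn zero or negative), so in practice the spectrum scaling matters for this feature.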
Spectral Features
Spectral Irregularity
• Represents the jaggedness of spectral envelope

$$I(n) = \frac{\sum_{i=0}^{N-2}\left(S_n(i) - S_n(i+1)\right)^2}{\sum_{i=0}^{N-1} S_n^2(i)}$$
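A minimal sketch of the spectral irregularity measure follows: the squared differences between neighbouring spectrum values, normalised by the total squared magnitude. The function name is hypothetical, and returning 0.0 for an all-zero spectrum is an assumption made here.

```python
def spectral_irregularity(mag):
    """Sum of squared neighbour differences divided by sum of squared values."""
    numerator = sum((mag[i] - mag[i + 1]) ** 2 for i in range(len(mag) - 1))
    denominator = sum(m * m for m in mag)
    if denominator == 0:
        return 0.0  # assumption: silent frame maps to 0
    return numerator / denominator

smooth = [1.0, 1.0, 1.0, 1.0]   # flat envelope
jagged = [1.0, 0.0, 1.0, 0.0]   # alternating peaks and gaps
print(spectral_irregularity(smooth), spectral_irregularity(jagged))
```

The flat spectrum scores 0 and the alternating one scores higher, which matches the reading of irregularity as jaggedness.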
Spectral Features
Formant Features
• Formant Frequency & Formant Amplitude
  - The position and amplitude of the first two formants are the most important
• Pitch: Fundamental frequency
• Tristimulus
  - The percentage of the low-order formants compared to the higher ones
Cepstral Coefficients

Source-filter Model
• Source: periodic excitation of strings
• Filter: the resonator, body of an instrument
• The shape of the filter spectrum represents the spectral envelope
• How to extract the filter properties: Cepstral Coefficients
[Figure: a source signal passes through a filter to give the output signal. Panel labels: white noise, excitation spectrum, filter spectrum, signal spectrum]
Mel-frequency Cepstral Coefficients
Processing chain: Signal s → Preprocessing → Frame separating → Windowing → DFT → Spectrum → Mel-scaling → Logarithm → DCT → Cepstrum

• Mel Scaling
  - Human auditory system perceives sound logarithmically
• Discrete Cosine Transform (DCT)
  - DCT is taken to separate the filter and excitation properties.
  - The low-order cepstrum is the compact representation of the filter
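The chain from windowing through DFT, mel scaling, logarithm, and DCT can be sketched for a single frame as follows. This is a simplified illustration, not the implementation behind the slides. Several details are assumptions chosen here: the Hamming window, the common 2595·log10(1 + f/700) mel formula, triangular filters, the use of the power spectrum, 20 filters, 12 coefficients, and a small floor before the logarithm. It uses a naive DFT from the standard library, so it is only suitable for short frames.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def magnitude_spectrum(frame):
    """Hamming-window the frame, then take a naive DFT for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    points = [mel_to_hz(top * j / (n_filters + 1)) for j in range(n_filters + 2)]
    bin_hz = [k * nyquist / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = points[j - 1], points[j], points[j + 1]
        weights = []
        for f in bin_hz:
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=12):
    """Mel-frequency cepstral coefficients of one frame (simplified)."""
    spectrum = magnitude_spectrum(frame)
    bank = mel_filterbank(n_filters, len(spectrum), sample_rate)
    floor = 1e-10  # avoids log(0) when a filter covers no energy
    log_energies = [
        math.log(max(sum(w * s * s for w, s in zip(weights, spectrum)), floor))
        for weights in bank
    ]
    # DCT-II of the log mel energies; low-order terms describe the envelope.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energies))
            for c in range(n_coeffs)]

# Toy input: one 256-sample frame of a 440 Hz sine sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(256)]
coefficients = mfcc(frame, sample_rate)
print(len(coefficients))
```

Keeping only the first few DCT coefficients is what yields the compact description of the filter: the slowly varying envelope lands in the low-order terms, while fine harmonic structure from the excitation lands in the higher-order terms that are dropped.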
Feature Evaluation
Feature Evaluation –Sample Collections

Evaluated by sample collections
• Build a recognition system for each evaluated feature or feature set
• Use sample collections to calculate the system performance
• Evaluate features by the system performance (accuracy)

Advantages
• Easy to carry out
• There are free sample collections
Feature Evaluation –Sample Collections

Disadvantages
• Diversity of the music
  - Some sample collections don’t have the same properties as other sample collections
  - Using too few sample collections decreases the generality of the system. The accuracy of these systems is skewed.
  - These systems do not satisfy the generality criterion
• Since we do not know how many sample collections are needed, it may not be reliable to use incomplete sample collections to:
  - Evaluate a recognition system
  - Evaluate the effectiveness of a feature
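One way to probe the generality concern is to hold out a whole sample collection, train on the others, and compare the accuracies. The sketch below does this with a nearest-neighbour classifier. Everything in it is illustrative: the classifier choice, the function names, the collection names, and the two-dimensional toy feature vectors are assumptions, not data or methods from the systems surveyed above.

```python
import math

def nearest_neighbour(train, query):
    """Return the label of the training vector closest to the query."""
    features, label = min(train, key=lambda item: math.dist(item[0], query))
    return label

def accuracy(train, test):
    """Fraction of test samples whose predicted label matches the true one."""
    hits = sum(1 for feats, label in test
               if nearest_neighbour(train, feats) == label)
    return hits / len(test)

def cross_collection_accuracy(collections):
    """Hold out each collection in turn and train on all the others."""
    results = {}
    for held_out, test in collections.items():
        train = [sample
                 for name, samples in collections.items() if name != held_out
                 for sample in samples]
        results[held_out] = accuracy(train, test)
    return results

# Hypothetical toy collections: (feature vector, instrument label) pairs.
collections = {
    "collection_a": [((0.0, 0.0), "violin"), ((1.0, 1.0), "flute")],
    "collection_b": [((0.1, 0.0), "violin"), ((0.2, 0.1), "flute")],
}
print(cross_collection_accuracy(collections))
```

A large gap between the held-out accuracies would suggest that the feature set captures properties of a particular collection rather than of the instruments themselves.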