Music Information Retrieval

Social and Commercial Implication
• An enormous amount of recorded music exists throughout the world
• Over 10,000 new CDs are released every year
• The search for music, specifically in the MP3 format, is the most popular retrieval request
• In the U.S. alone, 1.08 billion units of recorded music (e.g., CDs, cassettes, music videos), valued at $14.3 billion, were shipped to retailers in the year 2000

Traditional Music IR mechanism
• Text based
• Examples: peer-to-peer file sharing software, FTP, streaming audio, websites, online network drives
• [Figure labels: text, mp3]

Content-based MIR
Music Information Retrieval (MIR)

Traditional MIR systems
• Based on symbolic representation: title, lyrics, composer, performer, singer
• Use techniques similar to text retrieval
• Disadvantages:
  • The information is not enough to represent the content of the music
  • There are enormous amounts of music without any textual information

Content-Based MIR systems
• Based on high-level music content: timbre, melody, rhythm, genre, mood
• Content-based recognition techniques: feature extraction and pattern recognition

Scenario 1
• Recognition components: timbre recognition, singer recognition, melody recognition, rhythm recognition, genre recognition, mood recognition
• Previous preferences
• Chopin's Scherzo - a delightful piano melody ......
query until
• Place an order for the final selection and get a full piece
• Summaries of ten best candidates
• MIR System: timbre recognition, singer recognition, melody recognition, rhythm recognition, genre recognition, mood recognition, and a series of other processes
• Music on Internet
• Music in databases

Scenario 2
• Listen to the audio program "Music of the World"
• Cut the jazz pieces from the audio stream
• Searching bibliography: jazz bibliography
• Maintain a categorical archive:
  Music
    Jazz
    Classical: Bach, Beethoven, Mozart
    Pop
    R&B
    Rock

Content-based MIR
• Advantages: quick and easy searching; fewer problems with text labeling
• Disadvantages: difficulties in sound recognition

Philips Audio Fingerprinting
• A microphone device captures your voice
• "Audio fingerprints" are determined
• The melody query is then sent to a database
• Results are returned
• Good for finding a song you don't know by name but know by tune

Jaap Haitsma and Ton Kalker. "A Highly Robust Audio Fingerprinting System". Proceedings of ISMIR 2002, Paris, France, October 2002.
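The capture, fingerprint, and database-lookup steps above can be illustrated with a toy sketch. It is not the Philips system: it only borrows the general idea of deriving bits from the sign of energy differences between neighbouring frequency bands and frames, then matching by bit error rate. The function names, the equal-width bands, and all parameter values (`frame_len`, `hop`, `n_bands`) are assumptions made for illustration, and the query is assumed to be time-aligned with the database entry.

```python
import numpy as np

def sub_fingerprints(signal, frame_len=2048, hop=64, n_bands=33):
    """Return a 0/1 array with (n_bands - 1) bits per frame transition."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Equal-width bands over the spectrum (a simplifying assumption).
    edges = np.linspace(0, frame_len // 2 + 1, n_bands + 1).astype(int)
    energy = np.empty((n_frames, n_bands))
    for n in range(n_frames):
        frame = signal[n * hop : n * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        energy[n] = [power[lo:hi].sum() for lo, hi in zip(edges[:-1], edges[1:])]
    # Energy difference between adjacent bands, differenced again over time;
    # each bit records only the sign of that quantity.
    band_diff = energy[:, :-1] - energy[:, 1:]
    return (band_diff[1:] - band_diff[:-1] > 0).astype(np.uint8)

def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits over the overlapping part of two fingerprints."""
    n = min(len(fp_a), len(fp_b))
    return float(np.mean(fp_a[:n] != fp_b[:n]))

def identify(query_fp, database):
    """Return the database key whose fingerprint has the lowest bit error rate."""
    return min(database, key=lambda name: bit_error_rate(query_fp, database[name]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for two recordings (hypothetical data).
    songs = {"song_a": rng.standard_normal(16000), "song_b": rng.standard_normal(16000)}
    database = {name: sub_fingerprints(audio) for name, audio in songs.items()}
    # A degraded copy of song_a plays the role of the captured query.
    query = songs["song_a"] + 0.05 * rng.standard_normal(16000)
    print(identify(sub_fingerprints(query), database))
```

Keeping only sign bits is what makes this kind of representation tolerant of moderate noise: small perturbations flip only a minority of bits, so the correct entry still has the lowest bit error rate. A practical system would also need to search over time offsets and scale to large databases, which this sketch does not attempt.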
Philips Audio Fingerprinting Applications
• Music recognition over mobile phones
• Building broadcast monitoring systems for copyright verification, commercial verification, and royalty metering
• Maintaining personal music archives
• Creating music-aware networks that allow reliable control over the flow of copyrighted music
• An essential tool for building more secure digital audio watermarks

Philips Audio Fingerprinting Performance
• Efficient: typically 3 seconds of music is enough to identify the music
• Fast
• Robust: is not affected by compression or environmental noise
• Highly accurate: can discriminate between different versions of a song, even if performed by the same artist

Philips Audio Fingerprinting Main Challenges
• Unique features: these features are loosely analogous to normal fingerprints for human beings
• Clever searching: algorithms to compare these audio fingerprints to large databases of previously extracted audio fingerprints

Other Challenges
• Must be able to work with short segments of music
• The music that is offered to the fingerprint extractor is of poor quality
• The easiest way to lose customers is to provide poor performance
• The cell-phone recognition service must make economic sense

http://www.research.philips.com/InformationCenter/Global/FArticleSummary.asp?lNodeId=927

SoundFisher
Keislar, D., T. Blum, J. Wheaton, & E. Wold. "A content-aware sound browser". Proc. of the International Computer Music Conference, ICMA, 1999.
http://www.soundfisher.com
Muscle Fish SoundFisher

Supports Traditional Text and Numeric Queries
• Performs text searches on keywords, comments, composer or performer names, and user-defined fields.
• Searches and sorts by file attributes, such as sample rate, number of channels, etc.
Powerful "Sounds-like" Queries
• Finds sounds by example using selected sounds or user-defined sound categories.
• Searches and sorts sounds by similarity, based on pitch, loudness, brightness, and/or overall timbre.
• However, SoundFisher is not intended as a search mechanism for music catalogs: it does not address sound at the level of the musical phrase, melody, rhythm, or tempo.

Permits Flexible Categories
• Supports sound categories that contain nested categories.
• No need to conform to predefined categories.

IRCAM Studio-On-Line
• A content-based search and classification interface
• The "Sound Palette" provides access to an instrumental sound database, as well as a Web site on instruments and their playing modes.
• Offers a primitive search-by-perceptual-similarity function
• Offers manual and automatic sound labeling and classification
• Searches available sounds through high-level criteria (sound categories, dynamic profiles, timbre similarity, etc.)
• Learns classification criteria from user-provided sample training sets, and then automatically classifies newly entered samples among the learned classes
• Provides access to audio material in various formats: mp3 streaming, audio files, and compressed archive downloads

Music Recognition
• Audio Fingerprinting: unique features, clever searching
• SoundFisher: "sounds-like" queries and categories
• Studio-On-Line: search by perceptual similarity
• Music recognition: timbre recognition, singer recognition, melody recognition, rhythm recognition, genre recognition, mood recognition
Challenges:
• Multirepresentational Challenge
• Multicultural Challenge
• Multiexperiential Challenge

Music Recognition
Timbre recognition
• Find solo violin pieces
• Which works use the following combination of instruments?
Singer recognition
• Find songs by Sting
• Who sings the song I just played?
• I want to find some songs for karaoke, where the singer's voice is similar to my own

Music Recognition
Melody recognition
• Find songs with a similar melody to what I am listening to right now
Rhythm recognition
• Find songs with a dance rhythm
Genre recognition
• Find songs whose style is similar to the one I am listening to now
Mood recognition
• Find a soft song to calm myself after a hard day at UST (Univ. of Stress & Tension)

Music Recognition
Automatic Music Transcription Systems
• Transcribe sound files to high-level musical notation (MIDI files or sheet scores)
Music Annotation and Indexing
• Segment audio streams and assign symbols to indicate the content of each segment.
MPEG-7 descriptors
• The MPEG description consists of semantic descriptors (e.g., type of music) and perceptual features describing the audio content.
Video Indexing and Retrieval
• Audio is an important component in video indexing and retrieval.

Timbre Recognition

Timbre
The four basic perceptual attributes of sound:
• Pitch / fundamental frequency
• Loudness / amplitude
• Duration
• Timbre
Definition: Timbre is the quality of a sound by which a listener can judge that two sounds of the same loudness and pitch are dissimilar.

Instrument Recognition
• Recognize instrument families and individual instruments

Five classes of instruments
The strings
• Instruments with strings that are played by touching them with a bow or plectrum.
• Examples: violin, viola, cello, double bass, guitar
The brass
• Wind instruments made of brass
• Examples: trumpet, trombone, French horn, tuba
The woodwinds
• Wind instruments often made of wood (flutes and reeds)
• Examples: double reeds, clarinets, saxophones, flutes, piccolo, bassoon
The percussion
• Examples: timpani, marimba, drums, cymbal, gong
The keyboard
• Examples: harpsichord, piano, organ

What to Recognize
• Instrument family AND individual instrument

Timbre Recognition
Instrument families
• Strings, brass, woodwinds, percussion, keyboard
A taxonomic hierarchy
All Instruments
  Released
    Piano: Piano
    Pizzicato strings: Guitar, Violin, Viola, Cello, Double Bass
  Sustained
    Bowed strings: Violin, Viola, Cello, Double Bass
    Flute or Piccolo: Flute, Alto flute, Bass flute, Piccolo
    Brass or Reeds
      Reeds: Oboe, English horn, Bassoon, Clarinet, Saxophone
      Brass: Trumpet, French horn, Trombone, Tuba

Recognition Systems
Human Recognition System
• Little is known about the human sound source recognition system
• McAdams's model of human auditory processing: sensory transduction → auditory grouping → analysis of features → matching with lexicon → recognition, with links to meaning and significance and to a lexicon of names

Recognition Systems
[Figure: block diagram. Labels: S(n), preprocessing, feature extractor (temporal features, spectral features, cepstral features, other features), multi-level representation, training path (model training, instrument model), classification path (classifier)]

Timbre Recognition Systems
Monophonic recognition
• Single note, professionally recorded or synthesized with high fidelity
• Simple, but includes most of the fundamental techniques
• Can be used to evaluate timbre features, because timbre is especially obvious when there is only one note
• Many evaluation sample collections exist, but they are still incomplete
Polyphonic recognition
• Overlapped sounds of different instruments played together: a duet, a trio, or an orchestral piece
• Difficult: more complex techniques such as pitch tracking and source separation are needed
• More practical and useful, since most music recordings are polyphonic
• No good sample collections.
• Usually evaluated with a very small dataset

Recognition Systems
Evaluation Criteria
• Accuracy: The system should be able to recognize different kinds of instruments with high accuracy.
• Generality: The recognition should not depend on a particular performer or a particular acoustic environment.
• Scalability: The system should be able to accept a new sound source and learn to recognize it without decreasing the system performance. When new sound sources are continually introduced to the system, the performance should decrease only gradually.
• Robustness: The system should ideally be able to handle real-world sounds with noise, reverberation, and competing sound sources.
• Realtime: The system should be able to recognize a source in real time.

Recognition Systems
Evaluation data collections
Monophonic collections
• Used in two ways: one sample collection in the evaluation, or several sample collections in the evaluation
• McGill University Master Samples (MUMS)
• University of Iowa Musical Instrument Samples
• IRCAM Studio-On-Line Samples (IRCAM SOL)
• RWC Music Database
Polyphonic collections
• No good sample collections for polyphonic music
• Researchers have used their own data collections in the evaluation.
Use single data collection in the evaluation

System | Features | Accuracy | Evaluation data
Kaminskyj & Voumard 96 | 7 | 98% | 19 instruments: the instruments are very different and the note range is small
Martin and Kim 98 | 31 | 70% / 90% | 1023 sounds of 14 instruments in McGill collection
Fujinaga 98 | 7 | 50.3% | Over 1300 sounds of 39 timbres from 23 instruments in McGill collection
Fujinaga 99 | 20 | 64% | (as for Fujinaga 98)
Fujinaga 00 | 22 | 68% | (as for Fujinaga 98)
Eronen & Klapuri 00 | 43 | 80% / 94% | 1498 sounds of 30 instruments from McGill collection
Peeters & Rodet | 81 | 86% / 89% | 1400 sounds of 14 instruments from IRCAM SOL
Peeters & Rodet | 27 | 81% / 87% | (as in the row above)

Use multiple data collections in the evaluation

System | Features | Accuracy | Evaluation data
Martin 99 | 31 | 39% / 76% | 1500 sounds of 27 instruments from three sources: McGill, MIT Music Library's compact disc collection, and recordings made especially for this project
Eronen 01 | 38 | 35% / 77% | 5286 sounds of 29 instruments from five sources: McGill, Tampere guitar collection, UIowa, IRCAM SOL, and Roland XP-30 synthesizer
Livshin, Peeters & Rodet | N/A | N/A | 1325 sounds of 16 instruments from five sources: IRCAM SOL, UIowa, McGill, Prosonus, and Vitus collections

Feature Extraction
[Figure: an audio clip (a violin note) is divided into frames s1, s2, s3, ..., sn, ..., sM, which yield the temporal features. The DFT of each frame gives the spectra S1, S2, S3, ..., Sn, ..., SM, which yield the spectral features and cepstral features.]

Temporal Features
Frame features: amplitude and loudness
• Root Mean Square:
  \[ \mathrm{RMS}(n) = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1} s_n^2(i)} \]
• Short Time Energy:
  \[ \mathrm{STE}(n) = \frac{1}{N}\sum_{i=0}^{N-1} \left[ s_n(i)\, w(N-1-i) \right]^2 \]
  Here s_n(i) is the i-th sample of frame n, N is the frame length, and w is the window function.
Clip features
• Combination of frame features
• Mean and standard deviation of RMS

Spectral Features
Spectral Envelope
• The shape of the spectral envelope is closely related to timbre
• Spectral features can describe some of the spectral envelope
• [Figure: spectrum S_n(ω) plotted against frequency ω]

Spectral Moments
• First order moment / frequency centroid: center frequency weighted by squared amplitude
  \[ M_k(n) = \frac{\sum_{i=0}^{N} \omega_i^{k}\, S_n^2(i)}{\sum_{i=0}^{N} S_n^2(i)} \]
  Here S_n(i) is the amplitude of the i-th spectral component of frame n and \omega_i is its frequency.

Spectral Centroid Moments
• Weighted average difference between spectral components and frequency centroid
  \[ C_k(n) = \frac{\sum_{i=0}^{N} \left( \omega_i - M_1(n) \right)^{k} S_n^2(i)}{\sum_{i=0}^{N} S_n^2(i)} \]
• Bandwidth: square root of the second order centroid moment
• Skewness: third order centroid moment

Subband Energy and Subband Energy Ratio
• Analogous to frequency bands in human ears
• Represent the energy distribution of the spectrum
• [Figure: spectrum S_n(ω) divided into subbands labeled 1/8, 1/8, 1/4, 1/2]
  \[ E_j(n) = \log\left( \sum_{i=L_j}^{H_j} S_n(i) \right) \quad \text{and} \quad ER_j(n) = \frac{E_j(n)}{\sum_{k} E_k(n)} \]
  Here L_j and H_j are the lowest and highest spectral components of subband j.

Spectral Irregularity
• Represents the jaggedness of the spectral envelope
  \[ I(n) = \frac{\sum_{i=0}^{N-2} \left( S_n(i) - S_n(i+1) \right)^2}{\sum_{i=0}^{N-1} S_n^2(i)} \]

Formant Features
• Formant frequency and formant amplitude
• The position and amplitude of the first two formants are the most important
Pitch: fundamental frequency
Tristimulus
• The percentage of the low-order formants compared to the higher ones

Cepstral Coefficients
Source-filter Model
• Source signal → Filter → Output signal
• Source: periodic excitation of strings
• Filter: the resonator, the body of an instrument
• [Figure labels: white noise, excitation spectrum, filter spectrum, signal spectrum]
• The shape of the filter spectrum represents the spectral envelope
• How to extract the filter properties: cepstral coefficients

Mel-frequency Cepstral Coefficients
• Processing chain: signal s → preprocessing → frame separating → windowing → DFT → spectrum → mel-scaling → logarithm → DCT → cepstrum
• Mel scaling: the human auditory system perceives sound logarithmically
• Discrete Cosine Transform (DCT): the DCT is taken to separate the filter and excitation properties.
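The mel-frequency cepstral chain (framing, windowing, DFT, mel-scaling, logarithm, DCT) can be sketched with NumPy alone. This is a minimal illustration under assumptions, not the implementation behind the slides: the mel formula constants, the triangular filter shape, the unnormalised DCT-II, and every parameter value (`frame_len`, `hop`, `n_filters`, `n_coeffs`) are common conventions chosen here for the example, and the preprocessing step is omitted.

```python
import numpy as np

def hz_to_mel(f):
    # A widely used mel-scale approximation (assumed here).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale: (n_filters, n_fft//2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        left, center, right = bins[j], bins[j + 1], bins[j + 2]
        for i in range(left, center):       # rising edge of the triangle
            bank[j, i] = (i - left) / (center - left)
        for i in range(center, right):      # falling edge of the triangle
            bank[j, i] = (right - i) / (right - center)
    return bank

def dct_ii(x, n_coeffs):
    """Unnormalised DCT-II along the last axis, keeping the first n_coeffs terms."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate, frame_len=512, hop=256, n_filters=26, n_coeffs=13):
    """Return an array of shape (n_frames, n_coeffs) of cepstral coefficients."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)           # frame separating + windowing
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # DFT -> power spectrum
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    log_mel = np.log(power @ bank.T + 1e-10)               # mel-scaling + logarithm
    return dct_ii(log_mel, n_coeffs)                       # DCT -> low-order cepstrum

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 440.0 * t)   # synthetic test tone, not a real recording
    print(mfcc(tone, sr).shape)
```

Only the first `n_coeffs` outputs of the DCT are kept; the small constant added before the logarithm merely avoids taking the log of zero in empty bands.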
The low-order cepstrum is the compact representation of the filter.

Feature Evaluation

Feature Evaluation: Sample Collections
Evaluated by sample collections
• Build a recognition system for each evaluated feature or feature set
• Use sample collections to calculate the system performance
• Evaluate features by the system performance (accuracy)
Advantages
• Easy to carry out
• There are free sample collections

Feature Evaluation: Sample Collections
Disadvantages
• Diversity of the music: some sample collections do not have the same properties as other sample collections
• Using too few sample collections decreases the generality of the system:
  • The accuracy of these systems is skewed.
  • These systems do not satisfy the generality criterion.
• Since we do not know how many sample collections are needed, it may not be reliable to use incomplete sample collections to:
  • Evaluate a recognition system
  • Evaluate the effectiveness of a feature
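The generality concern can be made concrete with a small experiment: train on one collection, then compare accuracy on held-out data from the same collection with accuracy on a different collection. The sketch below is purely illustrative. The "collections" are synthetic 2-D feature vectors, the offset that makes the second collection differ is invented, and the nearest-centroid classifier is a stand-in chosen for brevity rather than a method taken from the slides.

```python
import numpy as np

def make_collection(rng, n_per_class, offset):
    """Synthetic stand-in for a sample collection: features for 3 instruments."""
    centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]) + offset
    feats = np.concatenate(
        [c + 0.5 * rng.standard_normal((n_per_class, 2)) for c in centers]
    )
    labels = np.repeat(np.arange(3), n_per_class)
    return feats, labels

def fit_centroids(feats, labels):
    """One mean feature vector per class, ordered by class label."""
    return np.stack([feats[labels == c].mean(axis=0) for c in np.unique(labels)])

def predict(centroids, feats):
    """Assign each feature vector to the class of its nearest centroid."""
    dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def accuracy(centroids, feats, labels):
    return float(np.mean(predict(centroids, feats) == labels))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_a = make_collection(rng, 50, offset=np.zeros(2))
    test_a = make_collection(rng, 50, offset=np.zeros(2))           # same collection
    test_b = make_collection(rng, 50, offset=np.array([3.0, 3.0]))  # different collection
    model = fit_centroids(*train_a)
    print("within-collection accuracy:", accuracy(model, *test_a))
    print("cross-collection accuracy:", accuracy(model, *test_b))
```

In this constructed example the within-collection score is much higher than the cross-collection score, which is the sense in which accuracy measured on a single collection can be skewed.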