Accelerating a Two-Stage Query by Singing/Humming System Using GPUs
(Chinese title: 加速以GPU為運算核心的二階段哼唱選歌系統)
Student: Andy Chuang (莊詠翔), andy.chuang@mirlab.org
Advisors: Prof. Jyh-Shing Roger Jang (張智星), Prof. Jason S. Chang (張俊盛)
Department of Computer Science, National Tsing Hua University

Outline
‧ Introduction
‧ Related work
‧ System flow
‧ Methods
‧ Experimental results
‧ Conclusions and future work

Introduction
‧ QBSH (Query by Singing and Humming)
  ‧ Description: the user sings or hums, and the system returns the most similar song from the database
  ‧ Problem: the system usually takes too long to respond when the database is huge
‧ Strategies
  ‧ Implement a new linear scaling that matches the properties of the GPU
  ‧ Reduce the database to avoid unnecessary comparisons
  ‧ Combine linear scaling with dynamic time warping on multiple devices (GPUs) rather than one to speed up the computation

Related Work (1/2)
‧ MIRACLE (Music Information Retrieval Acoustically with Clustered and paralleL Engines)
  ‧ Jang, Chen, and Kao, “MIRACLE: A Music Information Retrieval System with Clustered Computing Engines”, ISMIR 2001.
‧ Linear scaling (LS)
  ‧ Wang, Chen, Kuo, Chiu, and Jang, “Accelerating Query by Singing/Humming on GPU: Optimization for Web Deployment”, ICASSP 2012.
  ‧ Lin, “Speeding Up Query-by-Singing/Humming Systems Based on Linear Scaling”, National Tsing Hua Univ., 2012.

Related Work (2/2)
‧ Dynamic time warping (DTW)
  ‧ Ferraro, Hanna, Imbert, and Izart, “Accelerating Query-by-Humming on GPU”, ISMIR 2009.
  ‧ Kuo, “Accelerating Query By Singing/Humming on GPU”, National Tsing Hua Univ., 2013.
‧ Hybrid LS+DTW
  ‧ Zou, “Query By Singing/Humming Using Combination of Classifiers”, National Tsing Hua Univ., 2008.
  ‧ Kao, “A Two-Stage Query by Singing/Humming System on GPU”, National Tsing Hua Univ., 2013.

System Flow
‧ User: sing/hum the song
‧ CPU: detect endpoints and preprocess audio; convert to frame-based data
‧ GPU: perform linear scaling; reserve different numbers of candidate songs
‧ Load database
‧ Show top-N song info.
‧ Post-process the ranking
‧ Perform dynamic time warping

Linear Scaling
‧ Perform key transposition before applying linear scaling
‧ Example: 十年, 陳奕迅
[Figure: plot for the example song, x-axis: time]

LS Implementation Detail (1/3)
‧ Our research focuses on the “Compute distance” part
‧ Steps:
  1. Scale the input pitch vector to 31 versions of different sizes
  2. Put the input pitch vector into constant memory
  3. Compute distance
  4. Sort the results and return

LS Implementation Detail (2/3)
‧ Each block processes one song; each thread in a block processes a different segment of that song
‧ An example of a single block
  ‧ Block dimension = 64
  ‧ Segment size = 375
  ‧ Frame rate = 31.25
Thread id    Segment of the pitch vector
0            0 ‧‧‧ 374
1            46 ‧‧‧ 420
2            157 ‧‧‧ 531
‧‧‧
63           2034 ‧‧‧ 2408

LS Implementation Detail (3/3)
[Figure: the database pitch vector holds Song 0, Song 1, ‧‧‧, Song 999; Block 0, Block 1, ‧‧‧, Block 999 each take one song, and Thread 0 to Thread 63 of each block take different segments of that song]

Implementation of LS - Method 1
‧ Each thread copies its part of the database pitch vector from global memory to its own local memory, then reads pitches from local memory while computing
[Figure: in Block 0, Thread 0 copies pitches 0 ‧‧‧ 374, Thread 1 copies 46 ‧‧‧ 420, ‧‧‧, and Thread 63 copies 2034 ‧‧‧ 2408 from global memory into its local memory]

Implementation of LS - Method 2
‧ Threads in the same block process the same song, so the pitch vector of that song is copied from global memory to shared memory, and each thread reads pitches from shared memory when needed
[Figure: Block 0 copies the song's pitch vector from global memory into a shared pitch array in shared memory; Thread 0, Thread 1, ‧‧‧, Thread 63 all read from that array]

Comparison of the Two LS Methods
Method 1 (using local memory)
Advantage:
‧ Intuitive
‧ Easy to implement
Drawback:
‧ Local memory is slower (off-chip)
‧ The same data must be copied from global memory several times
Method 2 (using shared memory)
Advantage:
‧ Shared memory is faster (on-chip)
‧ Only need to copy the same data
from global memory once
Drawback of Method 2 (using shared memory) relative to Method 1 (using local memory):
‧ Bank conflict

Bank Conflict (1/2)
‧ Bank conflict: threads in the same half-warp access the same bank of shared memory, but at different addresses, simultaneously
[Figure: successive 4-byte data words are assigned to successive bank ids; threads in the same half-warp share a color. Three access patterns are shown: one with a bank conflict and two with no bank conflict]

Bank Conflict (2/2)
‧ Bank conflict between thread 2 and thread 5
Thread id    Pitch vector index    Bank id
0            0                     0
1            46                    14
2            157                   29
3            315                   27
4            482                   2
5            573                   29

Implementation of LS - Method 3
‧ Append pitches after some notes to shift banks
Thread id    Pitch vector index    Bank id
0            0                     0
1            46                    14
2            157                   29
3            315                   27
4            482                   2
5            574                   30
‧ No bank conflict
‧ Trade-off between performance and recognition rate

Database Reduction (1/3)
‧ Remove unnecessary comparisons caused by duplicates of the same song
‧ Saves 6.2% of computing time (1265 of 20395 songs removed)
First stage
‧ Song length difference is less than 5 seconds
‧ Pitch difference (after one-shot key transposition) is less than 2 semitones per frame (frame-based pitch data)
Decision: do the two entries have the same song name and singer? (Yes or No)
Yes → Second-stage A
‧ Definitely the same
‧ 832 songs must be removed
No → Second-stage B
‧ Several cases
‧ 433 songs must be removed

Database Reduction (2/3)
‧ Cases where the first-stage result has a different song name or singer
Case               Example 1            Example 2
Song name error    A Day In The Life    Day In The Life
Singer error       Procal Harum         Procol Harum
Abbreviation       1st Of 5th           First Of Fifth
Cover              想你想斷腸           補破網
‧ We deal with the “Song name error”, “Singer error” and “Abbreviation” cases

Database Reduction (3/3)
‧ Remove unnecessary comparisons caused by repeating patterns
‧ Saves 22.01% of the database size
‧ Uses a correlative matrix T
For the first row:
$T_{1,j} = \begin{cases} 1, & \text{if } S_1 = S_j \\ 0, & \text{otherwise} \end{cases}$
For the other rows:
$T_{i,j} = \begin{cases} T_{i-1,j-1} + 1, & \text{if } S_i = S_j \\ 0, & \text{otherwise} \end{cases}$
where T is the correlative matrix, S is the pitch vector, and i, j are the indices of the matrix.
Example for S = (72, 68, 72, 68, 72, 60), upper triangle only (X marks the diagonal):
      72  68  72  68  72  60
72     X   0   1   0   1   0
68         X   0   2   0   0
72             X   0   3   0
68                 X   0   0
72                     X   0
60                         X

Multiple Devices
‧ Each device is assigned 1/n of the database and computes the similarity between that part and the query input, where n is the number of devices

Experimental Environment (1/2)
Experimental environment (NCHC Formosa 5 Cluster)
‧ OS: CentOS 6 x86_64
‧ RAM: 96 GB DDR3 ECC
‧ CPU: Intel Xeon X5670, six cores, 2.93 GHz, x2
‧ GPU: NVIDIA Tesla M2070 (448 cores) x3
‧ Language: C/C++
‧ CUDA version: 5.5
‧ Compute capability: 2.0

Experimental Data
‧ Corpus information: NTHU & NTU students' recordings in 2013; CHT Pop Song corpus
‧ Format: WAV, 16 kHz, 16 bits, mono
‧ File amount: 2183 (1614 from NTU, 318 from NTHU and 251 from CHT)
‧ WAV file length: 8-9 seconds
‧ Database: 19130 songs (after reduction)

Experiment 1 – LS with Different Shared Pitch Sizes & Block Dimensions Using Method 2
‧ When the pitch vector size in shared memory is 2000 and the block dimension is 128, the computation time is 1.427 seconds, the shortest.
[Figure: computation time (sec.) versus shared pitch size (K, from 2 to 11), one series per block dimension (64, 128, 256, 512, 1024), device: 3]

Experiment 2 – LS with Different Shared Pitch Sizes & Block Dimensions Using Method 3
‧ When the pitch vector size in shared memory is 5000 and the block dimension is 512, the computation time is 1.213 seconds, the shortest.
[Figure: computation time (sec.) versus shared pitch size (K, from 2 to 11), one series per block dimension (64, 128, 256, 512, 1024), device: 3]

Experiment 3 – LS with Different Numbers of Threads Using the Three Methods
‧ For each block dimension, the best case of LS methods 2 and 3 (with a certain shared pitch size) is faster than method 1.
[Figure: computation time per song (sec.) versus LS block dimension (# of threads per block: 64, 128, 256, 512, 1024) for LS 1, LS 2 and LS 3, device: 3; data labels in the chart: 2000, 11000, 2000, 3000, 5000]

Experiment 4 – LS+DTW with Different Numbers of Devices
‧ With more devices the computation time drops, since each GPU only needs to search about 1/n of the database.
Computation time per song (sec.) by number of devices:
Method                                                 1 device    2 devices    3 devices
LS 1, block dimension: 128                             6.76        3.52         2.41
LS 2, block dimension: 128, shared pitch size: 2000    5.72        3.08         2.12
LS 3, block dimension: 512, shared pitch size: 5000    5.13        2.69         1.84

Conclusions and Future Work (to be completed)
‧ Conclusions
  ‧ Computation time: LS method 2 is faster than method 1, even though bank conflicts exist
  ‧ With n devices (GPUs), the computation time is almost 1/n of the single-device time
‧ Future work
  ‧ Advanced database purification to remove bad songs
    ‧ Abnormal melody (e.g., instrumental only)
    ‧ Wrong melody or song name
  ‧ Improve LS method 2 to reduce bank conflicts on the Kepler architecture
    ‧ Its definition of bank conflict differs from that of the Tesla and Fermi architectures
    ‧ Use a different method for appending pitches

Thank you!! & DEMO (http://miracle.mirlab.org:8080/miracle)

Methods: Dynamic Time Warping
‧ t: input pitch vector
‧ r: reference pitch vector
‧ Local paths: 27-45-63 degrees
‧ DTW recurrence:
$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2),\ D(i-1,j-1),\ D(i-2,j-1)\}$
[Figure: DTW grid with t(i-1), t(i) on the i axis and r(j-1), r(j) on the j axis]

Note & Pitch Data Charts
‧ Average number of notes: 913.32, σ: 450.99
‧ Average number of pitch data: 5849.94, σ: 1878.42

Note & Pitch Data Charts (After Repeating Pattern Removal)
‧ Average number of notes: 694.7, σ: 386
‧ Average number of pitch data: 4562.54, σ: 1773.6

Detail of Second-stage B
Song name    Singer       Handling
Different    Same         Manually check
Different    Unknown      1. Compute the edit distance between the two song names; 2. Check the songs whose edit distance is smaller than the threshold
Same         Different    Manually check
Same         Unknown      Remove the data with the unknown singer

System Flow with Multiple Devices (to be completed)
‧ User: sing/hum the song
‧ CPU: detect endpoints and preprocess audio; convert to frame-based data
‧ GPU: perform linear scaling; reserve different numbers of candidate songs
‧ Load database
‧ Show top-N song info.
‧ Post-process the ranking
‧ Perform dynamic time warping

The following are 高瑋's backup slides

Method: Borda Count
‧ Borda count: $D_k = \sum_{i=1}^{M} (R - r_{ik})$
  ‧ R: # of candidate songs
  ‧ M: # of melody recognition methods
  ‧ r_ik: the rank of the k-th song in the i-th melody recognition method
‧ Borda count example
Song    Rank 1-ABCD (score)    Rank 2-DCAB (score)    D_k (total score)    Rank
A       3                      1                      4                    1
B       2                      0                      2                    4
C       1                      2                      3                    2
D       0                      3                      3                    2

Methods: Comparison of Two Methods
Type                 Linear scaling (LS)                   Dynamic time warping (DTW)
Computation time     Faster                                Slower
Tempo variation      Deals with uniform tempo variation    Deals with non-uniform tempo variation
Key transposition    One-shot                              Heuristic search
‧ The proposed system combines linear scaling with dynamic time warping to accelerate computation.
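
Borda Count: a minimal sketch
As a rough illustration of the Borda count slide above, the following host-side C++ sketch sums R - r_ik over all rankings. The function name bordaCount is hypothetical, and the sketch assumes that every ranking lists the same R candidates, best first; neither detail comes from the slides.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: Borda count over several ranked candidate lists.
// Assumption: every ranking lists the same R candidates, best first.
// A candidate at 1-based rank r in one ranking earns R - r points,
// so its total is D_k = sum over rankings i of (R - r_ik).
std::map<std::string, int> bordaCount(
    const std::vector<std::vector<std::string>>& rankings) {
    std::map<std::string, int> total;
    for (const auto& ranking : rankings) {
        // All rankings must have the same number of candidates.
        assert(ranking.size() == rankings.front().size());
        const int R = static_cast<int>(ranking.size());
        for (int pos = 0; pos < R; ++pos) {
            // pos is 0-based, so the 1-based rank is pos + 1.
            total[ranking[pos]] += R - (pos + 1);
        }
    }
    return total;
}
```

Called with the two rankings from the slide's example, ABCD and DCAB, this gives A = 4, B = 2, C = 3 and D = 3, which matches the total-score column of the example table.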