Miracle

ACCELERATING A TWO-STAGE QUERY BY SINGING/HUMMING SYSTEM USING GPUS
Student: Andy Chuang (莊詠翔)
andy.chuang@mirlab.org
Advisors: Prof. Jyh-Shing Roger Jang (張智星), Prof. Jason S. Chang (張俊盛)
Department of Computer Science, National Tsing Hua University
Outline
‧ Introduction
‧ Related work
‧ System flow
‧ Methods
‧ Experimental results
‧ Conclusions and future work

Introduction
‧ QBSH (Query by Singing and Humming)
  ‧ Description: the user sings or hums, and the system returns the most similar song from the database
  ‧ Problem: the system usually takes too long to respond when the database is huge

‧ Strategies
  ‧ Implement a new linear scaling that matches the properties of the GPU
  ‧ Reduce the database to avoid unnecessary comparisons
  ‧ Combine linear scaling with dynamic time warping on multiple devices (GPUs) rather than one to speed up the computation

Related Work (1/2)
‧ MIRACLE (Music Information Retrieval Acoustically with Clustered and paralleL Engines)
  ‧ Jang, Chen, and Kao, “MIRACLE: A Music Information Retrieval System with Clustered Computing Engines”, ISMIR 2001.
‧ Linear scaling (LS)
  ‧ Wang, Chen, Kuo, Chiu, and Jang, “Accelerating Query by Singing/Humming on GPU: Optimization for Web Deployment”, ICASSP 2012.
  ‧ Lin, “Speeding Up Query-by-Singing/Humming Systems Based on Linear Scaling”, National Tsing Hua Univ., 2012.
Related Work (2/2)

‧ Dynamic time warping (DTW)
  ‧ Ferraro, Hanna, Imbert, and Izart, “Accelerating Query-by-Humming on GPU”, ISMIR 2009.
  ‧ Kuo, “Accelerating Query By Singing/Humming on GPU”, National Tsing Hua Univ., 2013.
‧ Hybrid LS+DTW
  ‧ Zou, “Query By Singing/Humming Using Combination of Classifiers”, National Tsing Hua Univ., 2008.
  ‧ Kao, “A Two-Stage Query by Singing/Humming System on GPU”, National Tsing Hua Univ., 2013.
System Flow
[Flowchart]
‧ User: sing/hum the song
‧ CPU: detect endpoints and preprocess audio → convert to frame-based data
‧ GPU: load database → perform linear scaling → reserve different amounts of candidate songs → perform dynamic time warping
‧ Post-process the ranking → show top-N song info.
Linear Scaling
‧ Perform key transposition before linear scaling
‧ Example: 十年 (陳奕迅)
[Figure: horizontal axis: Time]
LS Implementation Detail (1/3)

‧ Our research focuses on the “Compute distance” step
[Flowchart] Scale the input pitch vector to 31 versions of different sizes → put the input pitch vector into constant memory → compute distance → sort the results and return
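The scaling and distance steps can be illustrated on the CPU with a minimal Python sketch. Only the count of 31 scaled versions comes from the slide. The scaling range (0.5 to 2.0), the mean-shift key transposition, the mean absolute difference as the distance, and the function names `resample` and `linear_scaling_distance` are assumptions made for illustration.

```python
def resample(pitch, new_len):
    """Linearly interpolate `pitch` to `new_len` frames."""
    if new_len == 1 or len(pitch) == 1:
        return [float(pitch[0])] * new_len
    step = (len(pitch) - 1) / (new_len - 1)
    out = []
    for k in range(new_len):
        pos = k * step
        i = int(pos)
        if i >= len(pitch) - 1:
            out.append(float(pitch[-1]))
        else:
            frac = pos - i
            out.append(pitch[i] * (1 - frac) + pitch[i + 1] * frac)
    return out


def linear_scaling_distance(query, song, n_scales=31, lo=0.5, hi=2.0):
    """Smallest distance between any scaled query and the start of `song`.

    Assumptions: scaling factors are evenly spaced in [lo, hi]; key
    transposition shifts both vectors to zero mean; the distance is the
    mean absolute pitch difference.
    """
    best = float("inf")
    for k in range(n_scales):
        factor = lo + (hi - lo) * k / (n_scales - 1)
        n = round(len(query) * factor)
        if n < 2 or n > len(song):
            continue  # this scaled version does not fit the song
        scaled = resample(query, n)
        segment = song[:n]
        mean_q = sum(scaled) / n
        mean_s = sum(segment) / n
        dist = sum(abs((q - mean_q) - (s - mean_s))
                   for q, s in zip(scaled, segment)) / n
        best = min(best, dist)
    return best


query = [60, 62, 64, 65]
# A reference sung twice as slowly and 5 semitones higher.
song = [p + 5 for p in resample(query, 8)]
distance = linear_scaling_distance(query, song)
```

Here `distance` is close to zero because the factor 2.0 undoes the tempo change and the mean shift undoes the key change.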
LS Implementation Detail (2/3)
‧ Each block processes one song; each thread in the block processes a different segment of that song
‧ Example of a single block
  ‧ Block dimension = 64
  ‧ Segment size = 375
  ‧ Frame rate = 31.25

[Figure: thread-to-segment mapping over the pitch vector]
| Thread id | Segment (pitch vector indices) |
| 0 | 0 to 374 |
| 1 | 46 to 420 |
| 2 | 157 to 531 |
| ‧‧‧ | ‧‧‧ |
| 63 | 2034 to 2408 |
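The segment arithmetic in this example can be checked with a short Python sketch. The start indices are the ones shown on the slide; how they are chosen is not stated here, and reading the frame rate as frames per second is an assumption. The helper name `segment_bounds` is hypothetical.

```python
SEGMENT_SIZE = 375   # frames per segment (from the slide)
FRAME_RATE = 31.25   # assumed to be frames per second


def segment_bounds(start):
    """Inclusive first and last pitch index of the segment starting at `start`."""
    return start, start + SEGMENT_SIZE - 1


# Start indices shown on the slide for threads 0, 1, 2 and 63.
starts = {0: 0, 1: 46, 2: 157, 63: 2034}
segments = {tid: segment_bounds(s) for tid, s in starts.items()}

# Duration covered by one segment under the frames-per-second assumption.
segment_seconds = SEGMENT_SIZE / FRAME_RATE
```

Under that assumption each thread compares 12 seconds of the reference song, and neighbouring segments overlap heavily, which is why the same pitches are read by many threads.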
LS Implementation Detail (3/3)
[Figure: the database pitch vector stores Song 0, Song 1, ..., Song 999 back to back; Block 0, Block 1, ..., Block 999 each handle one song, and Thread 0 to Thread 63 in each block handle different segments of that song]
Implementation of LS - method 1

‧ Each thread copies its part of the database pitch vector from global memory to its own local memory, then reads pitches from local memory while computing
[Figure: in Block 0, Thread 0 copies pitches 0 to 374, Thread 1 copies 46 to 420, ..., and Thread 63 copies 2034 to 2408 from global memory into its local memory]
Implementation of LS - method 2

‧ Threads in the same block process the same song, so the pitch vector of that song is copied from global memory to shared memory once; each thread then reads pitches from shared memory when needed
[Figure: Block 0 copies the song's pitch vector from global memory into a shared pitch array; Thread 0, Thread 1, ..., Thread 63 read it starting at indices 0, 46, ..., 2034]
Comparison of the Two LS Methods
|  | Method 1 (using local memory) | Method 2 (using shared memory) |
| Advantage | Intuitive; easy to implement | Shared memory is faster (on-chip); the same data is copied from global memory only once |
| Drawback | Local memory is slower (off-chip); the same data is copied from global memory several times | Bank conflicts |
Bank Conflict (1/2)

‧ Bank conflict: threads in the same half-warp (shown in the same color in the figure) simultaneously access different addresses in the same bank of shared memory
[Figure: successive 4-byte words of shared memory are assigned to bank ids 0 to 31 in round-robin order; thread-to-bank mappings show one case with a bank conflict and two cases without]
Bank Conflict (2/2)

‧ Bank conflict between thread 2 and thread 5
| Thread id | Pitch vector index | Bank id |
| 0 | 0 | 0 |
| 1 | 46 | 14 |
| 2 | 157 | 29 |
| 3 | 315 | 27 |
| 4 | 482 | 2 |
| 5 | 573 | 29 (bank conflict) |
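The bank ids on this slide are consistent with taking the pitch index modulo 32. The Python sketch below checks that reading under two assumptions: one pitch per 4-byte word, and 32 banks. The helper names `bank_of` and `find_conflicts` are hypothetical.

```python
NUM_BANKS = 32  # assumption: 32 banks, one 4-byte pitch per word


def bank_of(word_index):
    """Bank that holds the given 4-byte word of shared memory."""
    return word_index % NUM_BANKS


def find_conflicts(addresses):
    """Map each conflicting bank to the thread ids that hit it.

    A conflict needs at least two different addresses in the same bank;
    several threads reading the same address do not count.
    """
    by_bank = {}
    for tid, addr in enumerate(addresses):
        by_bank.setdefault(bank_of(addr), []).append((tid, addr))
    conflicts = {}
    for bank, hits in by_bank.items():
        if len({addr for _, addr in hits}) > 1:
            conflicts[bank] = [tid for tid, _ in hits]
    return conflicts


# Segment start indices of threads 0 to 5 from the slide.
starts = [0, 46, 157, 315, 482, 573]
banks = [bank_of(s) for s in starts]
conflicts = find_conflicts(starts)
```

With these inputs, threads 2 and 5 both land in bank 29, matching the conflict marked on the slide.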
Implementation of LS - method 3

‧ Append pitches after some notes to shift the banks
| Thread id | Pitch vector index | Bank id |
| 0 | 0 | 0 |
| 1 | 46 | 14 |
| 2 | 157 | 29 |
| 3 | 315 | 27 |
| 4 | 482 | 2 |
| 5 | 574 | 30 (no bank conflict) |

‧ Trade-off between performance and recognition rate
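One way to picture the bank shift is a greedy padding pass, sketched below in Python. The slide only says that pitches are appended after some notes; the greedy rule, the 32-bank layout, the grouping of 32 threads, and the name `pad_starts` are assumptions for illustration, not the method used in the system.

```python
NUM_BANKS = 32  # assumption: 32 banks, one 4-byte pitch per word


def pad_starts(starts):
    """Shift segment starts so threads in one group hit distinct banks.

    Each inserted padding pitch moves the current start and every later
    start one word to the right. Threads are handled in groups of
    NUM_BANKS, so a free bank always exists within a group.
    Returns the shifted starts and the number of padding pitches.
    """
    padded = []
    used = set()
    shift = 0
    for tid, start in enumerate(starts):
        if tid % NUM_BANKS == 0:
            used = set()  # new group of threads, banks are free again
        pos = start + shift
        while pos % NUM_BANKS in used:
            pos += 1
            shift += 1
        used.add(pos % NUM_BANKS)
        padded.append(pos)
    return padded, shift


# Segment start indices of threads 0 to 5 from the slide.
padded, n_padding = pad_starts([0, 46, 157, 315, 482, 573])
```

For the slide's example a single padding pitch moves thread 5 from index 573 (bank 29) to 574 (bank 30). The padding changes the stored melody slightly, which is one plausible source of the trade-off with recognition rate noted above.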
Database Reduction (1/3)
‧ Remove unnecessary comparisons caused by duplicate copies of the same song
  ‧ Saves 6.2% of computing time (1265 of 20395 songs removed)

[Flowchart]
‧ First stage
  ‧ Song length difference is less than 5 seconds
  ‧ Frame-based pitch difference (after one-shot key transposition) is less than 2 semitones
‧ Do the songs have the same song name and singer?
  ‧ Yes → Second stage A: definitely the same song; 832 songs must be removed
  ‧ No → Second stage B: several cases; 433 songs must be removed
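The first-stage test can be sketched in Python as follows. The two thresholds come from the slide. Requiring both conditions at once, comparing only the overlapping frames, using a mean shift as the one-shot key transposition, and the frame rate of 31.25 frames per second are assumptions; `is_candidate_duplicate` is a hypothetical name.

```python
FRAME_RATE = 31.25  # assumption: frames per second


def mean(values):
    return sum(values) / len(values)


def is_candidate_duplicate(pitch_a, pitch_b,
                           max_len_diff_sec=5.0, max_pitch_diff=2.0):
    """First-stage check: similar length and similar pitch contour."""
    len_diff_sec = abs(len(pitch_a) - len(pitch_b)) / FRAME_RATE
    if len_diff_sec >= max_len_diff_sec:
        return False
    n = min(len(pitch_a), len(pitch_b))
    a, b = pitch_a[:n], pitch_b[:n]
    # One-shot key transposition (assumed): align the mean pitches.
    shift = mean(a) - mean(b)
    avg_diff = mean([abs(x - (y + shift)) for x, y in zip(a, b)])
    return avg_diff < max_pitch_diff


song = [60, 62, 64, 65] * 50           # 200 frames
transposed = [p + 3 for p in song]     # same melody, 3 semitones higher
longer = song + [60] * 200             # 6.4 seconds longer
other = [60, 72] * 100                 # different melody
```

Here `is_candidate_duplicate(song, transposed)` is true, while the longer and the different song are rejected.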
Database Reduction (2/3)
‧ Cases in which the first stage flags songs with a different song name or singer
| Case | Example 1 | Example 2 |
| Song name error | A Day In The Life | Day In The Life |
| Singer error | Procal Harum | Procol Harum |
| Abbreviation | 1st Of 5th | First Of Fifth |
| Cover | 想你想斷腸 | 補破網 |

‧ We deal with the “Song name error”, “Singer error”, and “Abbreviation” cases
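Database Reduction (3/3)
‧ Remove unnecessary comparisons caused by repeating patterns
  ‧ Reduces the database size by 22.01%
  ‧ Uses the correlative matrix T

For the first row:
$$T_{1,j} = \begin{cases} 1, & \text{if } S_1 = S_j \\ 0, & \text{otherwise} \end{cases}$$
For the other rows:
$$T_{i,j} = \begin{cases} T_{i-1,j-1} + 1, & \text{if } S_i = S_j \\ 0, & \text{otherwise} \end{cases}$$
where T is the correlative matrix, S is the pitch vector, and i, j are the matrix indices.

Example for S = (72, 68, 72, 68, 72, 60); X marks the diagonal and only the upper triangle is filled:
|    | 72 | 68 | 72 | 68 | 72 | 60 |
| 72 | X | 0 | 1 | 0 | 1 | 0 |
| 68 |   | X | 0 | 2 | 0 | 0 |
| 72 |   |   | X | 0 | 3 | 0 |
| 68 |   |   |   | X | 0 | 0 |
| 72 |   |   |   |   | X | 0 |
| 60 |   |   |   |   |   | X |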
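The recurrence for T can be written directly as a small Python sketch. It fills only the upper triangle and leaves the diagonal at 0 where the slide shows X; the function name `correlative_matrix` is hypothetical.

```python
def correlative_matrix(S):
    """Upper-triangular correlative matrix of the pitch vector S.

    T[i][j] (for j > i) is the length of the run of matching pitches
    that ends at positions i and j.
    """
    n = len(S)
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if S[i] == S[j]:
                previous = T[i - 1][j - 1] if i > 0 else 0
                T[i][j] = previous + 1
    return T


T = correlative_matrix([72, 68, 72, 68, 72, 60])
longest_repeat = max(max(row) for row in T)
```

The largest entry, 3, marks the longest repeated run in the example (72 68 72), which is the kind of entry a repeating-pattern search would look for.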
Multiple Devices
‧ Each device receives 1/n of the database and computes the similarity between its share and the query input, where n is the number of devices
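The 1/n split can be illustrated with a host-side Python sketch. The round-robin partition, the merge of per-device results by distance, and the names `split_database` and `merge_top_n` are assumptions; the slides do not say how the database is divided or how results are combined.

```python
import heapq


def split_database(song_ids, n_devices):
    """Give each device roughly 1/n of the songs (round-robin, assumed)."""
    return [song_ids[d::n_devices] for d in range(n_devices)]


def merge_top_n(per_device_results, top_n):
    """Merge (distance, song_id) lists from all devices, smallest first."""
    all_pairs = (pair for result in per_device_results for pair in result)
    return heapq.nsmallest(top_n, all_pairs)


parts = split_database(list(range(10)), 3)
merged = merge_top_n(
    [[(0.5, "a"), (2.0, "b")], [(1.0, "c")], [(0.1, "d")]],
    top_n=2,
)
```

Each device then searches only its own part, and the host keeps the best candidates across all devices.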
Experimental Environment (1/2)
‧ Experimental environment (NCHC Formosa 5 Cluster)
| OS | CentOS 6 x86_64 |
| RAM | 96 GB DDR3 ECC |
| CPU | Intel Xeon X5670, six cores, 2.93 GHz, x2 |
| GPU | NVIDIA Tesla M2070 x3 (448 cores) |
| Language | C/C++ |
| CUDA version | 5.5 |
| Compute capability | 2.0 |
Experimental Data
‧ Corpus information
| Corpus | NTHU & NTU student recordings in 2013; CHT Pop Song |
| Format | WAV, 16 kHz, 16 bits, mono |
| File amount | 2183 (1614 from NTU, 318 from NTHU, and 251 from CHT) |
| WAV file length | 8-9 seconds |

‧ Database
  ‧ 19130 songs (after reduction)
Experiment 1 – LS with Different Shared Pitch Sizes & Block Dimensions Using Method 2
‧ The shortest computation time, 1.427 sec., is obtained when the pitch vector size in shared memory is 2000 and the block dimension is 128.
[Figure: computation time (sec.) versus shared pitch size (K) for block dimensions 64, 128, 256, 512, and 1024; device: 3]
Experiment 2 – LS with Different Shared Pitch Sizes & Block Dimensions Using Method 3
‧ The shortest computation time, 1.213 sec., is obtained when the pitch vector size in shared memory is 5000 and the block dimension is 512.
[Figure: computation time (sec.) versus shared pitch size (K) for block dimensions 64, 128, 256, 512, and 1024; device: 3]
Experiment 3 – LS with Different Numbers of Threads Using the Three Methods
‧ For each block dimension, the best case of LS methods 2 and 3 (with a suitable shared pitch size) is faster than method 1.
[Figure: computation time per song (sec.) versus LS block dimension (number of threads per block: 64, 128, 256, 512, 1024) for LS 1, LS 2, and LS 3; device: 3]
Experiment 4 – LS+DTW with Different Numbers of Devices
‧ With more devices the computation time drops, since each GPU only needs to search about 1/n of the database.
[Figure: computation time per song (sec.) versus number of devices]
| Number of devices | LS 1 (block dimension: 128) | LS 2 (block dimension: 128, shared pitch size: 2000) | LS 3 (block dimension: 512, shared pitch size: 5000) |
| 1 | 6.76 | 5.72 | 5.13 |
| 2 | 3.52 | 3.08 | 2.69 |
| 3 | 2.41 | 2.12 | 1.84 |
Conclusions and Future Work (to be completed)
‧ Conclusions
  ‧ Computation time
    ‧ LS method 2 is faster than method 1, even though bank conflicts exist
    ‧ With n devices (GPUs), the computation time is almost 1/n of the single-device time

‧ Future work
  ‧ Advanced database purification to remove bad songs
    ‧ Abnormal melody (e.g., instrumental only)
    ‧ Wrong melody or song name
  ‧ Improve LS method 2 to reduce bank conflicts on the Kepler architecture
    ‧ Kepler defines bank conflicts differently from the Tesla and Fermi architectures
    ‧ Use a different method for appending pitches

Thank you!!
&
DEMO (http://miracle.mirlab.org:8080/miracle)
Methods: Dynamic Time Warping
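‧ t: input pitch vector
‧ r: reference pitch vector
‧ Local paths: 27-45-63 degrees
‧ DTW recurrence:
$$D(i,j) = |t(i) - r(j)| + \min\{D(i-2,j-1),\ D(i-1,j-1),\ D(i-1,j-2)\}$$
[Figure: local paths ending at cell (i, j), with t(i-1), t(i) along the i axis and r(j-1), r(j) along the j axis]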
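A CPU-side Python sketch of DTW with 27-45-63 degree local paths is given below. The three predecessors (i-2, j-1), (i-1, j-1), and (i-1, j-2) follow from the path angles. The boundary conditions (path starts at the first frame of both vectors, ends at the last frame of t and anywhere in r) and the name `dtw_27_45_63` are assumptions; key transposition is left out.

```python
import math


def dtw_27_45_63(t, r):
    """DTW distance between input t and reference r.

    Assumptions: the path starts at (0, 0), must consume all of t,
    and may end at any frame of r.
    """
    n, m = len(t), len(r)
    D = [[math.inf] * m for _ in range(n)]
    D[0][0] = abs(t[0] - r[0])
    for i in range(1, n):
        for j in range(1, m):
            best = D[i - 1][j - 1]                 # 45-degree path
            if i >= 2:
                best = min(best, D[i - 2][j - 1])  # 27-degree path
            if j >= 2:
                best = min(best, D[i - 1][j - 2])  # 63-degree path
            if best < math.inf:
                D[i][j] = abs(t[i] - r[j]) + best
    return min(D[n - 1])


same = dtw_27_45_63([60, 62, 64], [60, 62, 64])
stretched = dtw_27_45_63([60, 60, 62], [60, 62])
```

Both results are 0: the second query holds the first note longer, and the 27-degree path absorbs that local tempo change. Because each step advances both indices, some length combinations are unreachable and return infinity.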
Note & Pitch Data Charts
‧ Average number of notes: 913.32, σ: 450.99
‧ Average number of pitch data: 5849.94, σ: 1878.42

Note & Pitch Data Charts (After Repeating Pattern Removal)
‧ Average number of notes: 694.7, σ: 386
‧ Average number of pitch data: 4562.54, σ: 1773.6

Detail of Second-stage B
| Song name | Singer | Handling |
| Different | Same | Manually check |
| Different | Unknown | 1. Compute the edit distance between the two song names. 2. Check the songs whose edit distance is smaller than the threshold. |
| Same | Different | Manually check |
| Same | Unknown | Remove the data with unknown singer |
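The edit-distance step can be sketched with the standard Levenshtein distance in Python. The slides do not say which edit distance or threshold is used, so the unit-cost Levenshtein variant, the threshold value of 3, and the names `edit_distance` and `needs_check` are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


THRESHOLD = 3  # hypothetical value, not taken from the slides


def needs_check(name_a, name_b):
    """Flag a pair of song names whose edit distance is below the threshold."""
    return edit_distance(name_a, name_b) < THRESHOLD


song_name_case = edit_distance("A Day In The Life", "Day In The Life")
singer_case = edit_distance("Procal Harum", "Procol Harum")
```

The "Song name error" and "Singer error" examples from the earlier slide differ by 2 and 1 edits, so a small threshold catches them, while an abbreviation such as "1st Of 5th" versus "First Of Fifth" needs more edits.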
System Flow with Multiple Devices (to be completed)
[Flowchart]
‧ User: sing/hum the song
‧ CPU: detect endpoints and preprocess audio → convert to frame-based data
‧ GPU: load database → perform linear scaling → reserve different amounts of candidate songs → perform dynamic time warping
‧ Post-process the ranking → show top-N song info.
The following are 高瑋's backup slides
Method: Borda Count
‧ Borda count:
$$D_k = \sum_{i=1}^{M} (R - r_{ik})$$
  ‧ R: number of candidate songs
  ‧ M: number of melody recognition methods
  ‧ r_{ik}: the rank of the k-th candidate song in the i-th melody recognition method

‧ Borda count example
| Song | Rank 1: ABCD (score) | Rank 2: DCAB (score) | D_k (total score) | Rank |
| A | 3 | 1 | 4 | 1 |
| B | 2 | 0 | 2 | 4 |
| C | 1 | 2 | 3 | 2 |
| D | 0 | 3 | 3 | 2 |
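The Borda count example can be reproduced with a few lines of Python. The sketch assumes 1-based ranks, as implied by the scores in the example, and that tied songs share the better rank; the function names are hypothetical.

```python
def borda_scores(rankings):
    """Sum of (R - rank) over all rankings, with 1-based ranks."""
    R = len(rankings[0])  # number of candidate songs
    scores = {}
    for ranking in rankings:
        for position, song in enumerate(ranking, start=1):
            scores[song] = scores.get(song, 0) + (R - position)
    return scores


def final_ranks(scores):
    """Rank songs by total score; ties share the better rank."""
    return {song: 1 + sum(other > score for other in scores.values())
            for song, score in scores.items()}


scores = borda_scores([list("ABCD"), list("DCAB")])
ranks = final_ranks(scores)
```

The totals and final ranks match the example above: A wins with 4 points, and C and D tie for second place with 3 points each.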
Methods: Comparison of Two Methods
| Type | Linear scaling (LS) | Dynamic time warping (DTW) |
| Computation time | Faster | Slower |
| Tempo variation | Handles uniform tempo variation | Handles non-uniform tempo variation |
| Key transposition | One-shot | Heuristic search |

‧ The proposed system combines linear scaling with dynamic time warping to accelerate computation.