Massively Parallel LDPC Decoding on GPU

Vivek Tulsidas Bhat
Priyank Gupta
"Workload Partitioning"

Priyank
 Motivation and LDPC introduction
 Analysis of the sequential algorithm and build-up to the parallelization strategy
 Lessons Learned: Part 1

Vivek
 Parallelization strategy
 Results and discussion
 Lessons Learned: Part 2
 Conclusion
Motivation

 FEC codes are used extensively in various applications to ensure reliable communication.
 Current application trends demand increased data rates.
 To approach the Shannon limit, low-complexity encoders/decoders are necessary.
 Enter LDPC: Low-Density Parity-Check codes.
LDPC: Quick Overview

 Iterative approach.
 Inherently data-parallel.
 Computationally expensive.
 Therefore, a perfect candidate for parallelization.
Our Initial Approach
Parallel Code Flow

1. Likelihood Ratio Initialization
2. Probability Ratio Initialization
3. Likelihood Ratio Recomputation
4. Probability Ratio Recomputation
5. Next Guess Calculation
6. If a codeword is found or the maximum iteration count is reached, report results; otherwise repeat from step 3.
Analysis of Sequential Code
Sparse Matrix Representation

typedef struct mod2entry   /* Structure representing a non-zero entry, or
                              the header for a row or column */
{ int row, col;            /* Row and column indexes */
  struct mod2entry *left, *right,  /* Pointers to adjacent entries in row */
                   *up, *down;     /* and column, or to headers. Free */
                                   /* entries are linked by 'left'. */
  double pr, lr;           /* Probability and likelihood ratios - not used */
                           /* by the mod2sparse module itself */
} mod2entry;

typedef struct             /* Representation of a sparse matrix */
{ int n_rows;              /* Number of rows in the matrix */
  int n_cols;              /* Number of columns in the matrix */
  mod2entry *rows;         /* Ptr to array of row headers */
  mod2entry *cols;         /* Ptr to array of column headers */
  mod2block *blocks;       /* Allocated blocks */
  mod2entry *next_free;    /* Next free entry */
} mod2sparse;
Likelihood Ratio Computation

[Figure: example parity-check matrix; the entries of one row are traversed forward and in reverse]

LR_estimator = 1 (initial)

Forward transition (left to right along a row):
  element_LR(n)     = LR_estimator(n)
  LR_estimator(n+1) = LR_estimator(n) * (2/element_PR(n+1) - 1)

Reverse transition (right to left along a row):
  temp              = element_LR(n) * LR_estimator(n)
  element_LR(n)     = (1 - temp) / (1 + temp)
  LR_estimator(n-1) = LR_estimator(n) * (2/element_PR(n-1) - 1)
Probability Ratio Computation

[Figure: example parity-check matrix; the entries of one column are traversed top-down and bottom-up]

PR_estimator(n) = Likelihood_Ratio(n) (initial)

Top-down transition (down a column):
  element_PR(n)     = PR_estimator(n)
  PR_estimator(n+1) = PR_estimator(n) * element_LR(n)

Bottom-up transition (up a column):
  element_PR(n)     = element_PR(n) * PR_estimator(n)
  PR_estimator(n-1) = PR_estimator(n) * element_LR(n)
Lessons Learned: Part 1

"Entities must not be multiplied beyond necessity."
Parallelization Strategy

Transformation: decode many codewords at once. Codewords i-2, i-1, i, i+1, i+2, ... are processed in parallel through the same flow:

1. Likelihood Ratio Computation
2. Probability Ratio Recomputation
3. Next Guess Calculation
4. If a codeword is found or the maximum iteration count is reached, report results; otherwise repeat.

Use 1-D arrays:
 BSC channel data (N M-bit codewords read at a time)
 BSC data array with N codewords aligned
 Likelihood ratios for all MN bits
 Bit probabilities for MN bits
 Decoded blocks (N M-bit codewords)

Each thread does the computation for one bit. So for N M-bit codewords, we need MN threads for the likelihood-ratio, probability-ratio, and decoded-block computations.
Likelihood Ratio Computation: Revisited

[Figure: likelihood ratio estimator over an example parity-check matrix; panels show forward estimation and reverse estimation]

The likelihood ratio estimator calculation for forward and reverse estimation is done on the host before the launch of the likelihood ratio kernel.

Note: the illustration is for just one codeword; this is done for N codewords at a time.
Probability Ratio Computation: Revisited

[Figure: probability ratio estimator over an example parity-check matrix; panels show the top-down and bottom-up transitions]

Likewise for the probability ratio computation, only this time the operations are done on a column basis.
Salient Features of Our Implementation

 Efficient sparse-matrix representation of the standard parity-check matrix.
 Simple mathematical model for likelihood-ratio and probability-ratio computation.
 Dedicated data structures for the likelihood-ratio and probability-ratio kernels.
 Code is easily customizable for different code rates.
 Supports larger numbers of codewords without any major change to the program architecture.
Experimental Setup

                            CPU                GPU1                    GPU2
Platform                    Intel Core 2 Duo   NVidia GeForce 8400 GS  NVidia GeForce GT120
Clock Speed (Memory Clock)  2.6 GHz            900 MHz                 500 MHz
Memory                      4 GB               512 MB                  512 MB
CUDA Toolkit Version        -NA-               2.3                     2.2
Programming Environment     Linux              Visual Studio           Linux
Results (1/3)

 Tested extensively for a code rate of (3,7) on a BSC channel with error probability 0.05.
 Optimal execution configuration: numThreadsPerBlock = 256 and numBlocks = 7 * mul_factor, where mul_factor is evaluated from the number of codewords to be decoded:
   mul_factor = num_codewords / numThreadsPerBlock
 Bit error rate is evaluated as the percentage of bits changed with respect to the original source file.
Results (2/3): Software Execution Time

[Figure: Execution Time (sec, 0 to 12) vs Codewords (0 to 300,000), comparing GT120, GeForce 8400, Intel Core2 Quad, Sun SPARC v4, and OpenMP]
Results (3/3): Bit Error Rate Curve

[Figure: BER (0 to 5.00E-001) vs Codewords (0 to 300,000) for CPU and GT120]
Lessons Learned: Part 2

 High occupancy does not guarantee better performance.
 Although the GPU implementation provides considerable speedup, its BER results are not attractive (in fact, worse than the CPU-based implementation).
 The absence of a double-precision floating-point unit in the GPU impacted the results: the probability-ratio and likelihood-ratio computations are based on double-precision arithmetic.
 Reliability? Random bit flips? These could be catastrophic depending on the application for which LDPC decoding is used.
 Other programming paradigms: OpenMP? Not as attractive in terms of speedup compared to the GPU, but it gives a better BER curve.
 A case for built-in ECC features within the GPU architecture: the NVIDIA Fermi architecture!
Future Work

 Trying this on an AWGN channel for different error probabilities.
 How does this perform on better GPU architectures? Tesla? Fermi?
 Any other parallelization strategies? CuBLAS routines for sparse matrix computations on the GPU?
Acknowledgement

 We would like to thank Prof. Ali Akoglu and Murat Arabaci (OCSL Lab) for guiding us throughout the course of this project.
References

 Gabriel Falcao, Leonel Sousa, Vitor Silva, "How GPUs Can Outperform ASICs for Fast LDPC Decoding", ICS '09.
 Gabriel Falcao, Leonel Sousa, Vitor Silva, "Parallel LDPC Decoding on the Cell/B.E. Processor", HiPEAC 2009.
 Gregory M. Striemer, Ali Akoglu, "An Adaptive LDPC Engine for Space Based Communication Systems".
Questions: Ask!
Backup Slides
Code Transformation: Likelihood Ratio Init Kernel
Code Transformation: Initprp Decode Kernel
Code Transformation: Likelihood Ratio Kernel
Code Transformation: Probability Ratio Kernel
Code Transformation: Next Guess Kernel