Perceptual Distortion-Rate Optimization of Long Term Prediction in MPEG AAC Tejaswi Nanjundaswamy, Vinay Melkote, Emmanuel Ravelli and Kenneth Rose Signal Compression Lab Department of ECE UCSB November 7, 2010 Outline 1 Introduction 2 Current approach for LTP parameter value selection 3 Proposed RD optimization of LTP parameter values 4 Results Outline 1 Introduction 2 Current approach for LTP parameter value selection 3 Proposed RD optimization of LTP parameter values 4 Results Audio Coding Audio Signal Audio Coding Signal divided into overlapping frames Audio Coding Frame transformed Audio Coding Frame transformed Audio Coding Transform coecients split into bands Audio Coding Transform coecients split into bands Audio Coding Bands quantized and coded to generate bitstream Audio Coding Coding problem denition: Achieve minimum perceptual distortion at a given rate Perceptual? Based on content, human brain can tolerate (or mask) variable amount of noise in each band Captured in distortion measure as Maximum Noise to Mask Ratio (MNMR) MNMR = Quantization noise energy ∀ bands Masking threshold max Masking threshold estimated via psycho-acoustic analysis of input frame Audio Coding Coding problem denition: Achieve minimum perceptual distortion at a given rate Perceptual? Based on content, human brain can tolerate (or mask) variable amount of noise in each band Captured in distortion measure as Maximum Noise to Mask Ratio (MNMR) MNMR = Quantization noise energy ∀ bands Masking threshold max Masking threshold estimated via psycho-acoustic analysis of input frame Audio Coding Coding problem denition: Achieve minimum perceptual distortion at a given rate Perceptual? Based on content, human brain can tolerate (or mask) variable amount of noise in each band Captured in distortion measure as Maximum Noise to Mask Ratio (MNMR) MNMR = Quantization noise energy ∀ bands Masking threshold max Masking threshold estimated via psycho-acoustic analysis of input frame Audio Coding Coding problem denition: Achieve minimum perceptual distortion at a given rate Perceptual? Based on content, human brain can tolerate (or mask) variable amount of noise in each band Captured in distortion measure as Maximum Noise to Mask Ratio (MNMR) MNMR = Quantization noise energy ∀ bands Masking threshold max Masking threshold estimated via psycho-acoustic analysis of input frame Audio Coding Coding problem denition: Achieve minimum perceptual distortion at a given rate Perceptual? Based on content, human brain can tolerate (or mask) variable amount of noise in each band Captured in distortion measure as Maximum Noise to Mask Ratio (MNMR) MNMR = Quantization noise energy ∀ bands Masking threshold max Masking threshold estimated via psycho-acoustic analysis of input frame Audio Coding MDCT exploits redundancies within the current frame Audio Coding MDCT exploits redundancies within the current frame Audio signal has a repeating pattern Audio Coding MDCT exploits redundancies within the current frame Audio signal has a repeating pattern Previously reconstructed data available at decoder Audio Coding MDCT exploits redundancies within the current frame Audio signal has a repeating pattern Previously reconstructed data available at decoder Can we exploit this correlation? MPEG AAC LTP MPEG AAC uses the Long Term Prediction (LTP) tool MPEG AAC LTP MPEG AAC uses the Long Term Prediction (LTP) tool Predicts current frame from history MPEG AAC LTP MPEG AAC uses the Long Term Prediction (LTP) tool Predicts current frame from history Reference position indicated via lag index MPEG AAC LTP MPEG AAC uses the Long Term Prediction (LTP) tool Predicts current frame from history Reference position indicated via lag index Waveforms matched via gain factor MPEG AAC LTP MPEG AAC uses the Long Term Prediction (LTP) tool Predicts current frame from history Reference position indicated via lag index Waveforms matched via gain factor Transformed MPEG AAC LTP Transformed coecients split into bands MPEG AAC LTP Transformed coecients split into bands Prediction residue generated MPEG AAC LTP Transformed coecients split into bands Prediction residue generated Compared with original MPEG AAC LTP Transformed coecients split into bands Prediction residue generated Compared with original Per band LTP ag set MPEG AAC LTP Transformed coecients split into bands Prediction residue generated Compared with original Per band LTP ag set Per frame ag indicates if LTP is used at all in current frame MPEG AAC LTP The overall LTP parameter set includes Lag index Gain factor Per band LTP ag Per frame LTP ag Outline 1 Introduction 2 Current approach for LTP parameter value selection 3 Proposed RD optimization of LTP parameter values 4 Results Current approach Lag and gain selected to minimize a mean squared prediction error cost Current approach Lag and gain selected to minimize a mean squared prediction error cost The solution results in following choice of lag (L) 2K −1 ∑ x [m ]z [m + 2 K − L 0 ] L = arg max ms=0 L0 ∈[0,2K ) 2K −1 ∑ m=0 z 2 [m + 2K − L0 ] Current approach Lag and gain selected to minimize a mean squared prediction error cost The solution results in following choice of lag (L) 2K −1 ∑ x [m ]z [m + 2 K − L 0 ] L = arg max ms=0 L0 ∈[0,2K ) 2K −1 ∑ m=0 z 2 [m + 2K − L0 ] Current approach Lag and gain selected to minimize a mean squared prediction error cost The solution results in following choice of lag (L) 2K −1 ∑ x [m ]z [m + 2 K − L 0 ] L = arg max ms=0 L0 ∈[0,2K ) 2K −1 ∑ m=0 z 2 [m + 2K − L0 ] Current approach Lag and gain selected to minimize a mean squared prediction error cost The solution results in following choice of lag (L) 2K −1 ∑ x [m ]z [m + 2 K − L 0 ] L = arg max ms=0 L0 ∈[0,2K ) 2K −1 ∑ m=0 z 2 [m + 2K − L0 ] Current approach Gain (G) calculated as 2K −1 G= ∑ x [m]z [m + 2K − L] m=0 2K −1 ∑ m=0 z 2 [m + 2K − L] Gain further quantized to one of the 8 levels Current approach For each band, LTP ag chosen as 1, if Energy of prediction residue < Energy of original coefficients 0, otherwise The per frame ag is set if heuristic bit savings due to LTP > side-information rate of LTP Current approach For each band, LTP ag chosen as 1, if Energy of prediction residue < Energy of original coefficients 0, otherwise The per frame ag is set if heuristic bit savings due to LTP > side-information rate of LTP Current approach Given all LTP parameters, core AAC parameters are selected via a two-loop search (TLS) For every band, an inner loop nds quantization step size for a target distortion criterion The outer loop then nds Human code books that minimize the bits to encode and if this doesn't meet the rate constraint for the frame, the target distortion is changed and inner loop repeated Current approach Given all LTP parameters, core AAC parameters are selected via a two-loop search (TLS) For every band, an inner loop nds quantization step size for a target distortion criterion The outer loop then nds Human code books that minimize the bits to encode and if this doesn't meet the rate constraint for the frame, the target distortion is changed and inner loop repeated Current approach Given all LTP parameters, core AAC parameters are selected via a two-loop search (TLS) For every band, an inner loop nds quantization step size for a target distortion criterion The outer loop then nds Human code books that minimize the bits to encode and if this doesn't meet the rate constraint for the frame, the target distortion is changed and inner loop repeated Motivation Objective results for reference AAC coder without LTP Single instrument file 1 Single instrument file 2 −10 AMNMR in dB AMNMR in dB −9 −10 −11 −12 −13 −12 −14 −16 nopred−TLS nopred−TLS −18 20 25 30 bitrate (kbps) 35 30 35 40 bitrate (kbps) 45 Motivation Objective results for reference AAC coder without LTP Motivation Objective results for reference AAC coder without LTP Motivation Objective results for reference AAC coder without LTP Single instrument file 1 Single instrument file 2 −10 AMNMR in dB AMNMR in dB −9 −10 −11 −12 −13 −12 −14 −16 nopred−TLS nopred−TLS −18 20 25 30 bitrate (kbps) 35 30 35 40 bitrate (kbps) 45 Motivation Objective results for reference AAC coder with and without LTP Single instrument file 1 Single instrument file 2 −10 AMNMR in dB AMNMR in dB −9 −10 −11 −12 −12 −14 −16 nopred−TLS stdLTP−TLS −13 nopred−TLS stdLTP−TLS −18 20 25 30 bitrate (kbps) 35 30 35 40 bitrate (kbps) 45 Motivation We know that TLS is sub-optimal for core AAC parameters selection Could this be the reason for the poor RD performance? Replace TLS with RD optimal Trellis based core AAC parameters selection [Aggarwal et al. 2006] Motivation We know that TLS is sub-optimal for core AAC parameters selection Could this be the reason for the poor RD performance? Replace TLS with RD optimal Trellis based core AAC parameters selection [Aggarwal et al. 2006] Motivation Objective results for Trellis based AAC coder with and without LTP Single instrument file 1 Single instrument file 2 −16 AMNMR in dB AMNMR in dB −16 −18 −20 −18 −20 −22 −24 nopred−Trel stdLTP−Trel −22 20 25 bitrate (kbps) −26 30 nopred−Trel stdLTP−Trel 30 35 bitrate (kbps) 40 Motivation LTP tool is expected to give improvements for single instrument harmonic les Shortcomings attributed to the sub-optimal LTP parameter selection RD optimal approach has to select all encoder parameters with the objective of minimizing perceptual distortion for a given rate Current approach clearly sub-optimal as LTP parameters selected to minimize mean squared prediction error, and independent of core AAC parameters Lag and gain selection ignores eventual prediction switching o in select bands Time domain lag and gain selection eectively considers all transform coecients Lag and gain thus selected not the best when considering a reduced set of coecients The heuristically estimated bit savings due to LTP doesn't match actual bit savings reected after the quantization and coding process Motivation LTP tool is expected to give improvements for single instrument harmonic les Shortcomings attributed to the sub-optimal LTP parameter selection RD optimal approach has to select all encoder parameters with the objective of minimizing perceptual distortion for a given rate Current approach clearly sub-optimal as LTP parameters selected to minimize mean squared prediction error, and independent of core AAC parameters Lag and gain selection ignores eventual prediction switching o in select bands Time domain lag and gain selection eectively considers all transform coecients Lag and gain thus selected not the best when considering a reduced set of coecients The heuristically estimated bit savings due to LTP doesn't match actual bit savings reected after the quantization and coding process Motivation LTP tool is expected to give improvements for single instrument harmonic les Shortcomings attributed to the sub-optimal LTP parameter selection RD optimal approach has to select all encoder parameters with the objective of minimizing perceptual distortion for a given rate Current approach clearly sub-optimal as LTP parameters selected to minimize mean squared prediction error, and independent of core AAC parameters Lag and gain selection ignores eventual prediction switching o in select bands Time domain lag and gain selection eectively considers all transform coecients Lag and gain thus selected not the best when considering a reduced set of coecients The heuristically estimated bit savings due to LTP doesn't match actual bit savings reected after the quantization and coding process Motivation LTP tool is expected to give improvements for single instrument harmonic les Shortcomings attributed to the sub-optimal LTP parameter selection RD optimal approach has to select all encoder parameters with the objective of minimizing perceptual distortion for a given rate Current approach clearly sub-optimal as LTP parameters selected to minimize mean squared prediction error, and independent of core AAC parameters Lag and gain selection ignores eventual prediction switching o in select bands Time domain lag and gain selection eectively considers all transform coecients Lag and gain thus selected not the best when considering a reduced set of coecients The heuristically estimated bit savings due to LTP doesn't match actual bit savings reected after the quantization and coding process Motivation LTP tool is expected to give improvements for single instrument harmonic les Shortcomings attributed to the sub-optimal LTP parameter selection RD optimal approach has to select all encoder parameters with the objective of minimizing perceptual distortion for a given rate Current approach clearly sub-optimal as LTP parameters selected to minimize mean squared prediction error, and independent of core AAC parameters Lag and gain selection ignores eventual prediction switching o in select bands Time domain lag and gain selection eectively considers all transform coecients Lag and gain thus selected not the best when considering a reduced set of coecients The heuristically estimated bit savings due to LTP doesn't match actual bit savings reected after the quantization and coding process Extension Trellis based approach RD optimal for core AAC parameter selection Extension All possible LTP parameter combinations formed Each case RD evaluated via Trellis Case which minimizes the distortion for a give rate forms nal choice Computationally prohibitive as LTP adds signicantly more choices of parameters for gain (8) lag (frame length) per band LTP ags (2 power number of bands) Extension Trellis based approach RD optimal for core AAC parameter selection Extension All possible LTP parameter combinations formed Each case RD evaluated via Trellis Case which minimizes the distortion for a give rate forms nal choice Computationally prohibitive as LTP adds signicantly more choices of parameters for gain (8) lag (frame length) per band LTP ags (2 power number of bands) Extension Trellis based approach RD optimal for core AAC parameter selection Extension All possible LTP parameter combinations formed Each case RD evaluated via Trellis Case which minimizes the distortion for a give rate forms nal choice Computationally prohibitive as LTP adds signicantly more choices of parameters for gain (8) lag (frame length) per band LTP ags (2 power number of bands) Outline 1 Introduction 2 Current approach for LTP parameter value selection 3 Proposed RD optimization of LTP parameter values 4 Results Proposed approach We achieve computational eciency by Limiting LTP lag and gain parameter space by careful selection of prediction survivors Retains the simplicity of time domain lag and gain calculation Limiting number of LTP parameter combinations also limits the number of full RD evaluations Full RD evaluation enables selection of encoder parameters aligned with the end objective of minimizing perceptual distortion for a given rate Trellis approach, which operates in frequency domain, for selecting the band wise quantization and coding parameters, is extended to select the per band LTP ags as well Proposed approach We achieve computational eciency by Limiting LTP lag and gain parameter space by careful selection of prediction survivors Retains the simplicity of time domain lag and gain calculation Limiting number of LTP parameter combinations also limits the number of full RD evaluations Full RD evaluation enables selection of encoder parameters aligned with the end objective of minimizing perceptual distortion for a given rate Trellis approach, which operates in frequency domain, for selecting the band wise quantization and coding parameters, is extended to select the per band LTP ags as well Proposed approach We achieve computational eciency by Limiting LTP lag and gain parameter space by careful selection of prediction survivors Retains the simplicity of time domain lag and gain calculation Limiting number of LTP parameter combinations also limits the number of full RD evaluations Full RD evaluation enables selection of encoder parameters aligned with the end objective of minimizing perceptual distortion for a given rate Trellis approach, which operates in frequency domain, for selecting the band wise quantization and coding parameters, is extended to select the per band LTP ags as well Proposed approach We achieve computational eciency by Limiting LTP lag and gain parameter space by careful selection of prediction survivors Retains the simplicity of time domain lag and gain calculation Limiting number of LTP parameter combinations also limits the number of full RD evaluations Full RD evaluation enables selection of encoder parameters aligned with the end objective of minimizing perceptual distortion for a given rate Trellis approach, which operates in frequency domain, for selecting the band wise quantization and coding parameters, is extended to select the per band LTP ags as well Proposed approach We achieve computational eciency by Limiting LTP lag and gain parameter space by careful selection of prediction survivors Retains the simplicity of time domain lag and gain calculation Limiting number of LTP parameter combinations also limits the number of full RD evaluations Full RD evaluation enables selection of encoder parameters aligned with the end objective of minimizing perceptual distortion for a given rate Trellis approach, which operates in frequency domain, for selecting the band wise quantization and coding parameters, is extended to select the per band LTP ags as well Proposed approach We achieve computational eciency by Limiting LTP lag and gain parameter space by careful selection of prediction survivors Retains the simplicity of time domain lag and gain calculation Limiting number of LTP parameter combinations also limits the number of full RD evaluations Full RD evaluation enables selection of encoder parameters aligned with the end objective of minimizing perceptual distortion for a given rate Trellis approach, which operates in frequency domain, for selecting the band wise quantization and coding parameters, is extended to select the per band LTP ags as well Prediction survivors P lag indices with the highest normalized cross-correlation are retained 2K −1 R [L] = ∑ x [m]z [m + 2K − L] m s=0 2K −1 ∑ m=0 z 2 [m + 2K − L] Prediction survivors P lag indices with the highest normalized cross-correlation are retained 2K −1 R [L] = ∑ x [m]z [m + 2K − L] m s=0 2K −1 ∑ m=0 z 2 [m + 2K − L] Prediction survivors P lag indices with the highest normalized cross-correlation are retained 2K −1 R [L] = ∑ x [m]z [m + 2K − L] m s=0 2K −1 ∑ m=0 z 2 [m + 2K − L] Prediction survivors P lag indices with the highest normalized cross-correlation are retained 2K −1 R [L] = ∑ x [m]z [m + 2K − L] m s=0 2K −1 ∑ m=0 z 2 [m + 2K − L] Prediction survivors P lag indices with the highest normalized cross-correlation are retained 2K −1 R [L] = ∑ x [m]z [m + 2K − L] m s=0 2K −1 ∑ m=0 z 2 [m + 2K − L] Prediction survivors P lag indices with the highest normalized cross-correlation are retained 2K −1 R [L] = ∑ x [m]z [m + 2K − L] m s=0 2K −1 ∑ m=0 z 2 [m + 2K − L] Prediction survivors Forming the P lag survivors Prediction survivors Gain value for each of these lags found 2K −1 ∑ x [m]z [m + 2K − L] G[L] = m2=K0−1 ∑ m=0 z 2 [m + 2K − L] Prediction survivors Closest Q quantization levels to each gain value are retained Prediction survivors Closest Q quantization levels to each gain value are retained Prediction survivors Closest Q quantization levels to each gain value are retained Prediction survivors To form the overall PQ survivorsg Transformation For each survivor, prediction residue is calculated and transformedg RD evaluation Each of these are rate-distortion evaluated via Trellisg Per frame LTP ag To nd per frame ag, current frame is transformed and RD evaluated Final selection Parameters resulting in minimum distortion for the given rate chosen Trellis optimization Trellis with stages for each band l , states in each stage for every combination of per band LTP ags, quantization and coding parameter values Trellis optimization Each state associated with corresponding distortion and rate costs Transition between states associated with costs to dierentially encode quantization and coding parameters Dynamic programming pursued to nd optimal path through trellis This path corresponds to optimal set of per band parameters Trellis optimization Each state associated with corresponding distortion and rate costs Transition between states associated with costs to dierentially encode quantization and coding parameters Dynamic programming pursued to nd optimal path through trellis This path corresponds to optimal set of per band parameters Trellis optimization Each state associated with corresponding distortion and rate costs Transition between states associated with costs to dierentially encode quantization and coding parameters Dynamic programming pursued to nd optimal path through trellis This path corresponds to optimal set of per band parameters Trellis optimization Each state associated with corresponding distortion and rate costs Transition between states associated with costs to dierentially encode quantization and coding parameters Dynamic programming pursued to nd optimal path through trellis This path corresponds to optimal set of per band parameters Low complexity variant For low complexity Trellis replaced with Two Loop Search Low complexity variant For low complexity Trellis replaced with Two Loop Search Outline 1 Introduction 2 Current approach for LTP parameter value selection 3 Proposed RD optimization of LTP parameter values 4 Results Objective evaluation results For Trellis based AAC coder with and without LTPQpg Single instrument file 1 Single instrument file 2 −16 AMNMR in dB AMNMR in dB −16 −18 −20 −18 −20 −22 −24 nopred−Trel stdLTP−Trel −22 20 25 bitrate (kbps) −26 30 nopred−Trel stdLTP−Trel 30 35 bitrate (kbps) 40 Objective evaluation results Along with results for the proposed coder (with P = 20 and Q = 6) Single instrument file 1 Single instrument file 2 −16 AMNMR in dB AMNMR in dB −16 −18 −20 20 25 bitrate (kbps) −20 −22 −24 nopred−Trel stdLTP−Trel propLTP−Trel −22 −18 −26 30 nopred−Trel stdLTP−Trel propLTP−Trel 30 35 bitrate (kbps) 40 Objective evaluation results Trellis based coders compared to the TLS based codersQpg Single instrument file 1 Single instrument file 2 −10 −12 AMNMR in dB AMNMR in dB −10 −14 −16 nopred−TLS stdLTP−TLS nopred−Trel stdLTP−Trel propLTP−Trel −18 −20 −22 20 25 30 bitrate (kbps) −15 −20 −25 35 nopred−TLS stdLTP−TLS nopred−Trel stdLTP−Trel propLTP−Trel 30 35 40 bitrate (kbps) 45 Objective evaluation results Along with results for proposed low complexity coder Single instrument file 1 Single instrument file 2 −10 −12 AMNMR in dB AMNMR in dB −10 −14 −16 nopred−TLS stdLTP−TLS propLTP−TLS nopred−Trel stdLTP−Trel propLTP−Trel −18 −20 −22 20 25 30 bitrate (kbps) −15 −20 −25 35 nopred−TLS stdLTP−TLS propLTP−TLS nopred−Trel stdLTP−Trel propLTP−Trel 30 35 40 bitrate (kbps) 45 Objective evaluation results For other lesQpg Polyphonic file with dominant instrument Single instrument file 3 −10 AMNMR in dB AMNMR in dB −10 −15 nopred−TLS stdLTP−TLS propLTP−TLS nopred−Trel stdLTP−Trel propLTP−Trel −20 −25 25 30 35 bitrate (kbps) −15 −20 nopred−TLS stdLTP−TLS propLTP−TLS nopred−Trel stdLTP−Trel propLTP−Trel −25 40 45 25 30 35 bitrate (kbps) 40 45 Objective evaluation results For other lesQpg Speech file Complex polyphonic file −8 −8 AMNMR in dB AMNMR in dB −10 −12 −14 nopred−TLS stdLTP−TLS propLTP−TLS nopred−Trel stdLTP−Trel propLTP−Trel −16 −18 20 25 30 35 bitrate (kbps) −10 −12 nopred−TLS stdLTP−TLS propLTP−TLS nopred−Trel stdLTP−Trel propLTP−Trel −14 −16 40 20 25 30 35 bitrate (kbps) 40 45 Subjective evaluation MUSHRA listening tests for coders operating at 32 kbps 12 listeners score on a scale of 0 (bad) to 100 (excellent) Plots show average MUSHRA scores and 95% condence interval Subjective evaluation results Single instrument file 1 100 80 Avg. MUSHRA score 80 Avg. MUSHRA score Single instrument file 2 100 60 40 20 60 40 20 0 0 ref rel rel anc nop T red pro p T LTP ref l anc red nop Tre rel PT pLT pro Subjective evaluation results Single instrument file 3 100 100 80 Avg. MUSHRA score Avg. MUSHRA score 80 Polyphonic file with dominant instrument 60 40 20 60 40 20 0 0 ref rel rel anc nop T red pro p T LTP ref l anc red nop Tre rel PT pLT pro Subjective evaluation results Speech file 100 80 Avg. MUSHRA score 80 Avg. MUSHRA score Complex polyphonic file 100 60 40 20 60 40 20 0 0 ref rel rel anc nop T red pro p T LTP ref l anc red nop Tre rel PT pLT pro Summary Current approach for LTP parameter selection is sub-optimal Joint selection of LTP and core AAC parameters which optimize perceptual distortion-rate performance proposed Low complexity two-loop search based variant also proposed Subjective and objective evaluations show substantial improvements We conclude that when rightly optimized LTP can be a potent tool Summary Current approach for LTP parameter selection is sub-optimal Joint selection of LTP and core AAC parameters which optimize perceptual distortion-rate performance proposed Low complexity two-loop search based variant also proposed Subjective and objective evaluations show substantial improvements We conclude that when rightly optimized LTP can be a potent tool Summary Current approach for LTP parameter selection is sub-optimal Joint selection of LTP and core AAC parameters which optimize perceptual distortion-rate performance proposed Low complexity two-loop search based variant also proposed Subjective and objective evaluations show substantial improvements We conclude that when rightly optimized LTP can be a potent tool Summary Current approach for LTP parameter selection is sub-optimal Joint selection of LTP and core AAC parameters which optimize perceptual distortion-rate performance proposed Low complexity two-loop search based variant also proposed Subjective and objective evaluations show substantial improvements We conclude that when rightly optimized LTP can be a potent tool Summary Current approach for LTP parameter selection is sub-optimal Joint selection of LTP and core AAC parameters which optimize perceptual distortion-rate performance proposed Low complexity two-loop search based variant also proposed Subjective and objective evaluations show substantial improvements We conclude that when rightly optimized LTP can be a potent tool Thank you for your attention Questions?