BBRx: Extending BBR for Customized TCP Performance
NetDev 0x12, Montreal, Canada
Jae Won Chung, Feng Li and Beomjun Kim
jaewon.chung@viasat.com, {feng.li, beomjun.kim}@verizon.com

Objectives
• BBR is a promising next-generation TCP congestion-avoidance candidate, but it has room for performance improvement: better management of bottleneck queuing delay.
• Extend BBR with a method to find an optimal throughput-delay operating point in an LTE environment, maximizing per-flow throughput within a bounded loss rate and delay.
• Practically deploy learning techniques to ensure optimal TCP performance at all times, while minimizing the risk of deploying a pre-tuned learning algorithm in the fast path.

Theory Behind BBRx – PI Controller for TCP Estimation Noise
• BBR is a special case of BBRx where γ = 1 and β = 0.

Code Added to tcp_bbr.c (1)

static void bbr_main(struct sock *sk, const struct rate_sample *rs)
{
    struct bbr *bbr = inet_csk_ca(sk);
    u32 bw;

    bbr_update_model(sk, rs);
    bw = bbr_bw(sk);
    bw = bbrx_target_bw(sk, rs, bw);
    bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
    bbr_set_tso_segs_goal(sk);
    bbr_set_cwnd(sk, rs, rs->acked_sacked, bw, bbr->cwnd_gain);
}

• Baselined at linux-4.15.18
• ICSK_CA_PRIV_SIZE = 104
• Added struct tcp_bbrx_info to tcp_cc_info

Code Added to tcp_bbr.c (2)

/* BBRX: find target tx rate (bw) using PI control logic */
static u32 bbrx_target_bw(struct sock *sk, const struct rate_sample *rs, u32 bw)
{
    struct bbr *bbr = inet_csk_ca(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    u32 tgt_rtt_us, min_rtt_us, rtt_us = 0;
    u32 tgt_bw, tgt_bw_adj;

    min_rtt_us = bbr->min_rtt_us;
    if (unlikely(min_rtt_us == 0)) {
        min_rtt_us = tcp_min_rtt(tp);
        if (unlikely(min_rtt_us == 0))
            min_rtt_us = 100; /* assume a min RTT of 100 us */
    }
    if (rs->rtt_us > 0)
        rtt_us = rs->rtt_us;
    tgt_rtt_us = (bbr->k * min_rtt_us) / BBRX_K_MIN;

    /* PI control logic */
    if (rtt_us > tgt_rtt_us) {
        tgt_bw_adj = bbr->beta * bw / 100;
        tgt_bw_adj *= rtt_us - tgt_rtt_us;
        tgt_bw_adj /= min_rtt_us;
        tgt_bw = bw * BBRX_GAMMA / 100 - tgt_bw_adj;
    } else {
        tgt_bw = bw * BBRX_GAMMA / 100;
        if (BBRX_PI_UP_CTR) {
            tgt_bw_adj = bbr->beta * bw / 100;
            tgt_bw_adj *= tgt_rtt_us - rtt_us;
            tgt_bw_adj /= min_rtt_us;
            tgt_bw += tgt_bw_adj;
        }
    }
    return max_t(u32, BBRX_TGT_BW_MIN, tgt_bw);
}

Control Parameter Values?
• Target utilization γ = 1 (or a little less, say 0.98)
• Epoch δ = min RTT (Rmin)
• Target-RTT deciding factor k ≥ 4
• Reduced PI parameter β > 0 (perhaps < 1)
• Not trivial to pick! Tuning challenge: the optimal parameter range may differ across network conditions (C and Rmin).

Learning Agent
• Tuning methods (user space):
  - Frequency response analysis
  - Empirical tuning with a learning agent
[Figure: TCP stack reports per-flow stats to the learning agent, which feeds config params back to the stack]

Learning Agent: BBRx Auto-Tuning Method
• Subscribe to TCP flow stats via a NetLink socket.
• Classify TCP flows into bins based on the reported bottleneck bandwidth (C) and Rmin.
• For each traffic-class bin:
  - Compute the average utility of flows once enough samples are collected (default 40).
  - Find the minimum k and the corresponding β that yield the highest average utility (U) using a gradient-ascent algorithm.
  - Update the BBRx kernel module parameter table.

Sample reported flow stats:
"cong_control": {
  "bbrx": {
    "bbrx_bw_lo": 12584068,
    "bbrx_bw_hi": 0,
    "bbrx_min_rtt": 1845,
    "bbrx_brst_len": 9973,
    "bbrx_brst_tput": 48788,
    "bbrx_brst_ploss": 0,
    "bbrx_brst_k": 6,
    "bbrx_brst_beta": 50
  }
}
Utility: Queue Length vs. Optimal β Range
• With a small β = 0.01, BBRx becomes BBR and yields a lower flow utility due to overflows as the bottleneck buffer size is reduced.
• A large β beyond the optimal value (β = 0.6) decreases utility due to control instability (a larger-magnitude sinusoidal pattern of high queuing delay followed by link under-utilization).
• The system has a wide stable β range, [0.2, 0.6], providing a large margin of configuration freedom.

Utility: BW vs. RTT vs. Optimal β Range
• No single optimal range covers all network conditions.
• The stable β margins are wide (narrowing as BW decreases and RTT grows).
[Figure: utility vs. β for BBW (C) = 15, 35 and 75 Mbps; BBRx params: γ=1, k=6; bottleneck: qlen = 175ms, flows = 6]

BBRx Multi-Range Configuration Table
• Network-condition-based configuration approach.
• A BBRx sender starts with the default parameter set (β = 0.45, k = 4).
• The BBRx sender consults the table when entering the PROBE_BW state.
• The learning-agent daemon updates each bin separately.

Recommended Default Values Based on Emulation Results

Rmin (ms) \ C (Mbps) | [0,3)        | [3,10)       | [10,1k)      | [1k,∞)
[0,50)               | β=0.75, k=4  | β=0.75, k=4  | β=0.75, k=4  | β=0.75, k=4
[50,100)             | β=0.45, k=4  | β=0.45, k=4  | β=0.75, k=4  | β=0.75, k=4
[100,∞)              | β=0.25, k=4  | β=0.45, k=4  | β=0.75, k=4  | β=0.75, k=4

Preliminary Evaluations <General Purpose Config – 1st Delay, 2nd Goodput>

Emulation Topology
[Figure: iperf3 client on the host and iperf3 server in a container, connected through veth0, a bridge and veth1]

> tc qdisc show dev veth0
qdisc netem 1: root refcnt 2 limit 1000 delay 12.0ms
qdisc tbf 2: parent 1:1 rate 100Mbit burst 384Kb lat 132.4ms
> tc qdisc show dev veth1
qdisc netem 3: root refcnt 2 limit 1000 delay 13.0ms

1-Flow Test: C=100Mbps, Rmin=25ms, qlen=132ms

CUBIC
[ ID] Interval          Transfer    Bandwidth       Retr
[  4] 0.00-20.00 sec    213 MBytes  89.4 Mbits/sec  104   sender
[  4] 0.00-20.00 sec    211 MBytes  88.6 Mbits/sec        receiver

BBR
[ ID] Interval          Transfer    Bandwidth       Retr
[  4] 0.00-20.00 sec    229 MBytes  96.0 Mbits/sec  0     sender
[  4] 0.00-20.00 sec    226 MBytes  94.6 Mbits/sec        receiver

BBRx (γ=0.98, k=4, β=0.5)
[ ID] Interval          Transfer    Bandwidth       Retr
[  4] 0.00-20.00 sec    225 MBytes  94.4 Mbits/sec  0     sender
[  4] 0.00-20.00 sec    221 MBytes  92.9 Mbits/sec        receiver

[Figures: outstanding window (owin) and RTT over time for CUBIC, BBR and BBRx (γ=0.98, k=4, β=0.5); C = 100Mbps, Rmin = 25ms, qlen = 132ms, flow = 1]

4-Flows Test: C=100Mbps, Rmin=25ms, qlen=132ms

CUBIC
[ ID]  Interval          Transfer    Bandwidth       Retr
[SUM]  0.00-20.00 sec    235 MBytes  98.4 Mbits/sec  123   sender
[SUM]  0.00-20.00 sec    224 MBytes  93.9 Mbits/sec        receiver

BBR
[ ID]  Interval          Transfer    Bandwidth       Retr
[SUM]  0.00-20.00 sec    240 MBytes  100 Mbits/sec   0     sender
[SUM]  0.00-20.00 sec    225 MBytes  94.3 Mbits/sec        receiver

BBRx (γ=0.98, k=4, β=0.5)
[ ID]  Interval          Transfer    Bandwidth       Retr
[SUM]  0.00-20.00 sec    239 MBytes  100 Mbits/sec   843   sender
[SUM]  0.00-20.00 sec    224 MBytes  93.8 Mbits/sec        receiver

[Figures: owin and RTT over time for CUBIC, BBR and BBRx (γ=0.98, k=4, β=0.5); C = 100Mbps, Rmin = 25ms, qlen = 132ms, flow = 4]

BBRx RTTs
[Figure: BBRx RTT traces; scenario: C = 100Mbps, Rmin = 25ms, qlen = 132ms, flow = 4; BBRx params: γ=0.98, k=4, β=0.5]

Preliminary Evaluations <Wireless PEP Config – 1st Goodput, 2nd Delay, Small Number of Flows per Device>

1-Flow Test: C=100Mbps, Rmin=25ms, qlen=132ms
• BBRx (γ=1, k=6, β=0.5): Tput = 93.1 Mbits/sec, Re-Tx = 0
• BBRx (γ=1, k=8, β=0.5): Tput = 93.6 Mbits/sec, Re-Tx = 0
[Figures: BBRx RTT traces for k=6 and k=8]

4G Stationary Test – Good RF (SINR > 25dB)
• 100MB file download
• BBR flow averaged 92.6 Mbps.
• BBRx flow averaged 110.2 Mbps.
• Setup: Galaxy S7 (LTE-Advanced with Carrier Aggregation) client, HP 460c Gen8 server.
[Figures: BBRx owin and RTT traces (γ=1, k=6, β=0.5)]

Summary
• BBRx: introduces a PI control function to BBR.
• Exports per-flow TCP stats to user space via a NetLink socket.
• Learning Agent:
  - Adopts a utility function to score average TCP CA performance and adjusts the BBRx control parameters to yield the best utility while keeping the RTT to a minimum.
  - The loosely coupled TCP-tuning feedback control loop provides a novel way to monitor and adjust TCP parameters per the performance goal in real time, while minimizing the risk of deploying a pre-tuned learning algorithm in the fast path.

Current Status
• Preliminary evaluation results look promising:
  - BBRx reduces shallow-buffer overflows by reacting to RTT.
  - TCP performance can be customized for the LTE access network to find an optimal throughput-delay operating point (increase k until the maximum U is found).
• Proposed a kernel patch (pending) to get TCP congestion-control information via a NetLink socket on the flow-termination event.
• BBRx and the TCP stat collector code are available at: https://github.com/ultragoose/bbrx

Future Work
• More evaluations in 4G, 5G and satellite environments.
• Evaluate fairness among BBRx flows.
• Evaluate PI control function variations, such as the one used in ABC.

Questions?