Notes on Tuning a Media Server's Network Stack

Published on 2026-03-31

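I recently worked on a watermarking gateway, where I was mainly responsible for the kernel module that handles SIP signaling. After the whole software stack was deployed to a new environment, we ran into severe UDP performance problems. This post documents the investigation and what we learned.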

Background

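According to GB28181, surveillance traffic is split into a control plane and a data plane. The control plane is SIP, carried mainly over UDP, over TCP in some scenarios, and over TLS only in rare edge cases. The data plane is mainly RTP, with three transport modes: UDP (media server -> viewing client), TCP passive (media server -> viewing client), and TCP active (viewing client -> media server).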

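TCP is stream-oriented and reliable, so when dealing with network problems we only need to make sure TCP does not build up a large backlog (and we have not seen that in our current scenarios). UDP, however, is an unreliable transport protocol: when UDP packets are lost or reordered, the viewing client drops frames and the video stutters.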

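During early development we used Ubuntu 22.04. Ubuntu is mature and ships with reasonable defaults, so the problem was not noticeable. The product, however, has to be deployed on a certain other operating system, and there the network performance problems became obvious: heavy packet loss on both Tx and Rx.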

The server has 12 cores and 64 GB of RAM and runs kernel 5.10.

Tx Troubleshooting

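We first noticed that the viewing client stuttered frequently when pulling a large number of video streams over UDP. The client also reported a high packet loss rate and inbound packet loss. The next step was to pin down exactly why packets were being dropped, so we went straight to pwru.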

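Because this kernel is Linux 5.10, it lacks the kfree_skb_reason and sk_skb_reason_drop functions, so we had pwru filter on the kfree_skb function and wrote a packet filter to make sure that everything we saw was media server traffic.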

In our implementation the streaming service uses a fixed port range, which gives us: ./pwru --filter-func="kfree_skb" --output-stack 'udp and portrange 10000-30000'

In the pwru output we can observe:

kfree_skb
kfree_skb_list
__dev_queue_xmit
ip_finish_output2
ip_send_skb
udp_send_skb
udp_sendmsg
sock_sendmsg
____sys_sendmsg
___sys_sendmsg
__sys_sendmmsg
__x64_sys_sendmmsg
do_syscall_64

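As the stack shows, the skb is freed just as the kernel is about to hand the packet to the NIC's transmit queue, so the NIC queues deserve a closer look. We checked them with tc -s qdisc show dev eno1: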

qdisc mq 0: root
 Sent 10787280147 bytes 10373351 pkt (dropped 12914, overlimits 0 requeues 1)
 backlog 0b 0p requeues 1
qdisc fq 0: parent :4 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 1511237374 bytes 1779800 pkt (dropped 5798, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 1338 (inactive 1314 throttled 0)
  gc 0 highprio 0 throttled 147232 latency 11.1us flows_plimit 5798
qdisc fq 0: parent :3 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 5784089071 bytes 4955313 pkt (dropped 3206, overlimits 0 requeues 1)
 backlog 0b 0p requeues 1
  flows 1347 (inactive 1320 throttled 0)
  gc 0 highprio 2 throttled 314729 latency 12.8us flows_plimit 3206
qdisc fq 0: parent :2 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 1996773949 bytes 1990332 pkt (dropped 2256, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 1341 (inactive 1329 throttled 0)
  gc 0 highprio 0 throttled 239306 latency 10.8us flows_plimit 2256

Check the NIC's Tx queue length with ip link show eno1 | grep qlen:

2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000

Check the NIC's ring buffer size with ethtool -g eno1:

Ring parameters for eno1:
Pre-set maximums:
RX:		4096
RX Mini:	n/a
RX Jumbo:	n/a
TX:		4096
Current hardware settings:
RX:		256
RX Mini:	n/a
RX Jumbo:	n/a
TX:		256
RX Buf Len:		n/a
TX Push:	n/a

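The software queues (qdisc) show a large number of drops, 12914 in total, and the NIC ring buffer holds only 256 entries. In our scenario, with 16 streams running at the same time, the traffic looks like this: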

==========================================================================================
Time       | RX PPS          | TX PPS          | RX throughput      | TX throughput
------------------------------------------------------------------------------------------
16:54:38   | 6181 p/s        | 6134 p/s        | 52.82 Mbps         | 61.31 Mbps
16:54:39   | 7006 p/s        | 7974 p/s        | 58.80 Mbps         | 81.73 Mbps
16:54:40   | 7521 p/s        | 7485 p/s        | 64.15 Mbps         | 75.12 Mbps
16:54:41   | 7130 p/s        | 7013 p/s        | 61.38 Mbps         | 70.21 Mbps
16:54:42   | 6459 p/s        | 6550 p/s        | 54.60 Mbps         | 66.86 Mbps
16:54:43   | 5004 p/s        | 5286 p/s        | 37.85 Mbps         | 53.84 Mbps
16:54:44   | 7152 p/s        | 7079 p/s        | 62.14 Mbps         | 71.10 Mbps
16:54:45   | 6058 p/s        | 5993 p/s        | 49.99 Mbps         | 58.24 Mbps
16:54:46   | 6193 p/s        | 6328 p/s        | 53.08 Mbps         | 61.21 Mbps
16:54:47   | 8239 p/s        | 7384 p/s        | 74.50 Mbps         | 71.94 Mbps

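Putting this together with the configuration shown above: the NIC currently has only 256 Tx slots available for sending, which is not enough for our workload. In addition, every fq qdisc in the tc output reports flow_limit 100p, meaning the software queue holds at most 100 packets per flow and drops anything beyond that. The flows_plimit counter of each fq qdisc equals its dropped counter, so these drops come from the per-flow limit.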
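The per-qdisc counters can also be extracted programmatically. Below is a minimal Python sketch, not part of the original tooling, that parses the text printed by tc -s qdisc show and pulls out dropped and flows_plimit for each qdisc. The function name parse_qdisc_drops and the shortened SAMPLE text are illustrative assumptions; the regular expressions are written against the output format shown in this post and may need adjusting for other iproute2 versions.

```python
import re

# Shortened excerpt of the tc output shown in this post
# (the fq header line is truncated for brevity).
SAMPLE = """\
qdisc mq 0: root
 Sent 10787280147 bytes 10373351 pkt (dropped 12914, overlimits 0 requeues 1)
 backlog 0b 0p requeues 1
qdisc fq 0: parent :4 limit 10000p flow_limit 100p buckets 1024
 Sent 1511237374 bytes 1779800 pkt (dropped 5798, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 1338 (inactive 1314 throttled 0)
  gc 0 highprio 0 throttled 147232 latency 11.1us flows_plimit 5798
"""


def parse_qdisc_drops(tc_output: str) -> list:
    """Return one dict per qdisc: kind, parent, dropped, flows_plimit."""
    qdiscs = []
    current = None
    for line in tc_output.splitlines():
        # A header line starts a new qdisc, e.g. "qdisc fq 0: parent :4 ..."
        header = re.match(r"qdisc (\S+) (\S+) (?:root|parent (\S+))", line)
        if header:
            current = {
                "kind": header.group(1),
                "parent": header.group(3) or "root",
                "dropped": 0,
                "flows_plimit": 0,
            }
            qdiscs.append(current)
            continue
        if current is None:
            continue
        dropped = re.search(r"\(dropped (\d+),", line)
        if dropped:
            current["dropped"] = int(dropped.group(1))
        plimit = re.search(r"flows_plimit (\d+)", line)
        if plimit:
            current["flows_plimit"] = int(plimit.group(1))
    return qdiscs


for q in parse_qdisc_drops(SAMPLE):
    print(q["kind"], q["parent"], q["dropped"], q["flows_plimit"])
```

In practice the input would be the captured output of tc -s qdisc show dev eno1. Comparing dropped with flows_plimit per fq qdisc shows whether the per-flow limit or the overall limit is responsible for the drops.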

Solution

The problem is now clear: the current queue sizes cannot keep up with our high-throughput UDP traffic, so the queues need to be enlarged.

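Note: this machine has 12 cores and 64 GB of RAM and a single NIC whose maximum ring buffer size is 4096/4096 (check with ethtool -g <nic>) and which supports 4 queues (check with ethtool -l <nic>). Adjust the values below to match your own hardware.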

Because the data plane carries TCP as well as UDP, the tuning has to account for both. Start with sysctl:

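# fs-related settings omitted; they mainly raise the per-process and system-wide fd limits

# Use fq (fair queueing). fq is the recommended pairing for BBR, and since we mix TCP and UDP
# we do not consider queues such as pfifo_fast
net.core.default_qdisc = fq
# Raise the Rx backlog to guard against Rx drops
net.core.netdev_max_backlog = 16384
# Raise the maximum number of packets handled per softirq round
# (larger batches per round, but fewer context switches)
net.core.netdev_budget = 600
# Raise the accept queue length
net.core.somaxconn = 16384

# Raise socket buffer sizes
net.core.rmem_max = 52428800
net.core.rmem_default = 26214400
net.core.wmem_max = 52428800
net.core.wmem_default = 26214400

# TCP tuning
# BBR congestion control
net.ipv4.tcp_congestion_control = bbr
# omitted
net.ipv4.tcp_moderate_rcvbuf = 1
# Raise the TCP read/write buffers
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
# omitted

# Raise the total UDP buffer memory a bit so sends do not run out of space.
# The values are counted in pages; given what the GPU can currently handle, this is effectively unlimited.
net.ipv4.udp_mem = 1572864 2359296 3145728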
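
Since net.ipv4.udp_mem is expressed in pages rather than bytes, it helps to convert the three values (min, pressure, max) before judging whether they are reasonable for a given machine. The sketch below does that conversion in Python. It assumes a 4 KiB page size, which should be confirmed with getconf PAGESIZE on the target system; the helper name pages_to_gib is made up for illustration.

```python
# Assumption: 4 KiB pages. Verify with `getconf PAGESIZE` on the target host.
PAGE_SIZE = 4096


def pages_to_gib(pages: int, page_size: int = PAGE_SIZE) -> float:
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


# The three udp_mem fields are min, pressure and max, in that order.
udp_mem = {"min": 1572864, "pressure": 2359296, "max": 3145728}

for name, pages in udp_mem.items():
    print(f"{name}: {pages} pages = {pages_to_gib(pages):g} GiB")
```

Under the 4 KiB assumption the three thresholds work out to 6, 9 and 12 GiB, which is small relative to the 64 GB of RAM on this server but far above what the traffic shown earlier would need.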

Write a systemd service unit to make the qdisc-related settings persistent:

[Unit]
Description=Network Tuning for %i
After=network-online.target
BindsTo=sys-subsystem-net-devices-%i.device

[Service]
Type=oneshot
RemainAfterExit=yes

# Max out the NIC ring buffers
ExecStart=-/sbin/ethtool -G %i rx 4096 tx 4096

# Increase the NIC's transmit queue length
ExecStart=-/sbin/ip link set %i txqueuelen 10000

# Keep mq as the root qdisc, with 4 Tx fq queues attached under it
ExecStart=-/sbin/tc qdisc replace dev %i root handle 1: mq
# Enlarge the fq queues, especially the per-flow limit
ExecStart=/bin/bash -c 'for q in /sys/class/net/%i/queues/tx-*; do \
    idx=$(basename $q | cut -d"-" -f2); \
    handle=$((idx + 1)); \
    /sbin/tc qdisc replace dev %i parent 1:$handle fq limit 100000 flow_limit 10000; \
done'

# Pin Tx queues to CPUs with XPS to avoid cache misses; compute each mask and convert it to hex
ExecStart=/bin/bash -c 'echo "007" > /sys/class/net/%i/queues/tx-0/xps_cpus'
ExecStart=/bin/bash -c 'echo "038" > /sys/class/net/%i/queues/tx-1/xps_cpus'
ExecStart=/bin/bash -c 'echo "1c0" > /sys/class/net/%i/queues/tx-2/xps_cpus'
ExecStart=/bin/bash -c 'echo "e00" > /sys/class/net/%i/queues/tx-3/xps_cpus'

[Install]
WantedBy=multi-user.target
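
The XPS masks in the unit above (007, 038, 1c0, e00) assign 3 consecutive CPUs to each of the 4 Tx queues on a 12-core machine. The following Python sketch shows one way such masks could be derived for other core and queue counts. The function name xps_masks is hypothetical, and the sketch assumes that CPUs are numbered contiguously from 0, that they split evenly across the queues, and that there are at most 32 CPUs (larger systems write the mask in comma-separated 32-bit groups, which this sketch does not handle).

```python
def xps_masks(n_cpus: int, n_queues: int) -> list:
    """Give each Tx queue a contiguous, equally sized group of CPUs.

    Returns one zero-padded hex mask per queue, in queue order.
    """
    if n_cpus > 32:
        raise ValueError("this sketch only handles up to 32 CPUs")
    if n_cpus % n_queues != 0:
        raise ValueError("this sketch only handles an even split")
    per_queue = n_cpus // n_queues
    width = (n_cpus + 3) // 4  # hex digits needed to cover every CPU bit
    masks = []
    for q in range(n_queues):
        # per_queue consecutive bits, shifted to this queue's CPU group
        mask = ((1 << per_queue) - 1) << (q * per_queue)
        masks.append(f"{mask:0{width}x}")
    return masks


# 12 cores and 4 Tx queues, as on the server in this post
for queue, mask in enumerate(xps_masks(12, 4)):
    print(f"tx-{queue}: {mask}")
```

Whether contiguous grouping is the right choice depends on the CPU topology (NUMA nodes, SMT siblings), so the result is only a starting point to check against lscpu.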

Verification

Start all streams for a stress test and watch the following metrics closely:

tc -s qdisc show dev eno1: does backlog keep growing? Are any packets dropped?
netstat -su: do the error counters increase?

After 2 hours of stress testing, the results were as follows:

➜  ~ tc -s qdisc show dev eno1
qdisc mq 1: root
 Sent 52015346550 bytes 44145498 pkt (dropped 0, overlimits 0 requeues 29)
 backlog 0b 0p requeues 29
qdisc fq 8005: parent 1:1 limit 100000p flow_limit 10000p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 9494275686 bytes 7405899 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 1489 (inactive 1488 throttled 0)
  gc 0 highprio 0 throttled 413425 latency 13us
qdisc fq 8007: parent 1:3 limit 100000p flow_limit 10000p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 10636670030 bytes 10433732 pkt (dropped 0, overlimits 0 requeues 3)
 backlog 0b 0p requeues 3
  flows 1485 (inactive 1484 throttled 0)
  gc 0 highprio 0 throttled 540617 latency 15.4us
qdisc fq 8006: parent 1:2 limit 100000p flow_limit 10000p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 11805631935 bytes 11047758 pkt (dropped 0, overlimits 0 requeues 4)
 backlog 0b 0p requeues 4
  flows 1485 (inactive 1484 throttled 0)
  gc 0 highprio 0 throttled 550048 latency 13.5us
qdisc fq 8008: parent 1:4 limit 100000p flow_limit 10000p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 20078768899 bytes 15258109 pkt (dropped 0, overlimits 0 requeues 22)
 backlog 0b 0p requeues 22
  flows 1503 (inactive 1501 throttled 0)
  gc 0 highprio 0 throttled 342987 latency 13.7us

Only a small number of requeues occurred; when a requeue does happen, the video pauses briefly.

And during the stress test, netstat -su showed:

Udp:
    15108213 packets received
    266954 packets to unknown port received
    458 packet receive errors
    18791271 packets sent
    458 receive buffer errors
    43397 send buffer errors
    IgnoredMulti: 14291

The error counters did not increase during the stress test.
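
Watching whether these counters move is easier with a small script than by eye. The UDP figures printed by netstat -su come from the Udp lines of /proc/net/snmp, where "packets to unknown port received" corresponds to NoPorts, "packet receive errors" to InErrors, and the two buffer error lines to RcvbufErrors and SndbufErrors. The Python sketch below parses that file's text and computes the difference between two snapshots. The function names are illustrative, and the zero values for InCsumErrors and MemErrors in SAMPLE are assumptions, since netstat did not print them.

```python
# Sample in /proc/net/snmp layout, filled with the netstat readings from this post.
# InCsumErrors and MemErrors are assumed to be 0.
SAMPLE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti MemErrors
Udp: 15108213 266954 458 18791271 458 43397 0 14291 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti MemErrors
UdpLite: 0 0 0 0 0 0 0 0 0
"""

ERROR_KEYS = ("NoPorts", "InErrors", "RcvbufErrors", "SndbufErrors")


def parse_udp_counters(snmp_text: str) -> dict:
    """Parse the two 'Udp:' lines (names, then values) into a dict."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) != 2:
        raise ValueError("expected one name line and one value line for Udp")
    names, values = rows
    return dict(zip(names, (int(v) for v in values)))


def counter_delta(before: dict, after: dict, keys=ERROR_KEYS) -> dict:
    """Return how much each selected counter grew between two snapshots."""
    return {key: after[key] - before[key] for key in keys}


counters = parse_udp_counters(SAMPLE)
print({key: counters[key] for key in ERROR_KEYS})
```

On a live system, the text of /proc/net/snmp would be read once before and once after the stress test, and counter_delta applied to the two parsed snapshots; all zeros for the error keys would match the observation above.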

Rx Troubleshooting

We also found packet loss on Rx, so we investigated it the same way, with pwru:

kfree_skb
__udp4_lib_rcv
ip_protocol_deliver_rcu
ip_local_deliver_finish
ip_sublist_rcv_finish
ip_list_rcv_finish.constprop.0
ip_list_rcv
__netif_receive_skb_list_core
__netif_receive_skb_list
netif_receive_skb_list_internal
napi_complete_done
igb_poll[igb]
napi_poll
net_rx_action
__softirqentry_text_start
asm_call_sysvec_on_stack
do_softirq_own_stack
irq_exit_rcu
common_interrupt
asm_common_interrupt
cpuidle_enter_state
cpuidle_enter
cpuidle_idle_call
do_idle
cpu_startup_entry
secondary_startup_64_no_verify

Going straight to the kernel source, it is easy to find:

__UDP_INC_STATS(net, UDP_MIB_NOPORTS, proto == IPPROTO_UDPLITE);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);

/*
 * Hmm.  We got an UDP packet to a port to which we
 * don't wanna listen.  Ignore it.
 */
kfree_skb(skb);
return 0;

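In other words, no socket was listening on the destination port. Combined with the netstat readings from the Tx investigation, the picture is fairly clear: 266954 packets to unknown port received, and the number keeps rising.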

These drops also occur when the viewing client closes a stream and then requests it again, so the sequence is not hard to reconstruct:

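The viewing client sends a BYE message
The kmod receives the BYE and notifies user space
The user-space program calls the media server and releases the associated socket
The media server stops pushing the stream <- only at this point does UDP sending stop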

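No call stack other than this one showed up, which means that even if Tx still loses packets, it is purely a network problem rather than a protocol stack problem.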