I recently ran into packet loss on in-cluster DNS requests from an Alpine Linux based image running on k3s. This post records how I tracked it down.

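Background needed to follow this post:

Basic knowledge of k8s pods and services.
Basic networking knowledge: iptables DNAT, vxlan, Linux routing.
TCP/IP, reading tcpdump captures, and how a client resolves DNS before making an HTTP request.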

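Main technical points covered:

The musl libc used by Alpine Linux differs from glibc in some details (how /etc/resolv.conf is read and how queries are sent).
How packets flow when a pod makes a DNS request under k3s.
How conntrack works.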

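1. Overview

While developing a Laravel/PHP project I used some third-party SDKs that make HTTP requests. Those requests were very slow: about 2.5 seconds, a little over 5 seconds, or even a timeout. That is what started this debugging session.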

2. Server environment

Master host: Debian 9

Kernel: 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64 GNU/Linux

k3s master: v1.20.5+k3s1 (355fff30)

3. How the bug arises

0x00. k3s network topology diagram

0x01. Routing information on the master host

liuxu@master:~$ ip route
default via 10.158.3.1 dev eth0 onlink
10.42.0.0/24 dev cni0 proto kernel scope link src 10.42.0.1
10.42.2.0/24 via 10.42.2.0 dev flannel.1 onlink
10.158.3.0/24 dev eth0 proto kernel scope link src 10.158.3.24
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown

liuxu@master:~$ ip -d link show flannel.1
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether 42:c0:6a:0f:fc:ad brd ff:ff:ff:ff:ff:ff promiscuity 0
    vxlan id 1 local 10.158.3.24 dev eth0 srcport 0 0 dstport 8472 nolearning ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

liuxu@master:~$ ip -d link show cni0
5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether a6:e6:d3:93:36:60 brd ff:ff:ff:ff:ff:ff promiscuity 0
    bridge forward_delay 1500 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 0 vlan_protocol 802.1Q bridge_id 8000.a6:e6:d3:93:36:60 designated_root 8000.a6:e6:d3:93:36:60 root_port 0 root_path_cost 0 topology_change 0 topology_change_detected 0 hello_timer    0.00 tcn_timer    0.00 topology_change_timer    0.00 gc_timer  246.38 vlan_default_pvid 1 group_fwd_mask 0 group_address 01:80:c2:00:00:00 mcast_snooping 1 mcast_router 1 mcast_query_use_ifaddr 0 mcast_querier 0 mcast_hash_elasticity 4 mcast_hash_max 512 mcast_last_member_count 2 mcast_startup_query_count 2 mcast_last_member_interval 100 mcast_membership_interval 26000 mcast_querier_interval 25500 mcast_query_interval 12500 mcast_query_response_interval 1000 mcast_startup_query_interval 3124 nf_call_iptables 0 nf_call_ip6tables 0 nf_call_arptables 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

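From the output above:

At the overlay network layer, coreDNS and the business pod are on different subnets, which means they run on two different physical servers: coreDNS on the master server, the business pod on the agent server.
Containers in a pod attach to the cni0 bridge through veth, and traffic for another node's pod subnet is routed to flannel.1 (vxlan), so the business pod and coreDNS communicate through flannel.1.
flannel.1 sends its traffic out through eth0.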
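The routing conclusion can be checked with a small longest-prefix-match sketch. The route list is copied from the `ip route` output above; the two pod addresses are the ones that appear in the captures later in this post. The helper name `lookup` is made up for illustration, and the model ignores everything a real kernel lookup also considers (metrics, policy routing, and so on).

```python
import ipaddress

# Routes copied from the `ip route` output on the master.
ROUTES = [
    ("0.0.0.0/0", "eth0"),
    ("10.42.0.0/24", "cni0"),
    ("10.42.2.0/24", "flannel.1"),
    ("10.158.3.0/24", "eth0"),
    ("172.17.0.0/16", "docker0"),
]


def lookup(dst):
    """Return the output device chosen by longest-prefix match (simplified)."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(prefix), dev)
        for prefix, dev in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    _net, dev = max(matches, key=lambda m: m[0].prefixlen)
    return dev


print(lookup("10.42.0.16"))  # coreDNS pod address
print(lookup("10.42.2.87"))  # business pod address
```

On the master, the coreDNS pod address resolves to cni0 and the business pod address resolves to flannel.1, which matches the conclusion that the two pods sit on different nodes.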

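0x02. DNS queries and libc.so

a. The libc.so issue.

In practice there are two libc.so implementations. One is glibc, used by Ubuntu, Debian, CentOS and similar systems. The other is musl, used by Alpine Linux. The business pod's container image is built on Alpine Linux.

There are many small differences between the two: https://wiki.musl-libc.org/functional-differences-from-glibc.html

The ones relevant to this bug:

When resolving a name, both glibc and musl libc send the A and AAAA queries concurrently, in order to support both IPv4 and IPv6.
musl libc does not support options such as single-request-reopen and single-request; these are glibc options, supported only since glibc 2.9 and 2.10.
When /etc/resolv.conf lists several nameserver entries, glibc uses them in order from top to bottom and only falls back to the second nameserver if the first is unreachable. musl libc instead reads all nameserver entries, sends the DNS request to all of them concurrently, and uses whichever reply arrives first.
PHP's curl module and the curl command both use libcurl.so, and libcurl.so uses libc.so, so when debugging PHP's curl you can use the curl command directly instead.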
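To make the "A and AAAA at the same time" point concrete, here is a minimal sketch that encodes the two queries in DNS wire format and shows that, when both leave from one socket, they differ in payload but share the same UDP flow tuple. The helpers `build_query` and `flow_tuple` are illustrative names, not part of any libc; the addresses, port and transaction ids are taken from the captures later in this post.

```python
import struct

QTYPE_A, QTYPE_AAAA = 1, 28


def build_query(name, qtype, txid):
    """Encode a minimal DNS query (RFC 1035 wire format, recursion desired)."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def flow_tuple(src, sport, dst, dport):
    """A UDP flow is identified by addresses and ports, not by the DNS payload."""
    return ("udp", src, sport, dst, dport)


a = build_query("test.example.cn", QTYPE_A, txid=25419)
aaaa = build_query("test.example.cn", QTYPE_AAAA, txid=25789)

# Both queries are sent from one socket, so the source port is identical.
t1 = flow_tuple("10.42.2.87", 35181, "10.43.0.100", 53)
t2 = flow_tuple("10.42.2.87", 35181, "10.43.0.100", 53)
print(a != aaaa, t1 == t2)
```

Two different packets arriving almost simultaneously on one flow is the situation the conntrack references in the solutions section describe as racy.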

b. About /etc/resolv.conf.

liuxu@master:~$ cat /etc/resolv.conf
nameserver 10.43.0.100
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

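With this file, resolving a domain such as test.example.cn goes through the following steps:

Query 10.43.0.100 for the A and AAAA records of test.example.cn.default.svc.cluster.local. The default timeout is 5 seconds; in practice a retry is sent after 2.5 seconds.
Query 10.43.0.100 for the A and AAAA records of test.example.cn.svc.cluster.local.
Query 10.43.0.100 for the A and AAAA records of test.example.cn.cluster.local.
Query 10.43.0.100 for the A and AAAA records of test.example.cn. At this point coreDNS reads the host's /etc/resolv.conf, forwards the request to the host's nameserver, and returns the result.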
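The expansion order in the steps above can be reproduced with a short sketch of the search/ndots rule. The function name `candidates` is made up for illustration. The branch for names with at least `ndots` dots follows the behaviour documented in resolv.conf(5); resolver implementations can differ in that case, but it is not the case hit here, since test.example.cn has only two dots.

```python
def candidates(name, search, ndots):
    """Names tried for `name`, in order, under resolv.conf search/ndots rules."""
    if name.endswith("."):
        return [name]  # already fully qualified, no search list applied
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]  # too few dots: search list first


search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for fqdn in candidates("test.example.cn", search, ndots=5):
    print(fqdn)
```

With ndots:5, every external name with fewer than five dots costs three NXDomain round trips (each with an A and an AAAA query) before the real lookup, which multiplies the chances of hitting the packet loss described below.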

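0x03. Packet path and capture points

When the business pod resolves test.example.cn, the packets travel as follows:

The business pod (10.42.2.87/24) sends the A/AAAA requests to the coreDNS service (10.43.0.100/16).
iptables DNAT rewrites the destination IP from 10.43.0.100 to 10.42.0.16 (the coreDNS pod IP).
The packets enter cni0 on the agent server, which forwards them to the agent's flannel.1.
The agent's flannel.1 encapsulates the requests as vxlan packets and hands them to the agent's eth0 (10.158.3.35/24), which sends them over the cloud provider's network to the master's eth0 (10.158.3.24/24).
The master's eth0 receives vxlan packets (flannel.1, port 8472) and hands them to the master's flannel.1.
The master's flannel.1 decapsulates them to obtain the DNS requests. Because the destination address falls in 10.42.0.0/24, the route table sends them to the master's cni0. (Packets are lost here; this is the cause of the bug.)
The master's cni0 delivers the requests to the coreDNS pod through veth.
coreDNS resolves the requests and the replies travel back along the same path.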
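The DNAT step and its reversal on the reply can be modelled with a toy connection table. This is only a sketch of the idea, assuming a single service with a single backend; the names `dnat_request`, `un_dnat_reply` and the dictionary `conntrack` are invented for illustration, and the real kernel conntrack tracks far more state.

```python
SERVICE = ("10.43.0.100", 53)  # coreDNS service address from the text
BACKEND = ("10.42.0.16", 53)   # coreDNS pod address from the text

conntrack = {}  # (reply source, reply destination) -> original destination


def dnat_request(src, dst):
    """Rewrite service -> pod and remember how to undo it for the reply."""
    if dst == SERVICE:
        conntrack[(BACKEND, src)] = SERVICE
        return src, BACKEND
    return src, dst


def un_dnat_reply(src, dst):
    """Restore the service address on a reply that matches a tracked flow."""
    original = conntrack.get((src, dst))
    return (original, dst) if original else (src, dst)


pod = ("10.42.2.87", 35181)
print(dnat_request(pod, SERVICE))
print(un_dnat_reply(BACKEND, pod))
```

The request leaves with destination 10.42.0.16, and the reply is handed back to the pod with source 10.43.0.100. This is the pattern visible in the agent captures in the next section: replies on flannel.1 come from 10.42.0.16, while the same replies on cni0 come from 10.43.0.100.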

0x04. Packet captures

Packets captured on the agent server's cni0:

...
14:21:47.814064 IP (tos 0x0, ttl 64, id 10593, offset 0, flags [DF], proto UDP (17), length 83)
    10.42.2.87.35181 > 10.42.0.16.domain: [bad udp cksum 0x17d3 -> 0x6300!] 25419+ A? test.example.cn.cluster.local. (55)
14:21:47.814096 IP (tos 0x0, ttl 64, id 10594, offset 0, flags [DF], proto UDP (17), length 83)
    10.42.2.87.35181 > 10.42.0.16.domain: [bad udp cksum 0x17d3 -> 0x468e!] 25789+ AAAA? test.example.cn.cluster.local. (55)
14:21:47.814367 IP (tos 0x0, ttl 62, id 60155, offset 0, flags [DF], proto UDP (17), length 176)
    10.43.0.100.domain > 10.42.2.87.35181: [udp sum ok] 25419 NXDomain*- q: A? test.example.cn.cluster.local. 0/1/0 ns: cluster.local. [5s] SOA ns.dns.cluster.local. hostmaster.cluster.local. 1641665463 7200 1800 86400 5 (148)
14:21:50.316083 IP (tos 0x0, ttl 64, id 10992, offset 0, flags [DF], proto UDP (17), length 83)
    10.42.2.87.35181 > 10.42.0.16.domain: [bad udp cksum 0x17d3 -> 0x468e!] 25789+ AAAA? test.example.cn.cluster.local. (55)
14:21:50.316573 IP (tos 0x0, ttl 62, id 60543, offset 0, flags [DF], proto UDP (17), length 176)
    10.43.0.100.domain > 10.42.2.87.35181: [udp sum ok] 25789 NXDomain*- q: AAAA? test.example.cn.cluster.local. 0/1/0 ns: cluster.local. [5s] SOA ns.dns.cluster.local. hostmaster.cluster.local. 1641665463 7200 1800 86400 5 (148)
...

Packets captured on the agent server's flannel.1:

...
14:21:47.814077 IP (tos 0x0, ttl 63, id 10593, offset 0, flags [DF], proto UDP (17), length 83)
    10.42.2.87.35181 > 10.42.0.16.domain: [bad udp cksum 0x17d3 -> 0x6300!] 25419+ A? test.example.cn.cluster.local. (55)
14:21:47.814100 IP (tos 0x0, ttl 63, id 10594, offset 0, flags [DF], proto UDP (17), length 83)
    10.42.2.87.35181 > 10.42.0.16.domain: [bad udp cksum 0x17d3 -> 0x468e!] 25789+ AAAA? test.example.cn.cluster.local. (55)
14:21:47.814358 IP (tos 0x0, ttl 63, id 60155, offset 0, flags [DF], proto UDP (17), length 176)
    10.42.0.16.domain > 10.42.2.87.35181: [udp sum ok] 25419 NXDomain*- q: A? test.example.cn.cluster.local. 0/1/0 ns: cluster.local. [5s] SOA ns.dns.cluster.local. hostmaster.cluster.local. 1641665463 7200 1800 86400 5 (148)
14:21:50.316100 IP (tos 0x0, ttl 63, id 10992, offset 0, flags [DF], proto UDP (17), length 83)
    10.42.2.87.35181 > 10.42.0.16.domain: [bad udp cksum 0x17d3 -> 0x468e!] 25789+ AAAA? test.example.cn.cluster.local. (55)
14:21:50.316552 IP (tos 0x0, ttl 63, id 60543, offset 0, flags [DF], proto UDP (17), length 176)
    10.42.0.16.domain > 10.42.2.87.35181: [udp sum ok] 25789 NXDomain*- q: AAAA? test.example.cn.cluster.local. 0/1/0 ns: cluster.local. [5s] SOA ns.dns.cluster.local. hostmaster.cluster.local. 1641665463 7200 1800 86400 5 (148)
...

Packets captured on the master server's flannel.1:

...
14:21:47.819316 IP (tos 0x0, ttl 63, id 10593, offset 0, flags [DF], proto UDP (17), length 83)
    10.42.2.87.35181 > 10.42.0.16.domain: [udp sum ok] 25419+ A? test.example.cn.cluster.local. (55)
14:21:47.819323 IP (tos 0x0, ttl 63, id 10594, offset 0, flags [DF], proto UDP (17), length 83)
    10.42.2.87.35181 > 10.42.0.16.domain: [udp sum ok] 25789+ AAAA? test.example.cn.cluster.local. (55)
14:21:47.819436 IP (tos 0x0, ttl 63, id 60155, offset 0, flags [DF], proto UDP (17), length 176)
    10.42.0.16.domain > 10.42.2.87.35181: [bad udp cksum 0x1830 -> 0xbf46!] 25419 NXDomain*- q: A? test.example.cn.cluster.local. 0/1/0 ns: cluster.local. [5s] SOA ns.dns.cluster.local. hostmaster.cluster.local. 1641665463 7200 1800 86400 5 (148)
14:21:50.321330 IP (tos 0x0, ttl 63, id 10992, offset 0, flags [DF], proto UDP (17), length 83)
    10.42.2.87.35181 > 10.42.0.16.domain: [udp sum ok] 25789+ AAAA? test.example.cn.cluster.local. (55)
14:21:50.321598 IP (tos 0x0, ttl 63, id 60543, offset 0, flags [DF], proto UDP (17), length 176)
    10.42.0.16.domain > 10.42.2.87.35181: [bad udp cksum 0x1830 -> 0xa2d4!] 25789 NXDomain*- q: AAAA? test.example.cn.cluster.local. 0/1/0 ns: cluster.local. [5s] SOA ns.dns.cluster.local. hostmaster.cluster.local. 1641665463 7200 1800 86400 5 (148)
...

Packets captured on the master server's cni0:

...
    10.42.0.16.domain > 10.42.2.87.49709: [bad udp cksum 0x1834 -> 0x4f80!] 1395 NXDomain*- q: AAAA? test.example.cn.svc.cluster.local. 0/1/0 ns: cluster.local. [5s] SOA ns.dns.cluster.local. hostmaster.cluster.local. 1641665463 7200 1800 86400 5 (152)
14:21:47.819099 IP (tos 0x0, ttl 64, id 60154, offset 0, flags [DF], proto UDP (17), length 180)
    10.42.0.16.domain > 10.42.2.87.49709: [bad udp cksum 0x1834 -> 0x6ccf!] 804 NXDomain*- q: A? test.example.cn.svc.cluster.local. 0/1/0 ns: cluster.local. [5s] SOA ns.dns.cluster.local. hostmaster.cluster.local. 1641665463 7200 1800 86400 5 (152)
14:21:47.819326 IP (tos 0x0, ttl 62, id 10593, offset 0, flags [DF], proto UDP (17), length 83)
    10.42.2.87.35181 > 10.42.0.16.domain: [udp sum ok] 25419+ A? test.example.cn.cluster.local. (55)
14:21:47.819433 IP (tos 0x0, ttl 64, id 60155, offset 0, flags [DF], proto UDP (17), length 176)
    10.42.0.16.domain > 10.42.2.87.35181: [bad udp cksum 0x1830 -> 0xbf46!] 25419 NXDomain*- q: A? test.example.cn.cluster.local. 0/1/0 ns: cluster.local. [5s] SOA ns.dns.cluster.local. hostmaster.cluster.local. 1641665463 7200 1800 86400 5 (148)
14:21:50.321340 IP (tos 0x0, ttl 62, id 10992, offset 0, flags [DF], proto UDP (17), length 83)
    10.42.2.87.35181 > 10.42.0.16.domain: [udp sum ok] 25789+ AAAA? test.example.cn.cluster.local. (55)
14:21:50.321585 IP (tos 0x0, ttl 64, id 60543, offset 0, flags [DF], proto UDP (17), length 176)
    10.42.0.16.domain > 10.42.2.87.35181: [bad udp cksum 0x1830 -> 0xa2d4!] 25789 NXDomain*- q: AAAA? test.example.cn.cluster.local. 0/1/0 ns: cluster.local. [5s] SOA ns.dns.cluster.local. hostmaster.cluster.local. 1641665463 7200 1800 86400 5 (148)
...

The captures above show that one packet is missing from the master server's cni0 capture:

14:21:47.819323 IP (tos 0x0, ttl 63, id 10594, offset 0, flags [DF], proto UDP (17), length 83)
    10.42.2.87.35181 > 10.42.0.16.domain: [udp sum ok] 25789+ AAAA? test.example.cn.cluster.local. (55)

Because this packet never arrived, coreDNS did not receive that DNS request and sent no reply, so 2.5 seconds later the business pod re-sent the DNS request:

14:21:50.321954 IP (tos 0x0, ttl 62, id 10994, offset 0, flags [DF], proto UDP (17), length 69)
    10.42.2.87.34314 > 10.42.0.16.domain: [udp sum ok] 37446+ AAAA? test.example.cn. (41)
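Finding the missing packet by eye is tedious. One way to do it mechanically is to compare the IP id values seen at two capture points, since forwarding does not change the id. The sketch below does that for the request header lines from the master's flannel.1 and cni0 captures shown earlier (only the request headers are copied in, to keep it short); `ip_ids` is an illustrative helper name.

```python
import re

HEADER = re.compile(r"^(\d\d:\d\d:\d\d\.\d+) IP \(.*?\bid (\d+),")


def ip_ids(capture):
    """Map IP id -> timestamp for every packet header line in tcpdump -v text."""
    found = {}
    for line in capture.splitlines():
        m = HEADER.match(line.strip())
        if m:
            found[int(m.group(2))] = m.group(1)
    return found


flannel = """\
14:21:47.819316 IP (tos 0x0, ttl 63, id 10593, offset 0, flags [DF], proto UDP (17), length 83)
14:21:47.819323 IP (tos 0x0, ttl 63, id 10594, offset 0, flags [DF], proto UDP (17), length 83)
14:21:50.321330 IP (tos 0x0, ttl 63, id 10992, offset 0, flags [DF], proto UDP (17), length 83)
"""
cni0 = """\
14:21:47.819326 IP (tos 0x0, ttl 62, id 10593, offset 0, flags [DF], proto UDP (17), length 83)
14:21:50.321340 IP (tos 0x0, ttl 62, id 10992, offset 0, flags [DF], proto UDP (17), length 83)
"""

seen_in, seen_out = ip_ids(flannel), ip_ids(cni0)
lost = sorted(set(seen_in) - set(seen_out))
print(lost)
```

The only id present on flannel.1 but absent on cni0 is 10594, the AAAA query sent at 14:21:47.819323.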

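A side note: when /etc/resolv.conf does not set timeout, the default is 5 seconds. The retry nevertheless shows up after 2.5 seconds, which is consistent with musl dividing the timeout by its number of attempts (2 by default): 5 / 2 = 2.5 seconds.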
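The 2.5 second figure can be checked against the agent cni0 capture by subtracting the timestamp of the first AAAA query (14:21:47.814096) from that of its retransmission (14:21:50.316083). A minimal sketch, with `gap_seconds` as an illustrative helper name:

```python
from datetime import datetime


def gap_seconds(first, second):
    """Seconds between two tcpdump timestamps taken on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return delta.total_seconds()


gap = gap_seconds("14:21:47.814096", "14:21:50.316083")
print(f"{gap:.6f}")
```

The gap comes out at just over 2.5 seconds, half of the 5 second default.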

0x05. conntrack insert_failed statistics:

Master server:

liuxu@master:~$ sudo conntrack -S
cpu=0           found=0 invalid=1333 ignore=1963913 insert=0 insert_failed=17478 drop=17478 early_drop=0 error=2 search_restart=27053
cpu=1           found=0 invalid=615 ignore=1912454 insert=0 insert_failed=41030 drop=41030 early_drop=0 error=1 search_restart=14663

Agent server:

liuxu@master:~$ sudo conntrack -S
cpu=0           found=304 invalid=136 ignore=145233 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=3208
cpu=1           found=269 invalid=115 ignore=172201 insert=0 insert_failed=0 drop=0 early_drop=0 error=1 search_restart=3267
cpu=2           found=300 invalid=140 ignore=160182 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=3134
cpu=3           found=281 invalid=143 ignore=167805 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=3263

The master server shows a large number of insert_failed and drop packets, while the agent server shows none.
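For monitoring, the per-CPU counters can be summed with a few lines of parsing. This sketch works on the text format shown above (the master's output is pasted in as sample input); `parse_conntrack_stats` is an illustrative helper name, and on a live host the text would come from running `conntrack -S`.

```python
import re


def parse_conntrack_stats(text):
    """Parse `conntrack -S` output into one dict of counters per CPU line."""
    rows = []
    for line in text.splitlines():
        fields = dict(re.findall(r"(\w+)=(\d+)", line))
        if "cpu" in fields:
            rows.append({key: int(value) for key, value in fields.items()})
    return rows


master = """\
cpu=0           found=0 invalid=1333 ignore=1963913 insert=0 insert_failed=17478 drop=17478 early_drop=0 error=2 search_restart=27053
cpu=1           found=0 invalid=615 ignore=1912454 insert=0 insert_failed=41030 drop=41030 early_drop=0 error=1 search_restart=14663
"""

rows = parse_conntrack_stats(master)
totals = {key: sum(row[key] for row in rows) for key in ("insert_failed", "drop")}
print(totals)
```

On the master, the two totals are equal: every failed insert counted here corresponds to a dropped packet.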

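0x06. One thing worth noting:

If I schedule the business pod onto the master server, next to coreDNS, the problem does not occur. Based on the mechanics above, my guess is that a pod on the master sits under cni0 (10.42.0.1/24), in the same subnet as coreDNS, and can reach the DNS service without DNAT.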

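4. Solutions

References:

Differences between musl libc and glibc: https://wiki.musl-libc.org/functional-differences-from-glibc.html

A cloud provider's container team describing the same problem: https://tencentcloudcontainerteam.github.io/2018/10/26/DNS-5-seconds-delay/

weave's research into the problem and their Linux kernel patches: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts

The solution given on the official k8s site: https://kubernetes.io/zh/docs/tasks/administer-cluster/nodelocaldns/

This leaves the following options:

a. Because the business pod is based on Alpine Linux, add an extra nameserver 223.5.5.5 inside the container so that libc.so queries coreDNS and Alibaba Cloud DNS concurrently. Even if the master server drops a packet, the path from the agent server to Alibaba Cloud DNS will not necessarily drop one as well.

b. Upgrade the master server's kernel. The master runs Debian 9 (kernel 4.9), and weave's kernel patch was merged in 4.19, so upgrading to Debian 10 (kernel 4.19) mitigates the bug.

c. Add a DNS cache service to every server or pod, which avoids querying coreDNS on the master server every time. (Worth mentioning: Tencent Cloud's TKE can install the nodelocaldns add-on directly.)

I ended up using option b.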
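When checking which nodes still need the upgrade, note that kernel releases must be compared numerically: as strings, "4.9" sorts after "4.19". A minimal sketch, taking 4.19 as the threshold stated above; `kernel_version` is an illustrative helper, the first release string is the master's from the environment section, and the second is a made-up example of a 4.19 release string.

```python
import re


def kernel_version(release):
    """Extract (major, minor) from a `uname -r` style string."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if not m:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


MIN_FIXED = (4, 19)  # threshold taken from the text above

# Second entry is a hypothetical 4.19 release string, used only as an example.
for release in ("4.9.0-13-amd64", "4.19.0-16-amd64"):
    print(release, kernel_version(release) >= MIN_FIXED)
```

On a live host the release string could be read with `platform.release()` from the standard library.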

Master server after the upgrade:

liuxu@master:~$ sudo conntrack -S
cpu=0           found=0 invalid=155 ignore=1562433 insert=0 insert_failed=13241 drop=0 early_drop=0 error=3 search_restart=28343
cpu=1           found=0 invalid=53 ignore=182256 insert=0 insert_failed=23420 drop=0 early_drop=0 error=5 search_restart=15742

As the output shows, insert_failed is still non-zero, but packets are no longer dropped.