现在的问题是在测试环境全新启动过程中会突发所有 Pod 同时被删除重启的问题,原因是 SandboxChanged 。 > Warning FailedCreatePodSandBox 14m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9070dfeabddbd48c08e36377868ed671240653ff4a9fd4fdf03b4af9c9b72dfe": context deadline exceeded > Normal SandboxChanged 14m (x2 over 16m) kubelet Pod sandbox changed, it will be killed and re-created.
原因是 Node 短暂失联后恢复,导致所有 Pod 重新部署。
> Warning ContainerGCFailed 31m kubelet rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/snap/microk8s/common/run/containerd.sock: connect: no such file or directory" > Normal NodeNotReady 31m kubelet Node c7-test-env-11-206 status is now: NodeNotReady
估计是太多服务同时启动,CPU 压力过高导致 Node 上的 API 超时,从而 kubelet 认为 Node 断联导致的。 将 CPU 增加到 8 个后此问题再没有重现。但因为资源不足,某些测试环境暂时无法增加 CPU 。