问题摘要我们曾多次观察到,我们的集群运行时遇到这样的问题:一个或多个节点上的一个或多个 Pod 无法启动(Pod 内的一个或多个容器未启动)。 Pod 显示“创建 Pod 沙箱失败”错误。在“受影响的”节点上重新启动 docker 或 kubelet 并不能解决问题。此外,终止并重新创建受影响的 EC2 实例并不能解决问题。如果一个 pod(启动失败的和“健康”的)被重新调度到其他节点上,它们将无法启动并出现相同的错误。因此,只要至少一个 pod 因“创建 pod 沙箱失败”错误而无法启动,整个集群就会被破坏。我们没有找到修复集群的方法 - 只有完全重新创建集群才能帮助我们的应用程序启动并运行。
环境* 库伯内特斯 1.9.3
* 使用在 AWS 上运行的 kops (v 1.8) 创建的集群(专用网络拓扑)
* 网络:weave-net
* 集群:1个主节点,3个节点
事件实例时间线
- 我们使用 kops 执行滚动集群更新,以使用我们构建的新 AMI(基于 kops ami k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-11-27)启动节点和主节点。调整 kops AMI 从来都不是问题 - 我们应用了 Linux 更新和一些与 Kubernetes 无关的小更改。
- 在 Kubernetes 仪表板中注意到,其中一个 pod(代理 pod)无法启动,并出现以下错误:0/4 个节点可用:1 NodeUnschedulable、1 PodToleratesNodeTaints、3 NoVolumeZoneConflich。创建 pod 沙箱失败。
- 由守护进程集(fluidd pod)创建的另一个 pod 与上面的调度在同一节点上,但出现了略有不同的错误:网络未就绪:[运行时网络未就绪:NetworkReady=false 原因:NetworkPluginNotReady 消息:docker:网络插件未安装准备好:cni clnfig 未初始化]
*根据 Kubernetes 的数据,受影响的节点是健康的。使用kubectl描述节点查看,受影响的节点有足够的资源来运行pod
-
查看 kubelet 日志,我们观察到 Broker Pod 的以下内容:
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.364224 1346 kubelet.go:1835] SyncLoop (ADD, "api"): "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)"
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.512797 1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "authservice-ca" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-authservice-ca") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.512862 1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "broker-internal-cert" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-broker-internal-cert") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.512934 1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "ssl-certs" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-ssl-certs") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.512958 1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.512980 1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "broker-prometheus-config" (UniqueName: "kubernetes.io/configmap/8ead64a3-28f3-11e8-b520-025c267c6ea8-broker-prometheus-config") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.513002 1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-vrhqr" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-default-token-vrhqr") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: E0316 08:29:54.513117 1346 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\"" failed. No retries permitted until 2018-03-16 08:29:55.013051677 +0000 UTC m=+33.865177055 (durationBeforeRetry 500ms). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa\" (UniqueName: \"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\") pod \"broker-0\" (UID: \"8ead64a3-28f3-11e8-b520-025c267c6ea8\") "
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.613329 1346 reconciler.go:262] operationExecutor.MountVolume started for volume "broker-prometheus-config" (UniqueName: "kubernetes.io/configmap/8ead64a3-28f3-11e8-b520-025c267c6ea8-broker-prometheus-config") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.613544 1346 reconciler.go:262] operationExecutor.MountVolume started for volume "default-token-vrhqr" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-default-token-vrhqr") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.613614 1346 reconciler.go:262] operationExecutor.MountVolume started for volume "authservice-ca" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-authservice-ca") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.613660 1346 reconciler.go:262] operationExecutor.MountVolume started for volume "broker-internal-cert" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-broker-internal-cert") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.613698 1346 reconciler.go:262] operationExecutor.MountVolume started for volume "ssl-certs" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-ssl-certs") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.616720 1346 operation_generator.go:522] MountVolume.SetUp succeeded for volume "broker-prometheus-config" (UniqueName: "kubernetes.io/configmap/8ead64a3-28f3-11e8-b520-025c267c6ea8-broker-prometheus-config") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.621948 1346 operation_generator.go:522] MountVolume.SetUp succeeded for volume "ssl-certs" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-ssl-certs") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.624467 1346 operation_generator.go:522] MountVolume.SetUp succeeded for volume "authservice-ca" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-authservice-ca") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.626039 1346 operation_generator.go:522] MountVolume.SetUp succeeded for volume "broker-internal-cert" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-broker-internal-cert") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.626604 1346 operation_generator.go:522] MountVolume.SetUp succeeded for volume "default-token-vrhqr" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-default-token-vrhqr") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:55 ip-172-20-85-48 kubelet[1346]: I0316 08:29:55.014972 1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:55 ip-172-20-85-48 kubelet[1346]: E0316 08:29:55.015086 1346 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\"" failed. No retries permitted until 2018-03-16 08:29:56.015048945 +0000 UTC m=+34.867174315 (durationBeforeRetry 1s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa\" (UniqueName: \"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\") pod \"broker-0\" (UID: \"8ead64a3-28f3-11e8-b520-025c267c6ea8\") "
Mar 16 08:29:56 ip-172-20-85-48 kubelet[1346]: I0316 08:29:56.017902 1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:56 ip-172-20-85-48 kubelet[1346]: E0316 08:29:56.018024 1346 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\"" failed. No retries permitted until 2018-03-16 08:29:58.017982038 +0000 UTC m=+36.870107444 (durationBeforeRetry 2s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa\" (UniqueName: \"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\") pod \"broker-0\" (UID: \"8ead64a3-28f3-11e8-b520-025c267c6ea8\") "
Mar 16 08:29:58 ip-172-20-85-48 kubelet[1346]: I0316 08:29:58.023714 1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:29:58 ip-172-20-85-48 kubelet[1346]: E0316 08:29:58.023871 1346 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\"" failed. No retries permitted until 2018-03-16 08:30:02.023814124 +0000 UTC m=+40.875939520 (durationBeforeRetry 4s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa\" (UniqueName: \"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\") pod \"broker-0\" (UID: \"8ead64a3-28f3-11e8-b520-025c267c6ea8\") "
Mar 16 08:30:02 ip-172-20-85-48 kubelet[1346]: I0316 08:30:02.033951 1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:30:02 ip-172-20-85-48 kubelet[1346]: E0316 08:30:02.034045 1346 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\"" failed. No retries permitted until 2018-03-16 08:30:10.034017896 +0000 UTC m=+48.886143256 (durationBeforeRetry 8s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa\" (UniqueName: \"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\") pod \"broker-0\" (UID: \"8ead64a3-28f3-11e8-b520-025c267c6ea8\") "
Mar 16 08:30:10 ip-172-20-85-48 kubelet[1346]: I0316 08:30:10.055813 1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:30:10 ip-172-20-85-48 kubelet[1346]: I0316 08:30:10.060992 1346 operation_generator.go:1111] Controller attach succeeded for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8") device path: "/dev/xvdcr"
Mar 16 08:30:10 ip-172-20-85-48 kubelet[1346]: I0316 08:30:10.156111 1346 reconciler.go:262] operationExecutor.MountVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:30:10 ip-172-20-85-48 kubelet[1346]: I0316 08:30:10.156188 1346 operation_generator.go:446] MountVolume.WaitForAttach entering for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8") DevicePath "/dev/xvdcr"
Mar 16 08:30:11 ip-172-20-85-48 kubelet[1346]: I0316 08:30:11.156560 1346 operation_generator.go:455] MountVolume.WaitForAttach succeeded for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8") DevicePath "/dev/xvdcr"
Mar 16 08:30:12 ip-172-20-85-48 kubelet[1346]: I0316 08:30:12.128286 1346 operation_generator.go:484] MountVolume.MountDevice succeeded for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8") device mount path "/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/eu-central-1b/vol-04145a1c9d1a26280"
Mar 16 08:30:12 ip-172-20-85-48 kubelet[1346]: I0316 08:30:12.370928 1346 operation_generator.go:522] MountVolume.SetUp succeeded for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
Mar 16 08:30:12 ip-172-20-85-48 kubelet[1346]: I0316 08:30:12.672408 1346 kuberuntime_manager.go:385] No sandbox for pod "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" can be found. Need to start a new one
Mar 16 08:30:15 ip-172-20-85-48 kubelet[1346]: I0316 08:30:15.337311 1346 kubelet.go:1880] SyncLoop (PLEG): "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)", event: &pleg.PodLifecycleEvent{ID:"8ead64a3-28f3-11e8-b520-025c267c6ea8", Type:"ContainerStarted", Data:"b08ea5b45ce3ba467856952ad6cc095f4b796673d7dfbf3b9c4029b6b1a75a1b"}
Mar 16 08:34:12 ip-172-20-85-48 kubelet[1346]: E0316 08:34:12.672940 1346 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Mar 16 08:34:12 ip-172-20-85-48 kubelet[1346]: E0316 08:34:12.672956 1346 kuberuntime_manager.go:647] createPodSandbox for pod "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Mar 16 08:34:12 ip-172-20-85-48 kubelet[1346]: E0316 08:34:12.673020 1346 pod_workers.go:186] Error syncing pod 8ead64a3-28f3-11e8-b520-025c267c6ea8 ("broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)"), skipping: failed to "CreatePodSandbox" for "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" with CreatePodSandboxError: "CreatePodSandbox for pod \"broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Mar 16 08:34:12 ip-172-20-85-48 kubelet[1346]: E0316 08:34:12.673020 1346 pod_workers.go:186] Error syncing pod 8ead64a3-28f3-11e8-b520-025c267c6ea8 ("broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)"), skipping: failed to "CreatePodSandbox" for "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" with CreatePodSandboxError: "CreatePodSandbox for pod \"broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Mar 16 08:34:12 ip-172-20-85-48 kubelet[1346]: E0316 08:34:12.673020 1346 pod_workers.go:186] Error syncing pod 8ead64a3-28f3-11e8-b520-025c267c6ea8 ("broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)"), skipping: failed to "CreatePodSandbox" for "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" with CreatePodSandboxError: "CreatePodSandbox for pod \"broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Mar 16 08:34:13 ip-172-20-85-48 kubelet[1346]: I0316 08:34:13.301580 1346 kuberuntime_manager.go:416] Sandbox for pod "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" has no IP address. Need to start a new one
Mar 16 08:34:14 ip-172-20-85-48 kubelet[1346]: I0316 08:34:14.005589 1346 kubelet.go:1880] SyncLoop (PLEG): "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)", event: &pleg.PodLifecycleEvent{ID:"8ead64a3-28f3-11e8-b520-025c267c6ea8", Type:"ContainerDied", Data:"b08ea5b45ce3ba467856952ad6cc095f4b796673d7dfbf3b9c4029b6b1a75a1b"}
通过在 docker 守护进程和 kubelet 重新启动后强制将另一个 pod (adapter-mqtt-vertx) 重新安排到“受影响的节点”上,重现了另一个 pod (adapter-mqtt-vertx) 的问题,产生了类似的结果
有谁知道这里的问题是什么以及补救措施是什么?
[runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready:cni clnfig uninitialized]
这篇文章中有很多信息,但据我所知,pod 沙箱的问题总是 SDN 和/或 CNI 错误配置。我会确保/etc/cni/net.d
, /opt/cni/bin
, and 相关机械 https://kubernetes.io/docs/concepts/cluster-administration/network-plugins/都很酷。遗憾的是我的经验太少了调试CNI 的东西,通常我的问题只是因为忘记了其中一个目录(或拥有错误的权限)而导致的。我肯定没有任何建设性的东西可以提供调试 Weave :-(
希望您的情况也是一个“简单”问题,或者我希望其他人能够权衡谁比我拥有更多的 CNI 或 Weave 调试技能
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)