我在 Google Kubernetes 集群上启用了自动缩放,并且我可以看到其中一个 Pod 的使用率要低得多
我总共有 6 个节点,我预计至少有这个节点被终止。我经历了以下事情:https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
我已将此注释添加到我的所有 pod 中
cluster-autoscaler.kubernetes.io/safe-to-evict: true
但是,集群自动缩放程序可以正确扩展,但不会像我预期的那样缩小。
我有以下日志
$ kubectl logs kube-dns-autoscaler-76fcd5f658-mf85c -n kube-system
autoscaler/pkg/autoscaler/k8sclient/k8sclient.go:90: Failed to list *v1.Node: Get https://10.55.240.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.55.240.1:443: getsockopt: connection refused
E0628 20:34:36.187949 1 reflector.go:190] github.com/kubernetes-incubator/cluster-proportional-autoscaler/pkg/autoscaler/k8sclient/k8sclient.go:90: Failed to list *v1.Node: Get https://10.55.240.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.55.240.1:443: getsockopt: connection refused
E0628 20:34:47.191061 1 reflector.go:190] github.com/kubernetes-incubator/cluster-proportional-autoscaler/pkg/autoscaler/k8sclient/k8sclient.go:90: Failed to list *v1.Node: Get https://10.55.240.1:443/api/v1/nodes?resourceVersion=0: net/http: TLS handshake timeout
I0628 20:35:10.248636 1 autoscaler_server.go:133] ConfigMap not found: Get https://10.55.240.1:443/api/v1/namespaces/kube-system/configmaps/kube-dns-autoscaler: net/http: TLS handshake timeout, will create one with default params
E0628 20:35:17.356197 1 autoscaler_server.go:95] Error syncing configMap with apiserver: configmaps "kube-dns-autoscaler" already exists
E0628 20:35:18.191979 1 reflector.go:190] github.com/kubernetes-incubator/cluster-proportional-autoscaler/pkg/autoscaler/k8sclient/k8sclient.go:90: Failed to list *v1.Node: Get https://10.55.240.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.55.240.1:443: i/o timeout
我不确定以上是否是相关日志,调试此问题的正确方法是什么?
我的 Pod 有本地存储。我一直在尝试使用来调试这个问题
kubectl drain gke-mynode-d57ded4e-k8tt
error: DaemonSet-managed pods (use --ignore-daemonsets to ignore): fluentd-gcp-v3.1.1-qzdzs, prometheus-to-sd-snqtn; pods with local storage (use --delete-local-data to override): mydocs-585879b4d5-g9flr, istio-ingressgateway-9b889644-v8bgq, mydocs-585879b4d5-7lmzk
我认为忽略是安全的daemonsets
因为 CA 应该可以驱逐它,但是我不确定如何让 CA 理解 mydocs 可以被驱逐并在添加注释后移动到另一个节点
EDIT
The min and the max nodes have been set correctly as seen on the GCP console