PromQL query to find CPU and memory used over the last week

2024-06-23

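I am trying to write a Prometheus query that tells me what percentage of CPU each namespace has used over a time frame, say a week (plus further queries for memory and network).

The metrics I am trying to use are container_spec_cpu_shares and container_memory_working_set_bytes, but I can't figure out how to sum them over time. Whatever I try either returns 0 or fails with an error.

Any help on how to write a query for this would be greatly appreciated.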

To check the percentage of memory used by each namespace, you will need a query similar to the one below:

sum( container_memory_working_set_bytes{container="", namespace=~".+"} )
by (namespace) / ignoring (namespace) group_left
sum( machine_memory_bytes{}) * 100

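The query above should produce a graph similar to this one:

Disclaimer!:

The screenshot above is from Grafana, for better visibility.

This query does not account for changes in available RAM (node changes, node autoscaling, etc.).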

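To get a metric over a period of time in PromQL, you will need an additional function such as:

avg_over_time(EXPR[time]).

To go back in time and calculate resources at a specific point in time, you will need:

offset TIME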
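As a small illustration of the two building blocks on their own, the sketch below applies each one to the memory metric used above. These are two separate queries; the namespace name my-namespace and the 1h and 1d durations are placeholders chosen for the example, not values the approach requires.

```promql
# Query 1: average working set of each pod-level series in one namespace over the last hour.
# "my-namespace" is a hypothetical namespace name.
avg_over_time(container_memory_working_set_bytes{container="", namespace="my-namespace"}[1h])

# Query 2: the same metric summed per namespace, evaluated as it was one day ago.
sum(container_memory_working_set_bytes{container="", namespace=~".+"} offset 1d) by (namespace)
```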

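Using the pointers above, the query should combine to:

avg_over_time( sum(container_memory_working_set_bytes{container="", namespace=~".+"} offset 45m) by (namespace)[120m:])  / ignoring (namespace) group_left
sum( machine_memory_bytes{}) * 100

The query above calculates the average memory used by each namespace over a 120-minute window and divides it by all memory in the cluster, which gives a percentage. The window is also shifted back in time: it ends 45 minutes before the present.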

Example:

Time of running the query: 20:00
avg_over_time(EXPR[2h:])
offset 45m

The example above will start at 17:15 and run the query up to 19:15. You could modify it to cover the whole week :).
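For the week asked about in the question, the same pattern can be stretched to a seven-day window. This is only a sketch: the 7d range and the 5m subquery resolution are assumed values to tune (a coarser resolution means fewer evaluation steps over such a long window), and the divisor is still the cluster memory at the moment the query runs.

```promql
# Average share of cluster memory per namespace over the last 7 days, in percent.
# 7d and 5m are example values, not taken from the original answer.
avg_over_time(
  sum(container_memory_working_set_bytes{container="", namespace=~".+"}) by (namespace)[7d:5m]
)
/ ignoring (namespace) group_left
sum(machine_memory_bytes{}) * 100
```

Adding an offset inside the selector, as in the query above, would move the whole seven-day window further into the past.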

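If you would like to calculate CPU usage by namespace, you can replace the metrics with the ones below:

container_cpu_usage_seconds_total{} - please check the rate() function when using this metric (it is a counter)
machine_cpu_cores{}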
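Swapping those metrics in could look like the sketch below. It assumes the CPU series carry the same container and namespace labels as the memory series shown in this answer. Because container_cpu_usage_seconds_total is a counter of CPU seconds, rate() over a 7d range already yields the average number of cores in use across that week, so no avg_over_time() wrapper is used here; the 7d range is an example value.

```promql
# Average share of cluster CPU cores per namespace over the last 7 days, in percent.
# Assumes pod-level CPU series are exposed with an empty container label, like the memory series.
sum(
  rate(container_cpu_usage_seconds_total{container="", namespace=~".+"}[7d])
) by (namespace)
/ ignoring (namespace) group_left
sum(machine_cpu_cores{}) * 100
```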

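You could also look at these network metrics:

container_network_receive_bytes_total - please check the rate() function when using this metric (it is a counter)
container_network_transmit_bytes_total - please check the rate() function when using this metric (it is a counter)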
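The answer does not name a cluster-wide total to divide these metrics by, so one option is to express each namespace as a share of all traffic recorded by the same metric. The following is a sketch under that assumption; it also assumes the network series carry a namespace label like the memory series above, and the 7d range is an example value.

```promql
# Bytes received per namespace over the last 7 days, as a percentage of all received bytes.
# The "share of total traffic" denominator is an assumption, not part of the original answer.
sum(increase(container_network_receive_bytes_total{namespace=~".+"}[7d])) by (namespace)
/ ignoring (namespace) group_left
sum(increase(container_network_receive_bytes_total{namespace=~".+"}[7d])) * 100
```

Replacing receive with transmit in both places gives the same figure for outgoing traffic.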

I've provided more explanation below, with an example (memory), a testing method, and a dissection of the queries used.

Let's assume:

Kubernetes cluster 1.18.6 (Kubespray) with 12GB of memory in total:
- master node with 2GB of memory
- worker-one node with 8GB of memory
- worker-two node with 2GB of memory
Prometheus and Grafana installed with: Github.com: Coreos: Kube-prometheus https://github.com/coreos/kube-prometheus
Namespace kruk with a single ubuntu pod set to generate artificial load with the command below:
- $ stress-ng --vm 1 --vm-bytes <AMOUNT_OF_RAM_USED> --vm-method all -t 60m -v

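The artificial load was generated with stress-ng twice:

60 minutes - 1GB of memory used
60 minutes - 2GB of memory used

The percentage of memory used by namespace kruk in this time span:

1GB, which should be around 8.5% of all memory in the cluster (12GB)
2GB, which should be around 17.5% of all memory in the cluster (12GB)

The load from the kruk namespace, as seen from a Prometheus query, looked like this:

The calculation using avg_over_time(EXPR[time:]) / memory in the cluster showed a usage of about 13% ((17.5+8.5)/2) when querying the time span in which the artificial load was generated. This should indicate that the query was correct: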

As for the query used:

avg_over_time( sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m )
by (namespace)[120m:]) / ignoring (namespace) group_left 
sum( machine_memory_bytes{}) * 100

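The query above is very similar to the one at the beginning, but I've made some changes to show only the kruk namespace.

I divided the query explanation into two parts (dividend/divisor).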

Dividend

container_memory_working_set_bytes{container="", namespace="kruk"}

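This metric will output records of memory usage in the kruk namespace. If you were to query all namespaces, note the additional explanation:

namespace=~".+" is a regular expression that matches only when the namespace label contains one or more characters, which keeps series without a namespace out of the aggregated result.
container="" filters the results. It matches only when the container value is empty (the last row in the citation below), which is the pod-level series rather than one series per container.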

container_memory_working_set_bytes{container="POD",endpoint="https-metrics",id="/kubepods/podab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b/e249c12010a27f82389ebfff3c7c133f2a5da19799d2f5bb794bcdb5dc5f8bca",image="k8s.gcr.io/pause:3.2",instance="192.168.0.124:10250",job="kubelet",metrics_path="/metrics/cadvisor",name="k8s_POD_ubuntu_kruk_ab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b_0",namespace="kruk",node="worker-one",pod="ubuntu",service="kubelet"} 692224
container_memory_working_set_bytes{container="ubuntu",endpoint="https-metrics",id="/kubepods/podab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b/fae287e7043ff00da16b6e6a8688bfba0bfe30634c52e7563fcf18ac5850f6d9",image="ubuntu@sha256:5d1d5407f353843ecf8b16524bc5565aa332e9e6a1297c73a92d3e754b8a636d",instance="192.168.0.124:10250",job="kubelet",metrics_path="/metrics/cadvisor",name="k8s_ubuntu_ubuntu_kruk_ab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b_0",namespace="kruk",node="worker-one",pod="ubuntu",service="kubelet"} 2186403840
container_memory_working_set_bytes{endpoint="https-metrics",id="/kubepods/podab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b",instance="192.168.0.124:10250",job="kubelet",metrics_path="/metrics/cadvisor",namespace="kruk",node="worker-one",pod="ubuntu",service="kubelet"} 2187096064

You can read more about the pause container here:

Ianlewis.org: Almighty pause container https://www.ianlewis.org/en/almighty-pause-container
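The relationship between the three rows cited above can be checked with a query like the sketch below, which sums only the per-container series of the pod. For the sample values shown, the two container rows add up to 692224 + 2186403840 = 2187096064, which is exactly the value of the row without a container label. This is why selecting container="" avoids counting the same memory twice.

```promql
# Sum of the per-container series (pause container and ubuntu container) for one pod.
# container!="" keeps only series that have a non-empty container label.
sum(container_memory_working_set_bytes{namespace="kruk", pod="ubuntu", container!=""})
```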

sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m )
by (namespace)

This query sums the results by their respective namespace. offset 1380m is used to go back in time to when the tests were made.

avg_over_time( sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m )
by (namespace)[120m:])

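This query calculates the average of the per-namespace memory sum over a 120m window. Because of the offset, that window ends 1380m before the present time instead of ending now.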
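The empty part after the colon in [120m:] is the subquery resolution; when it is left out, Prometheus falls back to the global evaluation interval. A sketch with an explicit resolution follows, where 1m is an arbitrary example value:

```promql
# Same dividend, but the inner sum is evaluated once per minute across the 120m window.
# The 1m resolution is an example value, not taken from the original answer.
avg_over_time(
  sum(container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m) by (namespace)[120m:1m]
)
```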

You can read more about avg_over_time() here:

Prometheus.io: Aggregation over time https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_time
Prometheus.io: Blog: Subquery support https://prometheus.io/blog/2019/01/28/subquery-support/

Divisor

sum( machine_memory_bytes{})

This metric sums the total memory of every node in the cluster (capacity, not free memory).

EXPR / ignoring (namespace) group_left 
sum( machine_memory_bytes{}) * 100

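Focus on:

/ ignoring (namespace) group_left divides each record in the dividend (each namespace with its average memory over time) by the divisor (all memory in the cluster). You can read more about it here: Prometheus.io: Vector matching https://prometheus.io/docs/prometheus/latest/querying/operators/#vector-matching
* 100 is rather self-explanatory: it multiplies the result by 100 so that it reads as a percentage.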
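Since the divisor is a single value with no labels, the same division can also be written in other ways. The two variants below are sketches that should give the same result for this case: on () with an empty label list matches every left-hand series against the one right-hand series, and scalar() turns the single-element divisor into a plain number so that no vector matching is involved.

```promql
# Variant 1: match on no labels at all instead of ignoring one label.
sum(container_memory_working_set_bytes{container="", namespace=~".+"}) by (namespace)
/ on () group_left
sum(machine_memory_bytes{}) * 100

# Variant 2: convert the cluster total to a scalar before dividing.
sum(container_memory_working_set_bytes{container="", namespace=~".+"}) by (namespace)
/ scalar(sum(machine_memory_bytes{})) * 100
```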

Additional resources:

Prometheus.io: Querying: Basics https://prometheus.io/docs/prometheus/latest/querying/basics/
Timber.io: Blog: PromQL for humans https://timber.io/blog/promql-for-humans/
Grafana.com: Dashboards: 315 https://grafana.com/grafana/dashboards/315

