Is autovacuum (VACUUM) the reason this PostgreSQL UPDATE query occasionally takes hours to finish?

2024-03-27

This SQL query normally takes only a few minutes to run:

update import_parts ip
set part_manufacturer_id = pslc.part_manufacturer_id
from parts.part_supplier_line_codes pslc
where trim(lower(ip.line_code)) = trim(lower(pslc.supplier_line_code))
and (ip.status is null or ip.status != '6')
and ip.distributor_id = pslc.distributor_id
and ip.distributor_id = 196;

But I've noticed that it sometimes gets stuck and is automatically canceled by the 2-hour statement_timeout. Several times when it got stuck, autovacuum was running, and the autovacuum also took a very long time to finish. Below is one instance where the UPDATE query and the autovacuum were both running and both took a long time to finish:

In that instance the autovacuum finished in about an hour, and the UPDATE query finished in almost 2 hours. In other cases the UPDATE query exceeds the 2-hour statement_timeout, so it is automatically canceled.

Now my question is: is autovacuum (VACUUM) the reason the UPDATE query gets stuck or takes hours to finish? If so, what can I do to keep the UPDATE query from getting stuck or becoming so slow? If not, what do you think is causing it to get stuck or become so slow?

We are running PostgreSQL 9.6.15.
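Since this is 9.6, pg_blocking_pids() is available. A minimal diagnostic sketch to run while the UPDATE appears stuck, showing whether it is actually waiting on a lock and which backends hold that lock:

-- Show every non-idle backend, what it is waiting on, and which
-- backends (if any) hold the locks it is waiting for.
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle';

An empty blocked_by array for the UPDATE's pid would mean it is not waiting on any lock, which points to resource contention rather than blocking.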

UPDATE 1

I checked whether our RDS instance was running out of server resources. The instance size is db.t2.medium (2 vCPUs, 4 GB RAM, 1000 IOPS on provisioned-IOPS SSD storage).

Below are the CloudWatch metrics for the past 3 days. Note that the UPDATE query above got stuck multiple times during those 3 days.

UPDATE 2

The only active locks while the UPDATE query and the autovacuum were running:

^ 以红色突出显示的锁是由 autovacuum 创建的锁。以绿色突出显示的锁是更新查询创建的锁。

Here is the list of all database connections while the UPDATE query and the autovacuum were running:

Highlighted in red is the autovacuum; highlighted in green is the UPDATE query.
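The lock list in the screenshot can be reproduced with a query roughly along these lines (a sketch; the exact query behind the screenshot may differ):

-- Active locks with the backend and query that holds or awaits each one.
SELECT l.locktype,
       l.relation::regclass AS relation,
       l.mode,
       l.granted,
       a.pid,
       left(a.query, 80) AS query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
ORDER BY a.pid, l.relation;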

Here is the EXPLAIN output for the UPDATE query:

The parts.part_supplier_line_codes table contains only 2,758 rows. No two rows in the table share the same supplier_line_code + distributor_id.

The import_parts table contains 12.6 million rows.
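That uniqueness claim is easy to verify with an aggregate; note it should check the normalized form that the join actually compares (a sketch):

-- Any rows returned are duplicates under the join's normalization.
SELECT trim(lower(supplier_line_code)) AS normalized_code,
       distributor_id,
       count(*)
FROM parts.part_supplier_line_codes
GROUP BY 1, 2
HAVING count(*) > 1;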

UPDATE 3

Here is the result of EXPLAIN (ANALYZE, BUFFERS):

EXPLAIN (ANALYZE, BUFFERS)
update import_parts ip
set part_manufacturer_id = pslc.part_manufacturer_id
from parts.part_supplier_line_codes pslc
where trim(lower(ip.line_code)) = trim(lower(pslc.supplier_line_code))
and (ip.status is null or ip.status != '6')
and ip.distributor_id = pslc.distributor_id
and ip.distributor_id = 196;

Update on import_parts ip  (cost=2967312.95..3778109.36 rows=31167172 width=156) (actual time=151475.198..151475.198 rows=0 loops=1)
  Buffers: shared hit=62369982 read=453357 dirtied=375348 written=315748, temp read=154212 written=307558
  ->  Merge Join  (cost=2967312.95..3778109.36 rows=31167172 width=156) (actual time=37567.148..84208.239 rows=10326988 loops=1)
        Merge Cond: ((btrim(lower((pslc.supplier_line_code)::text))) = (btrim(lower((ip.line_code)::text))))
        Buffers: shared hit=94397 read=78007, temp read=154212 written=307558
        ->  Sort  (cost=51.70..52.93 rows=493 width=17) (actual time=9.649..10.039 rows=494 loops=1)
              Sort Key: (btrim(lower((pslc.supplier_line_code)::text)))
              Sort Method: quicksort  Memory: 63kB
              Buffers: shared hit=7 read=11
              ->  Index Scan using index_part_supplier_line_codes_on_distributor_id on part_supplier_line_codes pslc  (cost=0.28..29.65 rows=493 width=17) (actual time=2.926..8.677 rows=494 loops=1)
                    Index Cond: (distributor_id = 196)
                    Buffers: shared hit=2 read=11
        ->  Materialize  (cost=2967261.25..3030480.67 rows=12643883 width=146) (actual time=37557.491..76400.550 rows=12642995 loops=1)
              Buffers: shared hit=94390 read=77996, temp read=154212 written=307558
              ->  Sort  (cost=2967261.25..2998870.96 rows=12643883 width=146) (actual time=37557.486..68337.525 rows=12642995 loops=1)
                    Sort Key: (btrim(lower((ip.line_code)::text)))
                    Sort Method: external merge  Disk: 1233688kB
                    Buffers: shared hit=94390 read=77996, temp read=154212 written=154212
                    ->  Seq Scan on import_parts ip  (cost=0.00..362044.24 rows=12643883 width=146) (actual time=0.027..11903.240 rows=12643918 loops=1)
                          Filter: (((status IS NULL) OR ((status)::text <> '6'::text)) AND (distributor_id = 196))
                          Buffers: shared hit=94390 read=77996
Planning time: 0.169 ms
Execution time: 151561.250 ms
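Two things stand out in this plan: the sort of all 12.6 million import_parts rows spills about 1.2 GB to disk (Sort Method: external merge Disk: 1233688kB), and the join key is a computed expression, so no plain column index can serve it. Two hedged experiments, not guaranteed fixes (the index name below is made up):

-- Experiment 1: more sort/hash memory, for this session only. With the
-- tiny pslc side (494 rows), the planner may then prefer a hash join
-- and skip sorting the 12.6M-row side entirely. Keep this session-local
-- on a 4 GB instance.
SET work_mem = '256MB';

-- Experiment 2: an expression index on the canonical form of the join
-- key (the plan shows btrim(lower(...))), letting the planner read
-- import_parts in join-key order without a sort.
CREATE INDEX CONCURRENTLY import_parts_line_code_norm_idx
    ON import_parts (btrim(lower(line_code)));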

UPDATE 4

Given that our AWS RDS instance is provisioned for 1000 IOPS, the read IOPS in the screenshots below look too high. Could that be why the UPDATE query sometimes gets stuck and the autovacuum takes so long to finish?

Read IOPS:

Read latency:

Write latency:

Swap usage:

Queue depth:
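One way to put a number on the I/O-bound theory is the buffer-cache hit ratio; a low value alongside read IOPS pinned at the provisioned limit would support it (a sketch against pg_stat_database):

-- Share of block requests served from shared buffers since stats reset.
SELECT blks_hit,
       blks_read,
       round(blks_hit * 100.0 / nullif(blks_hit + blks_read, 0), 2) AS hit_pct
FROM pg_stat_database
WHERE datname = current_database();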

The result of EXPLAIN (ANALYZE, BUFFERS) after rebooting the AWS RDS instance:

EXPLAIN (ANALYZE, BUFFERS)
update import_parts ip
set part_manufacturer_id = pslc.part_manufacturer_id
from parts.part_supplier_line_codes pslc
where trim(lower(ip.line_code)) = trim(lower(pslc.supplier_line_code))
and (ip.status is null or ip.status != '6')
and ip.distributor_id = pslc.distributor_id
and ip.distributor_id = 196;

Update on import_parts ip  (cost=3111484.57..3919788.11 rows=31071345 width=156) (actual time=180680.200..180680.200 rows=0 loops=1)
  Buffers: shared hit=62263174 read=712018 dirtied=386277 written=223564, temp read=237087 written=390433
  ->  Merge Join  (cost=3111484.57..3919788.11 rows=31071345 width=156) (actual time=64687.806..112959.396 rows=10326988 loops=1)
        Merge Cond: ((btrim(lower((pslc.supplier_line_code)::text))) = (btrim(lower((ip.line_code)::text))))
        Buffers: shared hit=5 read=325434, temp read=237087 written=390433
        ->  Sort  (cost=58.61..59.85 rows=493 width=17) (actual time=4.238..5.549 rows=494 loops=1)
              Sort Key: (btrim(lower((pslc.supplier_line_code)::text)))
              Sort Method: quicksort  Memory: 63kB
              Buffers: shared hit=5 read=11
              ->  Bitmap Heap Scan on part_supplier_line_codes pslc  (cost=7.40..36.56 rows=493 width=17) (actual time=2.582..3.242 rows=494 loops=1)
                    Recheck Cond: (distributor_id = 196)
                    Heap Blocks: exact=7
                    Buffers: shared read=11
                    ->  Bitmap Index Scan on index_part_supplier_line_codes_on_distributor_id  (cost=0.00..7.28 rows=493 width=0) (actual time=1.805..1.805 rows=494 loops=1)
                          Index Cond: (distributor_id = 196)
                          Buffers: shared read=4
        ->  Materialize  (cost=3111425.95..3174450.99 rows=12605008 width=146) (actual time=64683.559..105123.024 rows=12642995 loops=1)
              Buffers: shared read=325423, temp read=237087 written=390433
              ->  Sort  (cost=3111425.95..3142938.47 rows=12605008 width=146) (actual time=64683.554..96764.494 rows=12642995 loops=1)
                    Sort Key: (btrim(lower((ip.line_code)::text)))
                    Sort Method: external merge  Disk: 1233528kB
                    Buffers: shared read=325423, temp read=237087 written=237087
                    ->  Seq Scan on import_parts ip  (cost=0.00..514498.12 rows=12605008 width=146) (actual time=0.748..36768.509 rows=12643918 loops=1)
                          Filter: (((status IS NULL) OR ((status)::text <> '6'::text)) AND (distributor_id = 196))
                          Buffers: shared read=325423
Planning time: 23.127 ms
Execution time: 180803.124 ms

I rebooted the RDS instance to clear PostgreSQL's cache and see whether the problem was cache-related; I had read somewhere that restarting PostgreSQL clears the database's cache. A restart does empty PostgreSQL's shared buffers, which is consistent with the second plan reading almost everything from disk (shared hit=5 on the join node versus shared hit=94397 before the reboot).


Autovacuum never blocks an UPDATE, and vice versa. That is a fundamental design principle of VACUUM; PostgreSQL would not work very well otherwise.

The only way autovacuum can affect the UPDATE is through shared resources, most likely I/O. VACUUM generates I/O load and uses memory and CPU, so you may want to check whether any resource is scarce while autovacuum is running. If so, the answer is to move to more capable hardware: slowing autovacuum down because the machine is too weak is a bad idea, since in the long run that leads to table bloat and other problems.
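To check that in practice, autovacuum's progress and scan volume can be watched while both jobs are running; pg_stat_progress_vacuum was added in exactly this version, 9.6 (a sketch):

-- One row per backend currently running VACUUM, including autovacuum.
SELECT pid,
       relid::regclass AS table_name,
       phase,
       heap_blks_scanned,
       heap_blks_total
FROM pg_stat_progress_vacuum;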
