PARTITION BY with an ORDER BY clause in PostgreSQL

2024-02-27

I have a table containing these values:

user_id ts                  val
uid1    19.05.2019 01:49:50  0
uid1    19.05.2019 01:50:15  0
uid1    19.05.2019 01:50:20  0
uid1    19.05.2019 01:59:50  1
uid1    19.05.2019 02:20:10  1
uid1    19.05.2019 02:20:15  0
uid1    19.05.2019 02:20:19  0
uid1    19.05.2019 02:30:53  1
uid1    19.05.2019 11:10:25  1
uid1    19.05.2019 11:13:40  0
uid1    19.05.2019 11:13:50  0
uid1    19.05.2019 11:20:19  1
uid2    19.05.2019 15:01:44  0
uid2    19.05.2019 15:05:55  0
uid2    19.05.2019 17:19:35  1
uid2    19.05.2019 17:20:01  0
uid2    19.05.2019 17:20:35  0
uid2    19.05.2019 19:15:50  1

When I query this table with only a PARTITION BY clause, the result looks like this:

Query : select *, sum(val) over (partition by user_id) as res from example_table;

user_id ts                  val res
uid1    19.05.2019 01:49:50  0  5
uid1    19.05.2019 01:50:15  0  5
uid1    19.05.2019 01:50:20  0  5
uid1    19.05.2019 01:59:50  1  5
uid1    19.05.2019 02:20:10  1  5
uid1    19.05.2019 02:20:15  0  5
uid1    19.05.2019 02:20:19  0  5
uid1    19.05.2019 02:30:53  1  5
uid1    19.05.2019 11:10:25  1  5
uid1    19.05.2019 11:13:40  0  5
uid1    19.05.2019 11:13:50  0  5
uid1    19.05.2019 11:20:19  1  5
uid2    19.05.2019 15:01:44  0  2
uid2    19.05.2019 15:05:55  0  2
uid2    19.05.2019 17:19:35  1  2
uid2    19.05.2019 17:20:01  0  2
uid2    19.05.2019 17:20:35  0  2
uid2    19.05.2019 19:15:50  1  2

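In the result above, the res column holds the total sum of the val column for each partition. But if I query the table with both PARTITION BY and ORDER BY, I get these results: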

Query: select *, sum(val) over (partition by user_id order by ts) as res from example_table;

user_id ts                  val res
uid1    19.05.2019 01:49:50  0  0
uid1    19.05.2019 01:50:15  0  0
uid1    19.05.2019 01:50:20  0  0
uid1    19.05.2019 01:59:50  1  1
uid1    19.05.2019 02:20:10  1  2
uid1    19.05.2019 02:20:15  0  2
uid1    19.05.2019 02:20:19  0  2
uid1    19.05.2019 02:30:53  1  3
uid1    19.05.2019 11:10:25  1  4
uid1    19.05.2019 11:13:40  0  4
uid1    19.05.2019 11:13:50  0  4
uid1    19.05.2019 11:20:19  1  5
uid2    19.05.2019 15:01:44  0  0
uid2    19.05.2019 15:05:55  0  0
uid2    19.05.2019 17:19:35  1  1
uid2    19.05.2019 17:20:01  0  1
uid2    19.05.2019 17:20:35  0  1
uid2    19.05.2019 19:15:50  1  2

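With the ORDER BY clause, however, the res column holds the cumulative sum of the val column for each row within each partition.

Why? I can't understand this.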

Update

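This behavior is documented here: https://www.postgresql.org/docs/11/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS

4.2.8. Window Function Calls

[..] The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY, this sets the frame to be all rows from the partition start up through the current row's last ORDER BY peer. Without ORDER BY, this means all rows of the partition are included in the window frame, since all rows become peers of the current row.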

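This means:

In the absence of a frame clause, RANGE UNBOUNDED PRECEDING is used by default. That includes:

All rows "preceding" the current row according to the ORDER BY clause
The current row
All rows that have the same values in the ORDER BY columns as the current row

In the absence of an ORDER BY clause, ORDER BY NULL is assumed (though I'm guessing again). The frame therefore includes all rows of the partition, because the value in the ORDER BY column is the same (always NULL) in every row.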
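
The role of peer rows is easiest to see when the ORDER BY column contains ties. The following is a minimal sketch, not part of the original answer; the inline data and the names t, range_default and rows_frame are made up for illustration. It compares the default frame with an explicit ROWS frame:

```sql
-- Hypothetical inline data: two rows share ts = 2, so they are peers of each other.
select ts
     , val
       -- Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
       -- Peers of the current row are included, so both ts = 2 rows get 10 + 20 + 30 = 60.
     , sum(val) over (order by ts) as range_default
       -- Explicit ROWS frame: stops at the current physical row, peers after it are excluded.
     , sum(val) over (
         order by ts
         rows between unbounded preceding and current row
       ) as rows_frame
from (values (1, 10), (2, 20), (2, 30), (3, 40)) as t(ts, val);
```

With this data range_default should come out as 10, 60, 60, 100. rows_frame is 10 for the first row and 100 for the last, but its values on the two tied rows depend on the order in which the ties happen to be processed. In the example table from the question every ts within a user_id is distinct, so the two frames give the same result there.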

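Original answer:

Disclaimer: the following is more of a guess than a qualified answer. I did not find any documentation that confirms what I wrote. At the same time, I think the answers given so far do not explain the behavior correctly.

The reason for the difference in results is not directly the ORDER BY clause, since a + b + c is the same as c + b + a. The reason is (and this is my guess) that the ORDER BY clause implicitly defines the frame clause as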

rows between unbounded preceding and current row

Try the following query:

select *
, sum(val) over (partition by user_id) as res
, sum(val) over (partition by user_id order by ts) as res_order_by
, sum(val) over (
    partition by user_id
    order by ts
    rows between unbounded preceding and current row
  ) as res_order_by_unbounded_preceding
, sum(val) over (
    partition by user_id
    -- order by ts
    rows between unbounded preceding and current row
  ) as res_preceding
, sum(val) over (
    partition by user_id
    -- order by ts
    rows between current row and unbounded following
  ) as res_following
, sum(val) over (
    partition by user_id
    order by ts
    rows between unbounded preceding and unbounded following
  ) as res_orderby_preceding_following

from example_table;

You will see that you can get a cumulative sum without an ORDER BY clause, and that you can also get the "full" sum with an ORDER BY clause.
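
To try this without creating example_table, here is a small self-contained sketch that inlines the uid2 rows from the question. The alias t and the output column names running_total and full_total are assumptions made for this illustration only:

```sql
-- The uid2 rows from the question, inlined so that no table is needed.
select user_id
     , ts
     , val
       -- ORDER BY with the default frame: running total up to the current row.
     , sum(val) over (partition by user_id order by ts) as running_total
       -- ORDER BY with an explicit full frame: the partition total on every row.
     , sum(val) over (
         partition by user_id
         order by ts
         rows between unbounded preceding and unbounded following
       ) as full_total
from (values
        ('uid2', timestamp '2019-05-19 15:01:44', 0)
      , ('uid2', timestamp '2019-05-19 15:05:55', 0)
      , ('uid2', timestamp '2019-05-19 17:19:35', 1)
      , ('uid2', timestamp '2019-05-19 17:20:01', 0)
      , ('uid2', timestamp '2019-05-19 17:20:35', 0)
      , ('uid2', timestamp '2019-05-19 19:15:50', 1)
     ) as t(user_id, ts, val)
order by ts;
```

running_total should read 0, 0, 1, 1, 1, 2, matching the uid2 rows in the question, while full_total is 2 on every row. Note that a ROWS frame without any ORDER BY accumulates in whatever order the rows happen to arrive, so a cumulative sum computed that way is not guaranteed to follow ts.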
