Background
I am working in a very restricted T-SQL environment, where one may only define the "body" of a VIEW: presumably the ... in
CREATE VIEW My_View AS ...
under the hood. Here is my @@VERSION (https://learn.microsoft.com/en-us/sql/t-sql/functions/version-transact-sql-configuration-functions?view=sql-server-linux-ver15):
Microsoft SQL Server 2019 (RTM-CU19) (KB5023049) - 15.0.4298.1 (X64)
Jan 27 2023 16:44:09
Copyright (C) 2019 Microsoft Corporation
Web Edition (64-bit) on Linux (Amazon Linux 2)
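I should note that these tables are synced from flat files, so no formal schema from the original source is preserved. That is, all "functional dependencies" are merely inferred (albeit reliably) from column names and business concepts.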
Problem
Suppose I have the following table My_Measures...

| Person_ID | Name  | Measure |
|-----------|-------|---------|
| 1         | Greg  | 0       |
| 1         | Greg  | 10      |
| 2         | Devon | 20      |
| 2         | Devon | 30      |

...where Name is functionally dependent on Person_ID.
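For anyone who wants to reproduce this outside the restricted environment, the sample above could be set up roughly as follows. This is only a sketch: the column types (INT, VARCHAR(50)) and the NOT NULL constraints are assumptions and are not taken from the real tables. The final query is one possible way to sanity-check the inferred dependency; it should return no rows when every Person_ID maps to exactly one Name.

```sql
-- Hypothetical repro of the sample table.
-- Column types and NOT NULL constraints are assumptions, not the real schema.
CREATE TABLE My_Measures (
    Person_ID INT         NOT NULL,
    Name      VARCHAR(50) NOT NULL,
    Measure   INT         NOT NULL
);

INSERT INTO My_Measures (Person_ID, Name, Measure)
VALUES
    (1, 'Greg',   0),
    (1, 'Greg',  10),
    (2, 'Devon', 20),
    (2, 'Devon', 30);

-- Sanity check for the inferred dependency Person_ID -> Name:
-- lists every Person_ID that appears with more than one distinct Name.
-- COUNT(DISTINCT ...) ignores NULLs, so a nullable Name column
-- would need extra handling.
SELECT
    Person_ID
FROM
    My_Measures
GROUP BY
    Person_ID
HAVING
    COUNT(DISTINCT Name) > 1;
```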
Typical
Now suppose I wish to aggregate the Measure into various summary statistics for each person. This is simple in SQL...
SELECT
Person_ID,
MIN(Measure) AS Min_Measure,
MAX(Measure) AS Max_Measure,
AVG(Measure) AS Avg_Measure
FROM
My_Measures
GROUP BY
Person_ID
...and yields the following results:

| Person_ID | Min_Measure | Max_Measure | Avg_Measure |
|-----------|-------------|-------------|-------------|
| 1         | 0           | 10          | 5           |
| 2         | 20          | 30          | 25          |
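Twist
But suppose I want to include the Name alongside each Person_ID, like so:

| Person_ID | Name  | Min_Measure | Max_Measure | Avg_Measure |
|-----------|-------|-------------|-------------|-------------|
| 1         | Greg  | 0           | 10          | 5           |
| 2         | Devon | 20          | 30          | 25          |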
Clearly, the following attempt...
SELECT
Person_ID,
-- ⌄⌄⌄⌄⌄
Name,
-- ^^^^^
MIN(Measure) AS Min_Measure,
MAX(Measure) AS Max_Measure,
AVG(Measure) AS Avg_Measure
FROM
My_Measures
GROUP BY
Person_ID
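...will fail with the following error:
Column 'My_Measures.Name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.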
Attempts
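I have found several unsatisfactory approaches that produce the desired output.
(1) GROUP BY the Dependent Variables
One approach is to GROUP BY the Name column after Person_ID; more generally, to append the dependent variables to the end of the GROUP BY clause: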
SELECT
Person_ID,
-- ⌄⌄⌄⌄⌄
Name,
-- ^^^^^
MIN(Measure) AS Min_Measure,
MAX(Measure) AS Max_Measure,
AVG(Measure) AS Avg_Measure
FROM
My_Measures
GROUP BY
-- ⌄⌄⌄⌄⌄⌄
Person_ID, Name
-- ^^^^^^
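This leaves the grouping unchanged, since the "true" grouping variable (here Person_ID) already defines it, and the dependent variables merely "tag along". However, it wastes performance on processing (arbitrarily many) dependent variables, which may be more complex (CHAR strings for Name) for indexing purposes.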
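(2) "Aggregate" the Dependent Variables
Another approach is to "aggregate" the Name column with some function (e.g. MIN()) that gives us one representative value (e.g. 'Greg') from the many identical duplicates (e.g. ('Greg', 'Greg')).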
SELECT
Person_ID,
-- ⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄
MIN(Name) AS Name,
-- ^^^^^^^^^^^^^^^^^^^^^
MIN(Measure) AS Min_Measure,
MAX(Measure) AS Max_Measure,
AVG(Measure) AS Avg_Measure
FROM
My_Measures
GROUP BY
Person_ID
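This likewise achieves the desired result, but it likewise wastes performance on computing an aggregate over many identical values. Furthermore, it works only for values that are comparable and thus have a MIN(); it will clearly fail for non-comparable datatypes.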
(3) Re-JOIN After Aggregating
Perhaps the most disappointing approach is to simply compute the aggregates, and then reassociate each Person_ID with its Name via a JOIN:
-- Aggregate by ID.
WITH agg AS(
SELECT
Person_ID,
MIN(Measure) AS Min_Measure,
MAX(Measure) AS Max_Measure,
AVG(Measure) AS Avg_Measure
FROM
My_Measures
GROUP BY
Person_ID
-- Deduplicate names for the JOIN. Given functional dependency, DISTINCT suffices.
), msr AS (
SELECT DISTINCT
Person_ID,
Name
FROM My_Measures
-- Reassociate the names with their IDs.
) SELECT
agg.Person_ID AS Person_ID,
-- ⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄
msr.Name AS Name,
-- ^^^^^^^^^^^^^^^^^^^^^^^^
agg.Min_Measure AS Min_Measure,
agg.Max_Measure AS Max_Measure,
agg.Avg_Measure AS Avg_Measure
FROM
-- ⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄
agg INNER JOIN msr
-- ^^^^^^^^^^^^^^
ON agg.Person_ID = msr.Person_ID
Clearly, this wastes considerable resources on an unnecessary JOIN and multiple CTEs, all to recover data (like Name) that we originally had!
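(4) Take the FIRST_VALUE() Over a PARTITION
I searched T-SQL for some equivalent to the first() (https://dplyr.tidyverse.org/reference/nth.html) function in R. In SQL, such a FIRST() would simply pick the very first value ('Greg') from the many identical duplicates (('Greg', 'Greg')) in a GROUP, without requiring any expensive computation. Furthermore, this would work regardless of comparability.
I stumbled upon the FIRST_VALUE() (https://learn.microsoft.com/en-us/sql/t-sql/functions/first-value-transact-sql?view=sql-server-linux-ver15) function, but this seems to require a PARTITION for each usage, and, given my relative inexperience with optimizing PARTITIONs, I worry about the performance implications if many dependent variables must be SELECTed.
It also looks ugly. ¯\_(ツ)_/¯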
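For concreteness, here is a rough sketch of what the FIRST_VALUE() approach might look like on My_Measures. FIRST_VALUE() is a window function rather than an aggregate, so it cannot simply take the place of MIN(Name) in the GROUP BY query; this sketch therefore assumes that the other statistics are also computed as window aggregates and that the repeated rows are then collapsed with DISTINCT. The ORDER BY Person_ID inside OVER is there only because T-SQL requires an ORDER BY for FIRST_VALUE(); it leaves the choice of row within each partition arbitrary, which is harmless only if the functional dependency really holds. Its performance is untested here, and DISTINCT itself relies on comparable types.

```sql
-- Sketch only: one window per person, then DISTINCT to collapse duplicates.
SELECT DISTINCT
    Person_ID,
    -- ORDER BY is mandatory for FIRST_VALUE(); ordering by the partition
    -- key makes every row a tie, so an arbitrary row's Name is returned.
    FIRST_VALUE(Name) OVER (PARTITION BY Person_ID ORDER BY Person_ID) AS Name,
    MIN(Measure) OVER (PARTITION BY Person_ID) AS Min_Measure,
    MAX(Measure) OVER (PARTITION BY Person_ID) AS Max_Measure,
    AVG(Measure) OVER (PARTITION BY Person_ID) AS Avg_Measure
FROM
    My_Measures;
```

Each additional dependent variable would need its own FIRST_VALUE(...) OVER (...) expression, which is the repetition that prompts the performance worry above.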
Question
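What is the best way to SELECT an arbitrary set of dependent variables (like Name) alongside the grouping variables (like Person_ID)? Please prioritize performance, but also consider elegance and canonicity, and finally extensibility: this should ideally work for all datatypes, even non-comparable ones.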