Imagine I have a table like this:
CREATE TABLE time_series (
    snapshot_date DATE,
    sales INTEGER,
    PRIMARY KEY (snapshot_date)
);
with values like these:
INSERT INTO time_series SELECT '2017-01-01'::DATE AS snapshot_date, 10 AS sales;
INSERT INTO time_series SELECT '2017-01-02'::DATE AS snapshot_date, 4 AS sales;
INSERT INTO time_series SELECT '2017-01-03'::DATE AS snapshot_date, 13 AS sales;
INSERT INTO time_series SELECT '2017-01-04'::DATE AS snapshot_date, 7 AS sales;
INSERT INTO time_series SELECT '2017-01-05'::DATE AS snapshot_date, 15 AS sales;
INSERT INTO time_series SELECT '2017-01-06'::DATE AS snapshot_date, 8 AS sales;
I would like to be able to do this:
SELECT a.snapshot_date,
       AVG(b.sales) AS sales_avg,
       COUNT(*) AS count
FROM time_series AS a
JOIN time_series AS b
  ON a.snapshot_date > b.snapshot_date
GROUP BY a.snapshot_date;
which produces a result like this:
*---------------*-----------*-------*
| snapshot_date | sales_avg | count |
*---------------*-----------*-------*
| 2017-01-02    | 10.0      | 1     |
| 2017-01-03    | 7.0       | 2     |
| 2017-01-04    | 9.0       | 3     |
| 2017-01-05    | 8.5       | 4     |
| 2017-01-06    | 9.8       | 5     |
*---------------*-----------*-------*
For a small number of rows (as in this example), the query runs very fast. The problem is that I have to run it over millions of rows, and on Redshift (whose syntax is similar to Postgres) my query takes days to complete. It is extremely slow, yet this is one of my most common query patterns. I suspect the cause is that the self-join grows O(n^2) with the data, where O(n) should be achievable.
My O(n) implementation in Python would be something like this:
rows = [('2017-01-01', 10),
        ('2017-01-02', 4),
        ('2017-01-03', 13),
        ('2017-01-04', 7),
        ('2017-01-05', 15),
        ('2017-01-06', 8)]

sales_total_previous = 0
count = 0
for index, row in enumerate(rows):
    snapshot_date = row[0]
    sales = row[1]
    if index == 0:
        sales_total_previous += sales
        continue
    count += 1
    sales_avg = sales_total_previous / count
    print((snapshot_date, sales_avg, count))
    sales_total_previous += sales
which gives the following result (identical to the SQL query):
('2017-01-02', 10.0, 1)
('2017-01-03', 7.0, 2)
('2017-01-04', 9.0, 3)
('2017-01-05', 8.5, 4)
('2017-01-06', 9.8, 5)
I am considering switching to Apache Spark so that I can run exactly that Python computation, but a few million rows really isn't that big (3-4 GB at most), and a Spark cluster with 100 GB of RAM seems like overkill. Is there an efficient, readable way to get O(n) efficiency in SQL, ideally in Postgres / Redshift?
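For what it's worth, one candidate I have been looking at is a window function with an explicit frame, which should let the engine make a single ordered pass instead of a self-join. This is a sketch, not a verified Redshift solution; it uses Python's built-in sqlite3 (which supports window functions since SQLite 3.25) purely as a quick local harness, and the `cnt > 0` filter is my way of dropping the first date, which has no preceding rows:

```python
# Sketch: replace the O(n^2) self-join with a window function whose frame
# covers all rows strictly before the current one. Demonstrated via sqlite3;
# the same SELECT is intended for Postgres / Redshift (untested there).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE time_series (snapshot_date DATE PRIMARY KEY, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO time_series VALUES (?, ?)",
    [("2017-01-01", 10), ("2017-01-02", 4), ("2017-01-03", 13),
     ("2017-01-04", 7), ("2017-01-05", 15), ("2017-01-06", 8)],
)

query = """
SELECT snapshot_date, sales_avg, cnt
FROM (
    SELECT snapshot_date,
           AVG(sales) OVER (ORDER BY snapshot_date
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                     AND 1 PRECEDING) AS sales_avg,
           COUNT(sales) OVER (ORDER BY snapshot_date
                              ROWS BETWEEN UNBOUNDED PRECEDING
                                       AND 1 PRECEDING) AS cnt
    FROM time_series
) t
WHERE cnt > 0          -- drop the first date, which has no preceding rows
ORDER BY snapshot_date
"""
for row in conn.execute(query):
    print(row)
```

On the sample data this prints the same five tuples as the Python loop above, e.g. `('2017-01-02', 10.0, 1)` through `('2017-01-06', 9.8, 5)`.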