My answer below adapts https://dzone.com/articles/how-to-find-the-longest-consecutive-series-of-even for use in Spark SQL. You would wrap each SQL query like this:
spark.sql("""
SQL_QUERY
""")
So for the first query:
CREATE TABLE intermediate_1 AS
SELECT
id,
time,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS rn,
time - ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS grp
FROM purchase
This gives you:
id, time, rn, grp
1, 1, 1, 0
1, 2, 2, 0
1, 3, 3, 0
2, 1, 1, 0
2, 3, 2, 1
2, 4, 3, 1
2, 5, 4, 1
2, 10, 5, 5
2, 11, 6, 5
3, 1, 1, 0
3, 4, 2, 2
3, 9, 3, 6
3, 11, 4, 7
As you can see, consecutive rows end up with the same grp value. We then use GROUP BY and COUNT to get the length of each consecutive run.
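The `time - ROW_NUMBER()` trick can be sketched in plain Python (this is just an illustration using the example rows above, not part of the Spark pipeline):

```python
from itertools import groupby

# (id, time) pairs from the example purchase table above
purchase = [
    (1, 1), (1, 2), (1, 3),
    (2, 1), (2, 3), (2, 4), (2, 5), (2, 10), (2, 11),
    (3, 1), (3, 4), (3, 9), (3, 11),
]

rows = []
for pid, partition in groupby(sorted(purchase), key=lambda r: r[0]):
    # rn restarts at 1 within each id partition, ordered by time,
    # mirroring ROW_NUMBER() OVER (PARTITION BY id ORDER BY time)
    for rn, (_, time) in enumerate(partition, start=1):
        rows.append((pid, time, rn, time - rn))

# consecutive times within an id share the same grp = time - rn
```

For a run of consecutive times, `time` and `rn` both increase by 1 per row, so their difference stays constant; any gap in `time` shifts the difference and starts a new group.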
CREATE TABLE intermediate_2 AS
SELECT
id,
grp,
COUNT(*) AS num_consecutive
FROM intermediate_1
GROUP BY id, grp
This returns:
id, grp, num_consecutive
1, 0, 3
2, 0, 1
2, 1, 3
2, 5, 2
3, 0, 1
3, 2, 1
3, 6, 1
3, 7, 1
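In plain Python, this GROUP BY id, grp with COUNT(*) is just a tally over the (id, grp) keys (again only a sketch using the example data above):

```python
from collections import Counter

# (id, grp) pairs taken from the intermediate_1 output above
intermediate_1 = [
    (1, 0), (1, 0), (1, 0),
    (2, 0), (2, 1), (2, 1), (2, 1), (2, 5), (2, 5),
    (3, 0), (3, 2), (3, 6), (3, 7),
]

# COUNT(*) ... GROUP BY id, grp == count of each (id, grp) key
num_consecutive = Counter(intermediate_1)
```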
Now we just use MAX and GROUP BY to get the longest consecutive run per id.
CREATE TABLE final AS
SELECT
id,
MAX(num_consecutive) AS max_consecutive
FROM intermediate_2
GROUP BY id
This gives you:
id, max_consecutive
1, 3
2, 3
3, 1
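The final MAX ... GROUP BY id step, sketched the same way (hypothetical plain-Python equivalent, using the intermediate_2 rows above):

```python
# (id, num_consecutive) pairs from the intermediate_2 output above
intermediate_2 = [
    (1, 3),
    (2, 1), (2, 3), (2, 2),
    (3, 1), (3, 1), (3, 1), (3, 1),
]

# MAX(num_consecutive) GROUP BY id: keep the largest count seen per id
max_consecutive = {}
for pid, n in intermediate_2:
    max_consecutive[pid] = max(max_consecutive.get(pid, 0), n)
```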
Hope this helps!