以下是我的输入数据集:
df = spark.createDataFrame([ \
("0","CattyCat","B2K","B"), \
("0","CattyCat","B3L","I"), \
("0","CattyCat","B3U","I"), \
("0","CattyCat","D3J","C"), \
("0","CattyCat","J1N","H"), \
("0","CattyCat","K7A","I"), \
("0","CattyCat","L1B","D"), \
("0","CattyCat","U3F","B"), \
("1","CattyCat","B2K","I"), \
("1","CattyCat","B3L","I"), \
("1","CattyCat","B3U","I"), \
("1","CattyCat","D3J","C"), \
("1","CattyCat","J1N","H"), \
("1","CattyCat","K7A","I"), \
("1","CattyCat","L1B","D"), \
("1","CattyCat","U3F","B"), \
("2","CattyCat","B2K","B"), \
("2","CattyCat","B3L","B"), \
("2","CattyCat","B3U","I"), \
("2","CattyCat","D3J","C"), \
("2","CattyCat","J1N","H"), \
("2","CattyCat","K7A","I"), \
("2","CattyCat","L1B","D"), \
("2","CattyCat","U3F","B"), \
], ["RowCount","CatName","Name","Value"])
df.show(30)
+--------+--------+----+-----+
|RowCount| CatName|Name|Value|
+--------+--------+----+-----+
| 0|CattyCat| B2K| B|
| 0|CattyCat| B3L| I|
| 0|CattyCat| B3U| I|
| 0|CattyCat| D3J| C|
| 0|CattyCat| J1N| H|
| 0|CattyCat| K7A| I|
| 0|CattyCat| L1B| D|
| 0|CattyCat| U3F| B|
| 1|CattyCat| B2K| I|
| 1|CattyCat| B3L| I|
| 1|CattyCat| B3U| I|
| 1|CattyCat| D3J| C|
| 1|CattyCat| J1N| H|
| 1|CattyCat| K7A| I|
| 1|CattyCat| L1B| D|
| 1|CattyCat| U3F| B|
| 2|CattyCat| B2K| B|
| 2|CattyCat| B3L| B|
| 2|CattyCat| B3U| I|
| 2|CattyCat| D3J| C|
| 2|CattyCat| J1N| H|
| 2|CattyCat| K7A| I|
| 2|CattyCat| L1B| D|
| 2|CattyCat| U3F| B|
+--------+--------+----+-----+
我的目标是对这些数据进行透视\交叉制表。我能够使用 groupby.pivot.agg 实现此目的,如下所示:
import pyspark.sql.functions as F
display(df.groupBy("RowCount","CatName").pivot("Name").agg(F.first("value")))
+----------+----------+-----+-----+-----+-----+-----+-----+-----+-----+
| RowCount | CatName | B2K | B3L | B3U | D3J | J1N | K7A | L1B | U3F |
+----------+----------+-----+-----+-----+-----+-----+-----+-----+-----+
| 0 | CattyCat | B | I | I | C | H | I | D | B |
+----------+----------+-----+-----+-----+-----+-----+-----+-----+-----+
| 1 | CattyCat | I | I | I | C | H | I | D | B |
+----------+----------+-----+-----+-----+-----+-----+-----+-----+-----+
| 2 | CattyCat | B | B | I | C | H | I | D | B |
+----------+----------+-----+-----+-----+-----+-----+-----+-----+-----+
但我面临的问题是,当数据集很大(数百或数百万)时,性能非常差。 (单个执行器最后阶段的单个任务,卡了几个小时)
P.S:我还发现数据透视表还可以采用第二个参数,该参数可以是一系列列名称,这可能会提供更好的性能。但不幸的是我无法提前知道这些列名称。
有没有办法以更好的性能方式执行此“交叉选项卡”?