First, let's redefine the mapping so it is grouped by id and returns a MapType Column (toolz — https://github.com/pytoolz/toolz — is convenient here, but can be replaced with itertools.chain)*:
from toolz import concat, interleave
from pyspark.sql.functions import col, create_map, lit, struct

# Create literal column from id to sensor -> channel map
channel_map = create_map(*concat((lit(k), v) for k, v in sensor_channel_df
    .groupby("id")
    # Create map Column from literal label to channel
    .apply(lambda grp: create_map(*interleave([
        map(lit, grp["sensor"]),
        map(col, grp["channel"])])))
    .to_dict()
    .items()))
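As noted, the toolz helpers map directly onto itertools; a minimal sketch of drop-in replacements (the helper names mirror toolz but the definitions are ours):

from itertools import chain

def concat(seqs):
    # Flatten one level of nesting, like toolz.concat
    return chain.from_iterable(seqs)

def interleave(seqs):
    # Alternate elements across equal-length sequences; toolz.interleave
    # also handles unequal lengths, which we don't need here
    return chain.from_iterable(zip(*seqs))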
Next, get the list of sensors:
sensors = sorted(sensor_channel_df["sensor"].unique().tolist())
and combine the data columns:
df = spark.createDataFrame(data_df)
data_cols = struct(*[c for c in df.columns if c != "id"])
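With the example data, df.columns is ['id', 'chan1', 'chan2', 'chan3'] (see the outputs below), so data_cols packs the channel columns into a single struct column; a Row of that struct can be indexed by channel name, which is what the udf variant further down relies on:

row = df.select(data_cols.alias("vals")).first()["vals"]
row["chan1"]  # value of the first channel in the first row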
The components defined above can now be combined:
cols = [channel_map[col("id")][sensor].alias(sensor) for sensor in sensors]
df.select(["id"] + cols).show()
+---+------------------+------------------+------------------+------------------+------------------+
| id| acceleration| speed| temp| torque| weight|
+---+------------------+------------------+------------------+------------------+------------------+
| a| null| null| 8.712929970154072|5.4881350392732475| 0.0|
| a| null| null| 2.021839744032572| 7.151893663724195| 1.0|
| a| null| null| 83.2619845547938| 6.027633760716439| 2.0|
| a| null| null| 77.81567509498505| 5.448831829968968| 3.0|
| a| null| null| 87.00121482468191| 4.236547993389047| 4.0|
| b| null| 97.8618342232764| 6.458941130666561| null| 5.0|
| b| null| 79.91585642167236| 4.375872112626925| null| 6.0|
| b| null|46.147936225293186| 8.917730007820797| null| 7.0|
| b| null| 78.05291762864555| 9.636627605010293| null| 8.0|
| b| null|11.827442586893323|3.8344151882577773| null| 9.0|
| c| 63.99210213275238| null| 10.0| null| 7.917250380826646|
| c| 14.33532874090464| null| 11.0| null| 5.288949197529044|
| c| 94.46689170495839| null| 12.0| null| 5.680445610939323|
| c|52.184832175007166| null| 13.0| null| 9.25596638292661|
| c| 41.46619399905236| null| 14.0| null|0.7103605819788694|
+---+------------------+------------------+------------------+------------------+------------------+
Although less efficient, it is also possible to use a udf:
from toolz import concat, unique
from pyspark.sql.types import *
from pyspark.sql.functions import udf
channel_dict = (sensor_channel_df
    .groupby("id")
    .apply(lambda grp: dict(zip(grp["sensor"], grp["channel"])))
    .to_dict())
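# For the example data this is a plain dict of dicts, matching the
# channel_map structure shown later:
# {'a': {'weight': 'chan1', 'torque': 'chan2', 'temp': 'chan3'},
#  'b': {'weight': 'chan1', 'temp': 'chan2', 'speed': 'chan3'},
#  'c': {'temp': 'chan1', 'weight': 'chan2', 'acceleration': 'chan3'}}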
def remap(d):
    # Collect every sensor name across all groups and build a struct schema
    # with one DoubleType field per sensor
    fields = sorted(unique(concat(_.keys() for _ in d.values())))
    schema = StructType([StructField(f, DoubleType()) for f in fields])
    def _(row, id):
        # For each sensor, look up its channel in this id's mapping and pull
        # the value from the struct row; emit None where the sensor is absent
        return tuple(float(row[d[id].get(f)]) if d[id].get(f) is not None
                     else None for f in fields)
    return udf(_, schema)
(df
    .withColumn("vals", remap(channel_dict)(data_cols, "id"))
    .select("id", "vals.*")
    .show())
+---+------------------+------------------+------------------+------------------+------------------+
| id| acceleration| speed| temp| torque| weight|
+---+------------------+------------------+------------------+------------------+------------------+
| a| null| null| 8.712929970154072|5.4881350392732475| 0.0|
| a| null| null| 2.021839744032572| 7.151893663724195| 1.0|
| a| null| null| 83.2619845547938| 6.027633760716439| 2.0|
| a| null| null| 77.81567509498505| 5.448831829968968| 3.0|
| a| null| null| 87.00121482468191| 4.236547993389047| 4.0|
| b| null| 97.8618342232764| 6.458941130666561| null| 5.0|
| b| null| 79.91585642167236| 4.375872112626925| null| 6.0|
| b| null|46.147936225293186| 8.917730007820797| null| 7.0|
| b| null| 78.05291762864555| 9.636627605010293| null| 8.0|
| b| null|11.827442586893323|3.8344151882577773| null| 9.0|
| c| 63.99210213275238| null| 10.0| null| 7.917250380826646|
| c| 14.33532874090464| null| 11.0| null| 5.288949197529044|
| c| 94.46689170495839| null| 12.0| null| 5.680445610939323|
| c|52.184832175007166| null| 13.0| null| 9.25596638292661|
| c| 41.46619399905236| null| 14.0| null|0.7103605819788694|
+---+------------------+------------------+------------------+------------------+------------------+
In Spark 2.3 or later you can express the same logic with a vectorized UDF (https://stackoverflow.com/a/47497815/6910411).
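For illustration, a minimal sketch of that idea using a GROUPED_MAP pandas_udf, reusing sensors and channel_dict defined above (the name remap_group and the exact schema layout are our assumptions, not part of the linked answer):

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Output schema: the grouping key plus one DoubleType field per sensor
out_schema = StructType(
    [StructField("id", StringType())]
    + [StructField(s, DoubleType()) for s in sensors])

@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def remap_group(pdf):
    # Every row in pdf shares the same id, so a single lookup suffices
    mapping = channel_dict[pdf["id"].iloc[0]]
    out = pdf[["id"]].copy()
    for s in sensors:
        chan = mapping.get(s)
        # Sensors absent from this group's mapping are filled with NaN
        out[s] = pdf[chan].astype("float64") if chan is not None else float("nan")
    return out

df.groupby("id").apply(remap_group)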
* To understand what is going on here, let's look at a single group, as seen by apply:
grp = sensor_channel_df.groupby("id").get_group("a")
First we convert the sensor column to a sequence of Spark literal Columns (think of these as constant values):
keys = list(map(lit, grp["sensor"]))
keys
[Column<b'weight'>, Column<b'torque'>, Column<b'temp'>]
and the channel column to a sequence of Spark Columns (think of these as pointers to the data):
values = list(map(col, grp["channel"]))
values
[Column<b'chan1'>, Column<b'chan2'>, Column<b'chan3'>]
When evaluated in context, the former results in constant output:
df_ = df.drop_duplicates(subset=["id"])
df_.select(keys).show()
+------+------+----+
|weight|torque|temp|
+------+------+----+
|weight|torque|temp|
|weight|torque|temp|
|weight|torque|temp|
+------+------+----+
while the latter repeats the data:
df_.select(values).show(3)
+-----+------------------+-----------------+
|chan1| chan2| chan3|
+-----+------------------+-----------------+
| 10| 7.917250380826646|63.99210213275238|
| 5| 6.458941130666561| 97.8618342232764|
| 0|5.4881350392732475|8.712929970154072|
+-----+------------------+-----------------+
Next we interleave the two and combine them into a MapType column:
mapping = create_map(*interleave([keys, values]))
mapping
Column<b'map(weight, chan1, torque, chan2, temp, chan3)'>
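For intuition, interleave alternates elements across its input sequences, producing exactly the key, value, key, value, ... argument order that create_map expects:

list(interleave([["weight", "torque", "temp"], ["chan1", "chan2", "chan3"]]))
# ['weight', 'chan1', 'torque', 'chan2', 'temp', 'chan3']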
This gives us a mapping from metric name to data column (think of a Python dict), and when evaluated:
df_.select(mapping).show(3, False)
+---------------------------------------------------------------------------+
|map(weight, chan1, torque, chan2, temp, chan3) |
+---------------------------------------------------------------------------+
|Map(weight -> 10.0, torque -> 7.917250380826646, temp -> 63.99210213275238)|
|Map(weight -> 5.0, torque -> 6.458941130666561, temp -> 97.8618342232764) |
|Map(weight -> 0.0, torque -> 5.4881350392732475, temp -> 8.712929970154072)|
+---------------------------------------------------------------------------+
Finally, the outer comprehension repeats this for all groups, so channel_map is a Column:
Column<b'map(a, map(weight, chan1, torque, chan2, temp, chan3), b, map(weight, chan1, temp, chan2, speed, chan3), c, map(temp, chan1, weight, chan2, acceleration, chan3))'>
Evaluating it gives the following structure:
df_.select(channel_map.alias("channel_map")).show(3, False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Map(a -> Map(weight -> 10.0, torque -> 7.917250380826646, temp -> 63.99210213275238), b -> Map(weight -> 10.0, temp -> 7.917250380826646, speed -> 63.99210213275238), c -> Map(temp -> 10.0, weight -> 7.917250380826646, acceleration -> 63.99210213275238))|
|Map(a -> Map(weight -> 5.0, torque -> 6.458941130666561, temp -> 97.8618342232764), b -> Map(weight -> 5.0, temp -> 6.458941130666561, speed -> 97.8618342232764), c -> Map(temp -> 5.0, weight -> 6.458941130666561, acceleration -> 97.8618342232764)) |
|Map(a -> Map(weight -> 0.0, torque -> 5.4881350392732475, temp -> 8.712929970154072), b -> Map(weight -> 0.0, temp -> 5.4881350392732475, speed -> 8.712929970154072), c -> Map(temp -> 0.0, weight -> 5.4881350392732475, acceleration -> 8.712929970154072))|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Finally, we use the id column to select the map of interest:
df_.select(channel_map[col("id")].alias("data_mapping")).show(3, False)
+---------------------------------------------------------------------------------+
|data_mapping |
+---------------------------------------------------------------------------------+
|Map(temp -> 10.0, weight -> 7.917250380826646, acceleration -> 63.99210213275238)|
|Map(weight -> 5.0, temp -> 6.458941130666561, speed -> 97.8618342232764) |
|Map(weight -> 0.0, torque -> 5.4881350392732475, temp -> 8.712929970154072) |
+---------------------------------------------------------------------------------+
and a column name to extract the values from that map:
df_.select(channel_map[col("id")]["weight"].alias("weight")).show(3, False)
+-----------------+
|weight |
+-----------------+
|7.917250380826646|
|5.0 |
|0.0 |
+-----------------+
At the end of the day, this is just a series of simple transformations on data structures containing symbolic expressions.