I was also solving this problem in Python, so here is a port of Ramesh's solution to Python:
df = spark.createDataFrame([(['Pear','Orange','Apple'], ['Chicken','Pork','Beef'])],
("Fruits", "Meat"))
df.show(1,False)
from pyspark.sql.functions import col, udf
mergeCols = udf(lambda fruits, meat: fruits + meat)
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(1,False)
Output:
+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+
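Kudos to Ramesh!
EDIT: Note that you may have to specify the UDF's return type explicitly. I am not sure why it worked for me without an explicit type in some cases; in others I got a string-typed column, which matches udf's default return type of StringType when none is given.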
from pyspark.sql.types import ArrayType, StringType
mergeCols = udf(lambda fruits, meat: fruits + meat, ArrayType(StringType()))
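One thing the lambda above does not handle is a null in either column: Spark passes a null cell to a Python UDF as None, and None + list raises a TypeError. A minimal sketch of a null-safe variant follows. The name merge_lists is hypothetical (it is not part of Ramesh's answer), and treating a null as an empty list is an assumption about the behavior you want.

```python
def merge_lists(fruits, meat):
    """Concatenate two list values, treating None (a null cell) as an empty list."""
    # `x or []` substitutes an empty list when x is None (assumed desired behavior).
    return (fruits or []) + (meat or [])


# Plain-Python check of the logic; no Spark session is needed for this part.
print(merge_lists(['Pear', 'Orange'], ['Chicken']))
print(merge_lists(['Pear', 'Orange'], None))

# To use it in Spark, wrap it with an explicit return type, for example:
#   mergeCols = udf(merge_lists, ArrayType(StringType()))
```

If you would rather get a null result when either input is null, return None in that case instead of substituting an empty list.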