PySpark: combine dataframes of different lengths without duplication

2024-01-26

I have these three dataframes:

id | name
------------------------
1  | {"value": "bob"}
1  | {"value": "Robert"}
2  | {"value": "Mary"}

id | dob
----------------------------
1  | {"value": "21-04-1988"}
2  | {"value": null}

id | country
--------------------
1  | {"value": "IT"}
1  | {"value": "DE"}
2  | {"value": "FR"}
2  | {"value": "ES"}

I want to combine them without duplicating information:

id | name                | dob                     | country
----------------------------------------------------------------------
1  | {"value": "bob"}    | {"value": "21-04-1988"} | {"value": "IT"}
1  | {"value": "Robert"} | null                    | {"value": "DE"}
2  | {"value": "Mary"}   | {"value": null}         | {"value": "FR"}
2  | null                | null                    | {"value": "ES"}

I tried chaining multiple outer joins, but that does not produce the table above.

name = spark.createDataFrame(
    [
        (1, {"value" : "bob"}),  # create your data here, be consistent in the types.
        (1, {"value" : "Robert"}),
        (2, {"value" : "Mary"})
    ],
    ["id", "name"]  # add your column names here
)

dob = spark.createDataFrame(
    [
        (1, {"value" : "21-04-1988"}),  # create your data here, be consistent in the types.
        (2, {"value" : None})
    ],
    ["id", "dob"]  # add your column names here
)

country = spark.createDataFrame(
    [
        (1, {"value" : "IT"}),  # create your data here, be consistent in the types.
        (1, {"value" : "DE"}),
        (2, {"value" : "FR"}),
        (2, {"value" : "ES"}),
    ],
    ["id", "country"]  # add your column names here
)


(name.join(dob, "id", "outer").join(country, "id", "outer")).show()

produces this:

id | name               | dob                     | country
---------------------------------------------------------------
1  | {"value":"Robert"} | {"value":"21-04-1988"}  | {"value":"DE"}
1  | {"value":"Robert"} | {"value":"21-04-1988"}  | {"value":"IT"}
1  | {"value":"bob"}    | {"value":"21-04-1988"}  | {"value":"DE"}
1  | {"value":"bob"}    | {"value":"21-04-1988"}  | {"value":"IT"}
2  | {"value":"Mary"}   | {"value":null}          | {"value":"ES"}
2  | {"value":"Mary"}   | {"value":null}          | {"value":"FR"}

Now, I understand this is exactly how a full outer join works: within each id, every row from one side is matched against every row from the other. But I don't need that extra duplicated information; per id, I only need as many rows as the longest of the three dataframes has.
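
A quick sanity check on the dataframes defined above shows the duplication is just the per-id cross product:

# For id 1, name has 2 rows and country has 2 rows,
# so the outer join yields 2 x 2 = 4 rows for that id.
name.join(country, "id", "outer").filter("id = 1").count()  # 4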

Any clues?


You can add a row-number column rn to all three dataframes using row_number(), then use it together with id as the join condition:

from pyspark.sql import functions as F, Window

w = Window.partitionBy("id").orderBy(F.lit(None)) # change this if you have some column to use for ordering

name = name.withColumn("rn", F.row_number().over(w))
dob = dob.withColumn("rn", F.row_number().over(w))
country = country.withColumn("rn", F.row_number().over(w))

result = (name.join(dob, ["id", "rn"], "full")
          .join(country, ["id", "rn"], "full")
          .drop("rn")
          )

result.show(truncate=False)
#+---+-----------------+---------------------+-------------+
#|id |name             |dob                  |country      |
#+---+-----------------+---------------------+-------------+
#|1  |{value -> bob}   |{value -> 21-04-1988}|{value -> IT}|
#|1  |{value -> Robert}|null                 |{value -> DE}|
#|2  |{value -> Mary}  |{value -> null}      |{value -> FR}|
#|2  |null             |null                 |{value -> ES}|
#+---+-----------------+---------------------+-------------+
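
Note that ordering the window by F.lit(None) assigns row numbers in whatever order Spark happens to scan the rows, so which name ends up paired with which country within an id is arbitrary. If your real data carries an ordering column, say a hypothetical created_at timestamp (not present in the sample data above), using it makes the pairing deterministic:

w = Window.partitionBy("id").orderBy("created_at")  # "created_at" is a hypothetical ordering column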