为什么我的构建挂起/需要很长时间才能生成包含许多联合的查询计划？

2023-12-21

我注意到当我运行与示例相同的代码时here https://stackoverflow.com/questions/68474926/why-do-i-see-repeated-materializations-of-a-dataframe-in-my-build但有一个union or unionByName or unionAll而不是join，我的查询计划需要显著地更长，并可能导致驱动程序 OOM。

此处包含的代码仅供参考，与内部发生的内容略有不同for() loop.

from pyspark.sql import types as T, functions as F, SparkSession
spark = SparkSession.builder.getOrCreate()

schema = T.StructType([
  T.StructField("col_1", T.IntegerType(), False),
  T.StructField("col_2", T.IntegerType(), False),
  T.StructField("measure_1", T.FloatType(), False),
  T.StructField("measure_2", T.FloatType(), False),
])
data = [
  {"col_1": 1, "col_2": 2, "measure_1": 0.5, "measure_2": 1.5},
  {"col_1": 2, "col_2": 3, "measure_1": 2.5, "measure_2": 3.5}
]

df = spark.createDataFrame(data, schema)

right_schema = T.StructType([
  T.StructField("col_1", T.IntegerType(), False)
])
right_data = [
  {"col_1": 1},
  {"col_1": 1},
  {"col_1": 2},
  {"col_1": 2}
]
right_df = spark.createDataFrame(right_data, right_schema)

df = df.unionByName(df)
df = df.join(right_df, on="col_1")
df.show()

"""
+-----+-----+---------+---------+
|col_1|col_2|measure_1|measure_2|
+-----+-----+---------+---------+
|    1|    2|      0.5|      1.5|
|    1|    2|      0.5|      1.5|
|    1|    2|      0.5|      1.5|
|    1|    2|      0.5|      1.5|
|    2|    3|      2.5|      3.5|
|    2|    3|      2.5|      3.5|
|    2|    3|      2.5|      3.5|
|    2|    3|      2.5|      3.5|
+-----+-----+---------+---------+
"""

df.explain()

"""
== Physical Plan ==
*(6) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803]
+- *(6) SortMergeJoin [col_1#1800], [col_1#1808], Inner
   :- *(3) Sort [col_1#1800 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#5454]
   :     +- Union
   :        :- *(1) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
   :        +- *(2) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
   +- *(5) Sort [col_1#1808 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#5460]
         +- *(4) Scan ExistingRDD[col_1#1808]
"""

filter_union_cols = ["col_1", "measure_1", "col_2", "measure_2"]
df = df.withColumn("found_filter", F.lit(None))
for filter_col in filter_union_cols:
  stats = df.filter(F.col(filter_col) < F.lit(1)).drop("found_filter")
  df = df.unionByName(
    stats.select(
      "*",
      F.lit(filter_col).alias("found_filter")
    )
  )

df.show()

"""
+-----+-----+---------+---------+------------+                                  
|col_1|col_2|measure_1|measure_2|found_filter|
+-----+-----+---------+---------+------------+
|    1|    2|      0.5|      1.5|        null|
|    1|    2|      0.5|      1.5|        null|
|    1|    2|      0.5|      1.5|        null|
|    1|    2|      0.5|      1.5|        null|
|    2|    3|      2.5|      3.5|        null|
|    2|    3|      2.5|      3.5|        null|
|    2|    3|      2.5|      3.5|        null|
|    2|    3|      2.5|      3.5|        null|
|    1|    2|      0.5|      1.5|   measure_1|
|    1|    2|      0.5|      1.5|   measure_1|
|    1|    2|      0.5|      1.5|   measure_1|
|    1|    2|      0.5|      1.5|   measure_1|
+-----+-----+---------+---------+------------+
"""

df.explain()

# REALLY long query plan.....

"""
== Physical Plan ==
Union
:- *(6) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, null AS found_filter#1855]
:  +- *(6) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(3) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7637]
:     :     +- Union
:     :        :- *(1) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(2) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(5) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7643]
:           +- *(4) Scan ExistingRDD[col_1#1808]
:- *(12) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, col_1 AS found_filter#1860]
:  +- *(12) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(9) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7654]
:     :     +- Union
:     :        :- *(7) Filter (col_1#1800 < 1)
:     :        :  +- *(7) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(8) Filter (col_1#1800 < 1)
:     :           +- *(8) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(11) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7660]
:           +- *(10) Filter (col_1#1808 < 1)
:              +- *(10) Scan ExistingRDD[col_1#1808]
:- *(18) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, measure_1 AS found_filter#1880]
:  +- *(18) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(15) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7671]
:     :     +- Union
:     :        :- *(13) Filter (measure_1#1802 < 1.0)
:     :        :  +- *(13) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(14) Filter (measure_1#1802 < 1.0)
:     :           +- *(14) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(17) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7643]
:- *(24) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, measure_1 AS found_filter#2022]
:  +- *(24) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(21) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7688]
:     :     +- Union
:     :        :- *(19) Filter ((col_1#1800 < 1) AND (measure_1#1802 < 1.0))
:     :        :  +- *(19) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(20) Filter ((col_1#1800 < 1) AND (measure_1#1802 < 1.0))
:     :           +- *(20) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(23) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7660]
:- *(30) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, col_2 AS found_filter#1900]
:  +- *(30) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(27) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7705]
:     :     +- Union
:     :        :- *(25) Filter (col_2#1801 < 1)
:     :        :  +- *(25) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(26) Filter (col_2#1801 < 1)
:     :           +- *(26) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(29) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7643]
:- *(36) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, col_2 AS found_filter#2023]
:  +- *(36) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(33) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7722]
:     :     +- Union
:     :        :- *(31) Filter ((col_1#1800 < 1) AND (col_2#1801 < 1))
:     :        :  +- *(31) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(32) Filter ((col_1#1800 < 1) AND (col_2#1801 < 1))
:     :           +- *(32) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(35) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7660]
:- *(42) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, col_2 AS found_filter#2024]
:  +- *(42) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(39) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7739]
:     :     +- Union
:     :        :- *(37) Filter ((measure_1#1802 < 1.0) AND (col_2#1801 < 1))
:     :        :  +- *(37) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(38) Filter ((measure_1#1802 < 1.0) AND (col_2#1801 < 1))
:     :           +- *(38) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(41) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7643]
:- *(48) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, col_2 AS found_filter#2028]
:  +- *(48) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(45) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7756]
:     :     +- Union
:     :        :- *(43) Filter (((col_1#1800 < 1) AND (measure_1#1802 < 1.0)) AND (col_2#1801 < 1))
:     :        :  +- *(43) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(44) Filter (((col_1#1800 < 1) AND (measure_1#1802 < 1.0)) AND (col_2#1801 < 1))
:     :           +- *(44) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(47) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7660]
:- *(54) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, measure_2 AS found_filter#1920]
:  +- *(54) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(51) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7773]
:     :     +- Union
:     :        :- *(49) Filter (measure_2#1803 < 1.0)
:     :        :  +- *(49) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(50) Filter (measure_2#1803 < 1.0)
:     :           +- *(50) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(53) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7643]
:- *(60) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, measure_2 AS found_filter#2025]
:  +- *(60) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(57) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7790]
:     :     +- Union
:     :        :- *(55) Filter ((col_1#1800 < 1) AND (measure_2#1803 < 1.0))
:     :        :  +- *(55) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(56) Filter ((col_1#1800 < 1) AND (measure_2#1803 < 1.0))
:     :           +- *(56) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(59) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7660]
:- *(66) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, measure_2 AS found_filter#2026]
:  +- *(66) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(63) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7807]
:     :     +- Union
:     :        :- *(61) Filter ((measure_1#1802 < 1.0) AND (measure_2#1803 < 1.0))
:     :        :  +- *(61) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(62) Filter ((measure_1#1802 < 1.0) AND (measure_2#1803 < 1.0))
:     :           +- *(62) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(65) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7643]
:- *(72) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, measure_2 AS found_filter#2029]
:  +- *(72) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(69) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7824]
:     :     +- Union
:     :        :- *(67) Filter (((col_1#1800 < 1) AND (measure_1#1802 < 1.0)) AND (measure_2#1803 < 1.0))
:     :        :  +- *(67) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(68) Filter (((col_1#1800 < 1) AND (measure_1#1802 < 1.0)) AND (measure_2#1803 < 1.0))
:     :           +- *(68) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(71) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7660]
:- *(78) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, measure_2 AS found_filter#2027]
:  +- *(78) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(75) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7841]
:     :     +- Union
:     :        :- *(73) Filter ((col_2#1801 < 1) AND (measure_2#1803 < 1.0))
:     :        :  +- *(73) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(74) Filter ((col_2#1801 < 1) AND (measure_2#1803 < 1.0))
:     :           +- *(74) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(77) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7643]
:- *(84) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, measure_2 AS found_filter#2030]
:  +- *(84) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(81) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7858]
:     :     +- Union
:     :        :- *(79) Filter (((col_1#1800 < 1) AND (col_2#1801 < 1)) AND (measure_2#1803 < 1.0))
:     :        :  +- *(79) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(80) Filter (((col_1#1800 < 1) AND (col_2#1801 < 1)) AND (measure_2#1803 < 1.0))
:     :           +- *(80) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(83) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7660]
:- *(90) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, measure_2 AS found_filter#2031]
:  +- *(90) SortMergeJoin [col_1#1800], [col_1#1808], Inner
:     :- *(87) Sort [col_1#1800 ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7875]
:     :     +- Union
:     :        :- *(85) Filter (((measure_1#1802 < 1.0) AND (col_2#1801 < 1)) AND (measure_2#1803 < 1.0))
:     :        :  +- *(85) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     :        +- *(86) Filter (((measure_1#1802 < 1.0) AND (col_2#1801 < 1)) AND (measure_2#1803 < 1.0))
:     :           +- *(86) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
:     +- *(89) Sort [col_1#1808 ASC NULLS FIRST], false, 0
:        +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7643]
+- *(96) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803, measure_2 AS found_filter#2032]
   +- *(96) SortMergeJoin [col_1#1800], [col_1#1808], Inner
      :- *(93) Sort [col_1#1800 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#7892]
      :     +- Union
      :        :- *(91) Filter ((((col_1#1800 < 1) AND (measure_1#1802 < 1.0)) AND (col_2#1801 < 1)) AND (measure_2#1803 < 1.0))
      :        :  +- *(91) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
      :        +- *(92) Filter ((((col_1#1800 < 1) AND (measure_1#1802 < 1.0)) AND (col_2#1801 < 1)) AND (measure_2#1803 < 1.0))
      :           +- *(92) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
      +- *(95) Sort [col_1#1808 ASC NULLS FIRST], false, 0
         +- ReusedExchange [col_1#1808], Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#7660]
"""

我在这里看到一个明显更长的查询计划，特别是随着for()循环增加，性能急剧下降。

我怎样才能提高我的表现？

这是 Spark 中迭代算法的一个已知限制。目前，循环的每次迭代都会导致内部节点重新评估并堆叠在外部节点上df多变的。

这意味着您的查询规划过程正在进行O(exp(n))其中 n 是循环的迭代次数。

Palantir Foundry 中有一个名为 Transforms Verbs 的工具可以帮助解决这个问题。

只需导入即可transforms.verbs.dataframes.union_many并根据您希望实现的数据帧的总集调用它（假设您的逻辑允许这样做，即循环的一次迭代不依赖于循环的前一迭代的结果。

上面的代码应该修改为：

from pyspark.sql import types as T, functions as F, SparkSession
from transforms.verbs.dataframes import union_many

spark = SparkSession.builder.getOrCreate()

schema = T.StructType([
  T.StructField("col_1", T.IntegerType(), False),
  T.StructField("col_2", T.IntegerType(), False),
  T.StructField("measure_1", T.FloatType(), False),
  T.StructField("measure_2", T.FloatType(), False),
])
data = [
  {"col_1": 1, "col_2": 2, "measure_1": 0.5, "measure_2": 1.5},
  {"col_1": 2, "col_2": 3, "measure_1": 2.5, "measure_2": 3.5}
]

df = spark.createDataFrame(data, schema)

right_schema = T.StructType([
  T.StructField("col_1", T.IntegerType(), False)
])
right_data = [
  {"col_1": 1},
  {"col_1": 1},
  {"col_1": 2},
  {"col_1": 2}
]
right_df = spark.createDataFrame(right_data, right_schema)

df = df.unionByName(df)
df = df.join(right_df, on="col_1")
df.show()

"""
+-----+-----+---------+---------+
|col_1|col_2|measure_1|measure_2|
+-----+-----+---------+---------+
|    1|    2|      0.5|      1.5|
|    1|    2|      0.5|      1.5|
|    1|    2|      0.5|      1.5|
|    1|    2|      0.5|      1.5|
|    2|    3|      2.5|      3.5|
|    2|    3|      2.5|      3.5|
|    2|    3|      2.5|      3.5|
|    2|    3|      2.5|      3.5|
+-----+-----+---------+---------+
"""

df.explain()

"""
== Physical Plan ==
*(6) Project [col_1#1800, col_2#1801, measure_1#1802, measure_2#1803]
+- *(6) SortMergeJoin [col_1#1800], [col_1#1808], Inner
   :- *(3) Sort [col_1#1800 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(col_1#1800, 200), ENSURE_REQUIREMENTS, [id=#5454]
   :     +- Union
   :        :- *(1) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
   :        +- *(2) Scan ExistingRDD[col_1#1800,col_2#1801,measure_1#1802,measure_2#1803]
   +- *(5) Sort [col_1#1808 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(col_1#1808, 200), ENSURE_REQUIREMENTS, [id=#5460]
         +- *(4) Scan ExistingRDD[col_1#1808]
"""

filter_union_cols = ["col_1", "measure_1", "col_2", "measure_2"]
df = df.withColumn("found_filter", F.lit(None))
union_dfs = []
for filter_col in filter_union_cols:
  stats = df.filter(F.col(filter_col) < F.lit(1)).drop("found_filter")
  union_df = stats.select(
    "*",
    F.lit(filter_col).alias("found_filter")
  )
  union_dfs += [union_df]

df = df.unionByName(
  union_many(union_dfs)
)

这将优化您的工会并显着减少时间。

底线：谨防使用任何union在 for/while 循环内部调用。如果您必须使用此行为，请使用transforms.verbs.dataframes.union_many动词来优化您的最终数据帧集

查看您的平台文档以获取更多信息和更有用的动词。

专业提示：使用包含的优化here https://stackoverflow.com/questions/68474926/why-do-i-see-repeated-materializations-of-a-dataframe-in-my-build进一步提高您的表现

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

为什么我的构建挂起/需要很长时间才能生成包含许多联合的查询计划？的相关文章

使用 PySpark 从 azure blob 存储读取 csv 文件

我正在尝试使用 Microsoft Azure 上的 PySpark HDInsight 集群来做一个机器学习项目要在我的集群上进行操作请使用 Jupyter 笔记本另外我的数据一个 csv 文件存储在 Azure Blob 存
从 pySpark 中的字典构建一行

我正在尝试在 pySpark 1 6 1 中动态构建一行然后将其构建到数据帧中总体思路是扩展结果describe例如包括偏斜和峰度这是我认为应该起作用的 from pyspark sql import Row row dict C0
使用 pyspark awsglue 时显示 DataFrame

如何使用 awsglue 的 job etl 显示 DataFrame 我尝试了下面的代码但没有显示任何内容 df show code datasource0 glueContext create dynamic frame from c
我可以使用 dask 创建 multivariate_normal 矩阵吗？

有点相关这个帖子 https stackoverflow com questions 52337612 random multivariate normal on a dask array 我正在尝试复制multivariate norma
Pyspark - 一次聚合数据帧的所有列[重复]

这个问题在这里已经有答案了我想将数据框分组到单个列上然后对所有列应用聚合函数例如我有一个包含 10 列的 df 我希望对第一列 1 进行分组然后对所有剩余列均为数字应用聚合函数 sum 与此等效的 R 是 summarise
如何在 pySpark 数据框中添加行 ID [重复]

这个问题在这里已经有答案了我有一个 csv 文件我在 pyspark 中将其转换为 DataFrame df 经过一番改造后我想在 df 中添加一列这应该是简单的行 ID 从 0 或 1 开始到 N 我将 df 转换为 rdd 并使
指定 Parquet 属性 pyspark

如何在 PySpark 中指定 Parquet 块大小和页面大小我到处搜索但找不到任何有关函数调用或导入库的文档根据火花用户档案 https mail archives apache org mod mbox spark user 2
从 Spark 数据帧中过滤大量 ID

我有一个大型数据框其格式类似于 ID Cat date 12 A 201602 14 B 201601 19 A 201608 12 F 201605 11 G 201603 我需要根据大约 500 万个 Is 的列表来过滤行最直接的方
SparkSession 初始化需要很长时间

SparkSession 初始化需要很长时间才能成功这是我的代码 import findspark findspark init import pyspark from pyspark sql import SparkSession sp
使用 Glue 将数据输入到 AWS Elastic Search

我正在寻找使用 AWS Glue python 或 pyspark 将数据插入 AWS Elastic Search 的解决方案我见过用于 Elastic Search 的 Boto3 SDK 但找不到任何将数据插入 Elastic Se
我如何判断我的 Spark 工作是否有进展？

我有一个正在运行的 Spark 作业YARN它似乎只是挂起并且没有进行任何计算这是当我这样做时纱线所说的yarn application status
使用 Spark pandas_udf 创建列，具有动态数量的输入列

我有这个 df df spark createDataFrame row a 5 0 0 0 11 0 row b 3394 0 0 0 4543 0 row c 136111 0 0 0 219255 0 row d 0 0 0 0 0
获取：导入 Spark 模块时出错：没有名为“pyspark.streaming.kafka”的模块

我需要将从 pyspark 脚本创建的日志推送到 kafka 我正在做 POC 所以在 Windows 机器上使用 Kafka 二进制文件我的版本是 kafka 2 4 0 spark 3 0 和 python 3 8 1 我正在使用 p
如何在Spark结构化流中指定批处理间隔？

我正在使用 Spark 结构化流并遇到问题在 StreamingContext DStreams 中我们可以定义批处理间隔如下所示 from pyspark streaming import StreamingContext ssc
Pyspark：相当于 np.where [重复]

这个问题在这里已经有答案了这个操作在 Pyspark 中相当于什么 import pandas as pd import numpy as np df pd DataFrame Type list ABBC Set list ZZXY d
如何使用 PySpark 有效地将这么多 csv 文件（大约 130,000 个）合并到一个大型数据集中？

我之前发布了这个问题并得到了一些使用 PySpark 的建议如何有效地将这一大数据集合并到一个大数据框中 https stackoverflow com questions 60259271 how can i merge this la
如何找到特定 Spark 配置属性的值？

如何在我的 Spark 代码中找到 Spark 配置的值例如我想找到spark sql shuffle partitions的值并在我的代码中引用它以下代码将返回所有值 spark sparkContext getConf getAl
Spark：连接两个相同分区的数据帧时防止洗牌/交换

我有两个数据框df1 and df2我想在一个名为的高基数字段上多次加入这些表visitor id 我只想执行一次初始洗牌并让所有连接发生而无需在 Spark 执行器之间洗牌交换数据为此我创建了另一个名为visitor parti
在 Pandas UDF PySpark 中传递多列

我想计算 PySpark DataFrame 两列之间的 Jaro Winkler 距离 Jaro Winkler 距离可通过所有节点上的 pyjarowinkler 包获得 pyjarowinkler 的工作原理如下 from pyjar
SQL 类似于 PySpark 数据帧的 NOT IN 子句

例如在 SQL 中我们可以这样做select from table where col1 not in A B 我想知道是否有一个与此等效的 PySpark 我能够找到isin类似于 SQL 的函数IN条款但没有任何内容NOT IN

随机推荐

减少“pixmap”时如何自动将 PySide6 应用程序调整为最小大小？

我使用以下代码来设置PySide6应用程序的大小尽可能小当增加小部件的大小时这可以很好地工作设置 to 在第 37 行但在减小它时则不然实际上窗口的大小does减少但似乎晚了一步我找到了一些解决方法最值得注意的是Qt 布局
如何在 ImageView 中单击时全屏显示图像？

我需要在 ImageView 中单击时全屏打开一张图像就像带有一张图像的画廊一样我该怎么做呢这是一个基本示例布局activity main xml
如何在不使用任何 View 类的情况下将分页库与 Room Dao 一起使用

我有一个带有类的 android 服务没有 UI FooManager需要获取所有Foo数据库表中的对象行 foo使用 room 和 androidx 分页库我发现的所有示例都展示了如何使用 PagedListAdapter 和 Re
安装 anaconda3 后，黑色格式化程序在 VSCode 中不起作用

设置 json python pythonPath Users brandonwie opt anaconda3 bin python python editor tabSize 4 python languageServer Pylanc
将 Apache Hadoop 数据输出存储到 Mysql 数据库

我需要将map reduce程序的输出存储到数据库中有什么办法吗如果是这样是否可以根据要求将输出存储到多个列和表中请给我建议一些解决方案谢谢展示了一个很好的例子在这个博客上 http archanaschangale wordp
对数据进行数字签名，而不是对文档进行数字签名[关闭]

Closed 这个问题是无关 help closed questions 目前不接受答案什么构成了合法的数字签名web form 不是文件 OPTION 1 我参与了一个项目医生记录患者的健康状况提交 Web 表单后将生成 PD
Mongodb：如何索引多个嵌套文本字段？

我的数据库中有以下类型的 json id 519817e508a16b447c00020e keyword Just an example query results 1 base domain example1 com href http
为什么 Windows 给出 sqlite3.OperationalError 而 linux 没有？

问题我有一个程序使用风暴0 14 http storm canonical com它在 Windows 上给了我这个错误 sqlite3 OperationError database table is locked 问题是在linux
在 Objective-c 中，我们如何拥有一个作用域为整个类（但不包括子类）的变量

我创建了大量的 UITableViewCell 子类其中每个都是附加 BGBaseTableViewCell 类的子类当被问及高度时我认为告诉用户 UITableViewCell 的典型尺寸是多少会很有用一种方法是指定两次第一次是
Spring Data REST 的 QueryDSL 集成可以用于执行更复杂的查询吗？

我目前正在构建一个 REST API 希望客户端能够轻松过滤特定实体的大多数属性使用QueryDSL http querydsl com 结合春季数据休息 http projects spring io spring data rest
如何从终端打开 Xcode？

我注意到我当前的 bash 文件有export PATH PATH Applications MAMP library bin 我把它放在那里是为了设置对 mamp 的终端访问我一直在尝试编译 a MyApp a xcodeproj op
为什么改变joinColumn的顺序，hibernate会返回正确或不正确的查询？

PROBLEM 为什么改变joinColumn的顺序 hibernate会返回正确或不正确的查询首次加入 OneToOne JoinTable name TTR POA UPA joinColumns JoinColumn name AN
使用 AJAX 删除多个条目

我可以使用 AJAX 一次删除一个条目现在我可以使用复选框选择条目并同时删除它们这在没有 AJAX 的情况下也能很好地工作现在我尝试使用 AJAX 删除多个条目这是我到目前为止所拥有的阿贾克斯 function checkbox
MS Test 从项目构建输出路径随机执行

在调查一个问题时如果使用测试资源管理器中的全部运行某些单元测试会失败我发现如果单独运行或全部的其他子集运行它们成功了因为他们是未部署到新的 test Out 文件夹在调试模块窗口中进行验证测试失败的问题是缺少组件我设法解决
jqGrid 自定义 onCellSelect 在SaveCell 之前不触发

我正在里面写一些代码单元格选择执行得很好 onCellSelect function rowid iCol cellcontent if iCol gt 0 gridMain d jqGrid resetSelection gridMai
未处理的异常编译错误：ClassNotFoundException

我知道这是一个常见问题但尽管添加mysql connector对于我在 IntelliJ IDEA 中的库 IDE 似乎仍然无法找到该类正如您在下面的屏幕截图中看到的我添加了mysql connectorjar 作为我的项目的库但似
Sublime 找不到 Node.js

我在另一个 Sublime 软件包中也遇到过同样的问题我的 Mac 上有些东西配置不正确我安装了 CSSCombnpm install g csscomb 我使用 Sublime 的 Package Installer 安装了 CssC
Windows Phone C# 中的 XML 反序列化

我正在开发新闻门户的 WP 应用程序我需要解析各种提要标准提要如下所示
如何在类构造函数中绑定对堆栈分配类的引用？

我有这门课 class HumanA public HumanA std string name Weapon weapon HumanA void inline void attack void std cout lt lt name l
为什么我的构建挂起/需要很长时间才能生成包含许多联合的查询计划？

我注意到当我运行与示例相同的代码时here https stackoverflow com questions 68474926 why do i see repeated materializations of a dataframe i

为什么我的构建挂起/需要很长时间才能生成包含许多联合的查询计划？

为什么我的构建挂起/需要很长时间才能生成包含许多联合的查询计划？ 的相关文章

随机推荐

热门标签

为什么我的构建挂起/需要很长时间才能生成包含许多联合的查询计划？的相关文章