您可以明确标记DataFrame
小到足以进行广播
使用broadcast
功能:
Python:
from pyspark.sql.functions import broadcast
small_df = ...
large_df = ...
large_df.join(broadcast(small_df), ["foo"])
或广播提示(Spark >= 2.2):
large_df.join(small_df.hint("broadcast"), ["foo"])
Scala:
import org.apache.spark.sql.functions.broadcast
val smallDF: DataFrame = ???
val largeDF: DataFrame = ???
largeDF.join(broadcast(smallDF), Seq("foo"))
或广播提示(Spark >= 2.2):
largeDF.join(smallDF.hint("broadcast"), Seq("foo"))
SQL
您可以使用提示(火花 >= 2.2 https://issues.apache.org/jira/browse/SPARK-16475):
SELECT /*+ MAPJOIN(small) */ *
FROM large JOIN small
ON large.foo = small.foo
or
SELECT /*+ BROADCASTJOIN(small) */ *
FROM large JOIN small
ON large.foo = small.foo
or
SELECT /*+ BROADCAST(small) */ *
FROM large JOIN small
ON larger.foo = small.foo
R(火花R):
With hint
(火花 >= 2.2):
join(large, hint(small, "broadcast"), large$foo == small$foo)
With broadcast
(火花 >= 2.3)
join(large, broadcast(small), large$foo == small$foo)
Note:
如果其中一个结构相对较小,则广播连接很有用。否则它可能比完全洗牌要贵得多。