I did this in Scala, so you will need to convert it, but I think it is a simpler approach. I added a key and performed the work at key level; you can adapt that and aggregate. The principle is much simpler: no correlated subquery is needed, just relational calculus, with numbers standing in for dates and the like.
// SCALA
// Slightly ambiguous on hols vs. weekend, as you stated treated as 1
import spark.implicits._
import org.apache.spark.sql.functions._
// e = employee, d = day number, w = weekend flag, h = holiday flag
val dfE = Seq(
("NIC", 1, false, false),
("NIC", 2, false, false),
("NIC", 3, true, false),
("NIC", 4, true, true),
("NIC", 5, false, false),
("NIC", 6, false, false),
("XYZ", 1, false, true)
).toDF("e","d","w", "h")
//dfE.show(false)
// Collapse the two flags into a single 0/1 indicator: weekend OR holiday counts as 1
val dfE2 = dfE.withColumn("wh", when($"w" or $"h", 1).otherwise(0)).drop("w", "h")
//dfE2.show()
//Assuming more dfD's can exist
// e = employee, pd = period start day, dd = period end day, k = synthetic key per range
val dfD = Seq(
("NIC", 1, 4, "k1"),
("NIC", 2, 3, "k2"),
("NIC", 1, 1, "k3"),
("NIC", 7, 10, "k4")
).toDF("e","pd","dd", "k")
//dfD.show(false)
dfE2.createOrReplaceTempView("E2")
dfD.createOrReplaceTempView("D1")
// This is done per record; if aggregating over identical keys, strip k and aggregate instead. I added k so each entry can be checked.
// The point is that this is far easier: the key gives a synthetic grouping.
val q = spark.sql(""" SELECT d1.k, d1.e, d1.pd, d1.dd, sum(e2.wh)
FROM D1, E2
WHERE D1.e = E2.e
AND E2.d >= D1.pd
AND E2.d <= D1.dd
GROUP BY d1.k, d1.e, d1.pd, d1.dd
ORDER BY d1.k, d1.e, d1.pd, d1.dd
""")
q.show
returns:
+---+---+---+---+-------+
| k| e| pd| dd|sum(wh)|
+---+---+---+---+-------+
| k1|NIC| 1| 4| 2|
| k2|NIC| 2| 3| 1|
| k3|NIC| 1| 1| 0|
+---+---+---+---+-------+
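Note that range k4 (NIC, days 7 to 10) does not appear in the output: the inner join finds no matching E2 days, so the group vanishes. If ranges with no coverage should still show a count of 0, a LEFT JOIN variant works (a sketch against the same temp views; the alias `sum_wh` is my own naming):

```scala
val q2 = spark.sql(""" SELECT d1.k, d1.e, d1.pd, d1.dd,
       coalesce(sum(e2.wh), 0) AS sum_wh
FROM D1
LEFT JOIN E2
  ON D1.e = E2.e
 AND E2.d >= D1.pd
 AND E2.d <= D1.dd
GROUP BY d1.k, d1.e, d1.pd, d1.dd
ORDER BY d1.k, d1.e, d1.pd, d1.dd
""")
q2.show
```

With this variant, k4 is retained with a 0 in place of the null that `sum` would otherwise produce for an unmatched range.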
I think a simple performance improvement is possible; in fact, nothing correlated is needed at all.
If preferred, `AND E2.d BETWEEN D1.pd AND D1.dd` can be used in place of the two range comparisons.
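The same non-correlated join can also be written directly in the DataFrame API, avoiding the temp views entirely (a sketch assuming the `dfE2` and `dfD` frames above; `res` and `sum_wh` are my own names):

```scala
import org.apache.spark.sql.functions._

val res = dfD
  .join(dfE2,
    dfD("e") === dfE2("e") &&
    dfE2("d").between(dfD("pd"), dfD("dd")))   // same range predicate as the SQL
  .groupBy(dfD("k"), dfD("e"), dfD("pd"), dfD("dd"))
  .agg(sum("wh").as("sum_wh"))
  .orderBy("k")
res.show(false)
```

Catalyst compiles both forms to the same plan, so the choice between SQL and the DataFrame API here is purely stylistic.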