You can use the lag function to get the previous value.
If you want the rows ordered by month, you need to convert the month string to a proper date first, e.g. "JAN-2017"
to "01-01-2017".
Something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, udf}
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(
("c1", "JAN-2017", 49),
("c1", "FEB-2017", 46),
("c1", "MAR-2017", 83),
("c2", "JAN-2017", 59),
("c2", "MAY-2017", 60),
("c2", "JUN-2017", 49),
("c2", "JUL-2017", 73)
)).toDF("city", "month", "sales")
val window = Window.partitionBy("city").orderBy("month")
df.withColumn("previous_sale", lag($"sales", 1, null).over(window)).show
Output:
+----+--------+-----+-------------+
|city|   month|sales|previous_sale|
+----+--------+-----+-------------+
|  c1|FEB-2017|   46|         null|
|  c1|JAN-2017|   49|           46|
|  c1|MAR-2017|   83|           49|
|  c2|JAN-2017|   59|         null|
|  c2|JUL-2017|   73|           59|
|  c2|JUN-2017|   49|           73|
|  c2|MAY-2017|   60|           49|
+----+--------+-----+-------------+
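The scrambled row order above happens because orderBy("month") compares the strings lexicographically, so "FEB-2017" sorts before "JAN-2017". A quick check outside Spark:

```scala
// Lexicographic sort of month strings: alphabetical, not chronological
val sortedMonths = List("JAN-2017", "FEB-2017", "MAR-2017").sorted
println(sortedMonths)
```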
You can use this UDF to build a full date, defaulting the day to the 1st of the month; sorting on it stays correct even when the years differ:
import java.sql.Date
import java.time.LocalDate

val fullDate = udf((value: String) => {
  val months = List("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
  val Array(mon, year) = value.split("-")
  // indexOf is zero-based, LocalDate months are 1-based; day defaults to the 1st
  Date.valueOf(LocalDate.of(year.toInt, months.indexOf(mon) + 1, 1))
})
df.withColumn("month", fullDate($"month")).show()
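With the converted column, a window ordered by that date gives JAN → FEB → MAR within each city. The same lag-over-ordered-partitions logic can be sketched in plain Scala collections (the helper monthKey here is illustrative, not part of the Spark API):

```scala
val months = List("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

// Sort key equivalent to the UDF's date: (year, month index)
def monthKey(value: String): (Int, Int) = {
  val Array(mon, year) = value.split("-")
  (year.toInt, months.indexOf(mon))
}

val rows = Seq(
  ("c1", "JAN-2017", 49), ("c1", "FEB-2017", 46), ("c1", "MAR-2017", 83),
  ("c2", "JAN-2017", 59), ("c2", "MAY-2017", 60), ("c2", "JUN-2017", 49), ("c2", "JUL-2017", 73)
)

// partitionBy("city").orderBy(date) + lag(sales, 1), expressed with collections
val withPrev = rows
  .groupBy(_._1)
  .flatMap { case (_, group) =>
    val sorted = group.sortBy(r => monthKey(r._2))
    val prev = None +: sorted.map(r => Some(r._3)).init   // lag by one row, null -> None
    sorted.zip(prev).map { case ((city, month, sales), p) => (city, month, sales, p) }
  }
  .toList
```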
Hope this helps!