Spark 结构化流是否可以实现正确的事件时间会话?

2024-03-15

一直在玩 Spark Structured Streaming 和mapGroupsWithState(具体如下结构化会话化 https://github.com/apache/spark/blob/v2.3.1/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scalaSpark 源代码中的示例)。我想确认一些我认为存在的限制mapGroupsWithState鉴于我的用例。

出于我的目的,会话是用户的一组不间断活动,这样两个按时间顺序(按事件时间,而不是处理时间)排序的事件之间的间隔不会超过开发人员定义的某个持续时间(通常为 30 分钟)。

在进入代码之前,一个示例会有所帮助:

{"event_time": "2018-01-01T00:00:00", "user_id": "mike"}
{"event_time": "2018-01-01T00:01:00", "user_id": "mike"}
{"event_time": "2018-01-01T00:05:00", "user_id": "mike"}
{"event_time": "2018-01-01T00:45:00", "user_id": "mike"}

对于上面的流,会话定义为 30 分钟不活动时间。在流式传输环境中,我们应该以一个会话结束(第二个会话尚未完成):

[
  {
    "user_id": "mike",
    "startTimestamp": "2018-01-01T00:00:00",
    "endTimestamp": "2018-01-01T00:05:00"
  }
]

现在考虑以下 Spark 驱动程序:

import java.sql.Timestamp

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

object StructuredSessionizationV2 {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("StructredSessionizationRedux")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    import spark.implicits._

    implicit val ctx = spark.sqlContext
    val input = MemoryStream[String]

    val EVENT_SCHEMA = new StructType()
      .add($"event_time".string)
      .add($"user_id".string)

    val events = input.toDS()
      .select(from_json($"value", EVENT_SCHEMA).alias("json"))
      .select($"json.*")
      .withColumn("event_time", to_timestamp($"event_time"))
      .withWatermark("event_time", "1 hours")
    events.printSchema()

    val sessionized = events
      .groupByKey(row => row.getAs[String]("user_id"))
      .mapGroupsWithState[SessionState, SessionOutput](GroupStateTimeout.EventTimeTimeout) {
      case (userId: String, events: Iterator[Row], state: GroupState[SessionState]) =>
        println(s"state update for user ${userId} (current watermark: ${new Timestamp(state.getCurrentWatermarkMs())})")
        if (state.hasTimedOut) {
          println(s"User ${userId} has timed out, sending final output.")
          val finalOutput = SessionOutput(
            userId = userId,
            startTimestampMs = state.get.startTimestampMs,
            endTimestampMs = state.get.endTimestampMs,
            durationMs = state.get.durationMs,
            expired = true
          )
          // Drop this user's state
          state.remove()
          finalOutput
        } else {
          val timestamps = events.map(_.getAs[Timestamp]("event_time").getTime).toSeq
          println(s"User ${userId} has new events (min: ${new Timestamp(timestamps.min)}, max: ${new Timestamp(timestamps.max)}).")
          val newState = if (state.exists) {
            println(s"User ${userId} has existing state.")
            val oldState = state.get
            SessionState(
              startTimestampMs = math.min(oldState.startTimestampMs, timestamps.min),
              endTimestampMs = math.max(oldState.endTimestampMs, timestamps.max)
            )
          } else {
            println(s"User ${userId} has no existing state.")
            SessionState(
              startTimestampMs = timestamps.min,
              endTimestampMs = timestamps.max
            )
          }
          state.update(newState)
          state.setTimeoutTimestamp(newState.endTimestampMs, "30 minutes")
          println(s"User ${userId} state updated. Timeout now set to ${new Timestamp(newState.endTimestampMs + (30 * 60 * 1000))}")
          SessionOutput(
            userId = userId,
            startTimestampMs = state.get.startTimestampMs,
            endTimestampMs = state.get.endTimestampMs,
            durationMs = state.get.durationMs,
            expired = false
          )
        }
      }

    val eventsQuery = sessionized
      .writeStream
      .queryName("events")
      .outputMode("update")
      .format("console")
      .start()

    input.addData(
      """{"event_time": "2018-01-01T00:00:00", "user_id": "mike"}""",
      """{"event_time": "2018-01-01T00:01:00", "user_id": "mike"}""",
      """{"event_time": "2018-01-01T00:05:00", "user_id": "mike"}"""
    )
    input.addData(
      """{"event_time": "2018-01-01T00:45:00", "user_id": "mike"}"""
    )
    eventsQuery.processAllAvailable()
  }

  case class SessionState(startTimestampMs: Long, endTimestampMs: Long) {
    def durationMs: Long = endTimestampMs - startTimestampMs
  }

  case class SessionOutput(userId: String, startTimestampMs: Long, endTimestampMs: Long, durationMs: Long, expired: Boolean)
}

该程序的输出是:

root
 |-- event_time: timestamp (nullable = true)
 |-- user_id: string (nullable = true)

state update for user mike (current watermark: 1969-12-31 19:00:00.0)
User mike has new events (min: 2018-01-01 00:00:00.0, max: 2018-01-01 00:05:00.0).
User mike has no existing state.
User mike state updated. Timeout now set to 2018-01-01 00:35:00.0
-------------------------------------------
Batch: 0
-------------------------------------------
+------+----------------+--------------+----------+-------+
|userId|startTimestampMs|endTimestampMs|durationMs|expired|
+------+----------------+--------------+----------+-------+
|  mike|   1514782800000| 1514783100000|    300000|  false|
+------+----------------+--------------+----------+-------+

state update for user mike (current watermark: 2017-12-31 23:05:00.0)
User mike has new events (min: 2018-01-01 00:45:00.0, max: 2018-01-01 00:45:00.0).
User mike has existing state.
User mike state updated. Timeout now set to 2018-01-01 01:15:00.0
-------------------------------------------
Batch: 1
-------------------------------------------
+------+----------------+--------------+----------+-------+
|userId|startTimestampMs|endTimestampMs|durationMs|expired|
+------+----------------+--------------+----------+-------+
|  mike|   1514782800000| 1514785500000|   2700000|  false|
+------+----------------+--------------+----------+-------+

根据我的会话定义,第二批中的单个事件should触发会话状态到期,从而触发新会话。然而,由于水印(2017-12-31 23:05:00.0)尚未超过状态超时时间(2018-01-01 00:35:00.0),状态未过期,并且事件被错误地添加到现有会话中,尽管自上一批中的最新时间戳以来已经过去了 30 多分钟。

我认为会话状态过期的唯一方法是,如果在批次中收到足够多的来自不同用户的事件,以将水印提前超过状态超时时间mike.

我想人们也可能会弄乱流的水印,但我想不出如何做到这一点来完成我的用例。

这准确吗?我是否遗漏了如何在 Spark 中正确执行基于事件时间的会话化的任何内容?


如果水印间隔大于会话间隙持续时间,您提供的实现似乎不起作用。

对于您所展示的逻辑,您需要将水印间隔设置为

如果您确实希望水印间隔独立于(或大于)会话间隙持续时间,则需要等到水印经过(水印+间隙)才能使状态过期。合并逻辑似乎是盲目地合并窗口。这应该在合并之前考虑间隙持续时间。

本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

Spark 结构化流是否可以实现正确的事件时间会话? 的相关文章

随机推荐