我正在使用 Spark Streaming 1.5.2,并使用 Direct Stream 方法从 Kafka 0.8.2.2 中提取数据。
我已启用检查点,以便我的驱动程序可以重新启动并从中断处继续,而不会丢失未处理的数据。
检查点被写入 S3,因为我在 Amazon AWS 上而不是在 Hadoop 集群之上运行。
批处理间隔为 1 秒,因为我想要低延迟。
问题是,将单个检查点写入 S3 需要 1 到 20 秒。它们在内存中备份,最终应用程序失败。
2016-04-28 18:26:55,483 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882407000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882407000', took 6071 bytes and 1724 ms
2016-04-28 18:26:58,812 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882407000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882407000', took 6024 bytes and 3329 ms
2016-04-28 18:27:00,327 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882408000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882408000', took 6068 bytes and 1515 ms
2016-04-28 18:27:06,667 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882408000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882408000', took 6024 bytes and 6340 ms
2016-04-28 18:27:11,689 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882409000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882409000', took 6067 bytes and 5022 ms
2016-04-28 18:27:15,982 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882409000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882409000', took 6024 bytes and 4293 ms
有没有办法在不增加批次间隔的情况下增加检查点之间的间隔?