Apache Beam with Dataflow - NullPointerException when reading from BigQuery

2024-03-06

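I am running a job on Google Dataflow, written with Apache Beam, that reads from a BigQuery table and from files, transforms the data, and writes it to other BigQuery tables. The job "usually" succeeds, but sometimes it randomly throws a NullPointerException while reading from the BigQuery table, and the job fails: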

(288abb7678892196): java.lang.NullPointerException
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:98)
at com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:261)
at com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:209)
at com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:184)
at com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:161)
at com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:47)
at com.google.cloud.dataflow.worker.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:341)
at com.google.cloud.dataflow.worker.runners.worker.DataflowWorker.doWork(DataflowWorker.java:297)
at com.google.cloud.dataflow.worker.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:244)
at com.google.cloud.dataflow.worker.runners.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:125)
at com.google.cloud.dataflow.worker.runners.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:105)
at com.google.cloud.dataflow.worker.runners.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:92)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I cannot figure out what this is related to. When I clear the temp directory and re-upload the template, the job passes again.

The way I read from BQ is simply:

BigQueryIO.read().fromQuery()

I would greatly appreciate any help.

Anyone?


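I ended up filing a bug in the Google issue tracker. After a long conversation with a Google employee and some investigation, it turned out that it makes no sense to use templates with Dataflow batch jobs that read from BigQuery, because such a template can only be executed once.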

To quote: "For BigQuery batch pipelines, templates can only be executed once, as the BigQuery job ID is set at template creation time. This restriction will be removed in a future release for the SDK 2, but I cannot say when. Creating Templates: https://cloud.google.com/dataflow/docs/templates/creating-templates#pipeline-io-and-runtime-parameters"
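To make the quoted explanation more concrete, here is a toy model in plain Java. It is only a sketch under assumptions: FakeJobService, StagedTemplate and DirectLaunch are hypothetical names invented for illustration, not Beam or Dataflow classes, and the model does not reproduce the actual NullPointerException. It only shows the difference between an ID fixed once when the template is built and an ID chosen on every run.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/** Toy model only: none of these classes are Beam or Dataflow APIs. */
class TemplateJobIdSketch {

    /** Hypothetical stand-in for a service that refuses a job ID it has already seen. */
    static class FakeJobService {
        private final Set<String> usedJobIds = new HashSet<>();

        /** Returns true if the job was accepted, false if the ID was already used. */
        boolean submit(String jobId) {
            return usedJobIds.add(jobId);
        }
    }

    /** Hypothetical "template": the job ID is captured once, at construction time. */
    static class StagedTemplate {
        private final String jobId = UUID.randomUUID().toString();

        boolean run(FakeJobService service) {
            return service.submit(jobId); // the same ID is reused on every run
        }
    }

    /** Hypothetical direct launch: a fresh job ID is generated for each run. */
    static class DirectLaunch {
        boolean run(FakeJobService service) {
            return service.submit(UUID.randomUUID().toString());
        }
    }

    public static void main(String[] args) {
        FakeJobService service = new FakeJobService();

        StagedTemplate template = new StagedTemplate();
        System.out.println("template run 1 accepted: " + template.run(service));
        System.out.println("template run 2 accepted: " + template.run(service));

        DirectLaunch direct = new DirectLaunch();
        System.out.println("direct run 1 accepted: " + direct.run(service));
        System.out.println("direct run 2 accepted: " + direct.run(service));
    }
}
```

In this model the first template run is accepted and the second is rejected, while every direct launch is accepted. Building a new StagedTemplate yields a new ID, which would be consistent with the job passing again after the template is re-uploaded, although what actually happens inside the SDK may differ.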

It would be great if the error were clearer than a NullPointerException.

Anyway, I hope this helps someone in the future.

If anyone is interested in the whole conversation, the issue is here: https://issuetracker.google.com/issues/63124894
