Error when scheduling a Dataflow job with Airflow's DataflowPythonOperator

2024-04-09

I am trying to schedule a Dataflow job with Airflow's DataflowPythonOperator. Here is the operator in my DAG:

from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

test = DataFlowPythonOperator(
    task_id='my_task',
    py_file='path/my_pyfile.py',
    gcp_conn_id='my_conn_id',
    dataflow_default_options={
        "project": 'my_project',
        "runner": "DataflowRunner",
        "job_name": 'my_job',
        "staging_location": 'gs://my/staging',
        "temp_location": 'gs://my/temping',
        "requirements_file": 'path/requirements.txt'
    }
)

gcp_conn_id is set up and works. The error says that Dataflow failed with return code 1. The full log is below.

[2018-07-05 18:24:39,928] {gcp_dataflow_hook.py:108} INFO - Start waiting for DataFlow process to complete.
[2018-07-05 18:24:40,049] {base_task_runner.py:95} INFO - Subtask: 
[2018-07-05 18:24:40,049] {models.py:1433} ERROR - DataFlow failed with return code 1
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: Traceback (most recent call last):
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1390, in run
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: result = task_copy.execute(context=context)
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/dataflow_operator.py", line 182, in execute
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: self.py_file, self.py_options)
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 152, in start_python_dataflow
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: task_id, variables, dataflow, name, ["python"] + py_options)
[2018-07-05 18:24:40,051] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 138, in _start_dataflow
[2018-07-05 18:24:40,051] {base_task_runner.py:95} INFO - Subtask: _Dataflow(cmd).wait_for_done()
[2018-07-05 18:24:40,051] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 119, in wait_for_done
[2018-07-05 18:24:40,051] {base_task_runner.py:95} INFO - Subtask: self._proc.returncode))
[2018-07-05 18:24:40,051] {base_task_runner.py:95} INFO - Subtask: Exception: DataFlow failed with return code 1

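Something seems to go wrong in gcp_dataflow_hook.py, but beyond that there is no further information. Is there a way to fix this? Is there an example of DataflowPythonOperator in use? (I have not found any usage examples so far.)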
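One way to get more detail than "return code 1" is to launch the same pipeline file outside Airflow and read its stderr directly. The sketch below assumes each option is passed to the script as --key=value (the traceback shows ["python"] + py_options being handed to a subprocess, but the exact flag formatting is an assumption to check against your Airflow version). build_dataflow_command and run_and_report are hypothetical helper names, not part of Airflow.

```python
import subprocess
import sys


def build_dataflow_command(py_file, options, interpreter="python"):
    """Build the argv list for launching a pipeline file by hand.

    Assumption: each option is passed as --key=value; check the hook of
    your Airflow version for the exact formatting.
    """
    cmd = [interpreter, py_file]
    for key in sorted(options):
        cmd.append("--{}={}".format(key, options[key]))
    return cmd


def run_and_report(cmd):
    """Run cmd, echo its stderr on failure, and return the exit code."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    _, err = proc.communicate()
    if proc.returncode != 0:
        sys.stderr.write(err.decode("utf-8", "replace"))
    return proc.returncode
```

Running the resulting command on the Airflow worker itself, as the same user, is the closest match to what the operator does, so any missing dependency or bad path should show up in the echoed stderr.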

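I did not get the same error message, but I think this may help. The Python Dataflow runner appears to terminate in an odd way that does not affect standalone Dataflow jobs but is not handled correctly by Airflow's DataFlowPythonOperator class. I am filing a ticket, but here is a workaround that solved my problem. Important: the patch must be applied to the Dataflow job, not to the Airflow job.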

Add the following imports at the top of the Dataflow job:

import logging
import threading
import time
import types
from apache_beam.runners.runner import PipelineState

Next, add the following above your Dataflow code. It is mostly cut and pasted from the main ~dataflow.dataflow_runner class, with the edits marked in comments:

def local_poll_for_job_completion(runner, result, duration):
    """Polls for the specified job to finish running (successfully or not).
    Updates the result with the new job information before returning.
    Args:
      runner: DataflowRunner instance to use for polling job state.
      result: DataflowPipelineResult instance used for job information.
      duration (int): The time to wait (in milliseconds) for job to finish.
        If it is set to :data:`None`, it will wait indefinitely until the job
        is finished.
    """
    last_message_time = None
    current_seen_messages = set()

    last_error_rank = float('-inf')
    last_error_msg = None
    last_job_state = None
    # How long to wait after pipeline failure for the error
    # message to show up giving the reason for the failure.
    # It typically takes about 30 seconds.
    final_countdown_timer_secs = 50.0
    sleep_secs = 5.0

    # Try to prioritize the user-level traceback, if any.
    def rank_error(msg):
        if 'work item was attempted' in msg:
            return -1
        elif 'Traceback' in msg:
            return 1
        return 0

    if duration:
        start_secs = time.time()
        duration_secs = duration // 1000

    job_id = result.job_id()
    keep_checking = True  ### Changed here!!!
    while keep_checking:  ### Changed here!!!
        response = runner.dataflow_client.get_job(job_id)
        # If get() is called very soon after Create() the response may not contain
        # an initialized 'currentState' field.
        logging.info("Current state: " + str(response.currentState))
        # Stop looking if the job is not terminating normally
        if str(response.currentState) in (  ### Changed here!!!
                'JOB_STATE_DONE',  ### Changed here!!!
                'JOB_STATE_CANCELLED',  ### Changed here!!!
                # 'JOB_STATE_UPDATED',
                'JOB_STATE_DRAINED',  ### Changed here!!!
                'JOB_STATE_FAILED'):  ### Changed here!!!
            keep_checking = False  ### Changed here!!!
            break
        if response.currentState is not None:
            if response.currentState != last_job_state:
                logging.info('Job %s is in state %s', job_id, response.currentState)
                last_job_state = response.currentState
            if str(response.currentState) != 'JOB_STATE_RUNNING':
                # Stop checking for new messages on timeout, explanatory
                # message received, success, or a terminal job state caused
                # by the user that therefore doesn't require explanation.
                if (final_countdown_timer_secs <= 0.0
                        or last_error_msg is not None
                        or str(response.currentState) == 'JOB_STATE_UPDATED'):  ### Changed here!!!
                    keep_checking = False  ### Changed here!!!
                    break

                # Check that job is in a post-preparation state before starting the
                # final countdown.
                if (str(response.currentState) not in (
                        'JOB_STATE_PENDING', 'JOB_STATE_QUEUED')):
                    # The job has failed; ensure we see any final error messages.
                    sleep_secs = 1.0      # poll faster during the final countdown
                    final_countdown_timer_secs -= sleep_secs

        time.sleep(sleep_secs)

        # Get all messages since beginning of the job run or since last message.
        page_token = None
        while True:
            messages, page_token = runner.dataflow_client.list_messages(
                job_id, page_token=page_token, start_time=last_message_time)
            for m in messages:
                message = '%s: %s: %s' % (m.time, m.messageImportance, m.messageText)

                if not last_message_time or m.time > last_message_time:
                    last_message_time = m.time
                    current_seen_messages = set()

                if message in current_seen_messages:
                    # Skip the message if it has already been seen at the current
                    # time. This could be the case since the list_messages API is
                    # queried starting at last_message_time.
                    continue
                else:
                    current_seen_messages.add(message)
                # Skip empty messages.
                if m.messageImportance is None:
                    continue
                logging.info(message)
                if str(m.messageImportance) == 'JOB_MESSAGE_ERROR':
                    if rank_error(m.messageText) >= last_error_rank:
                        last_error_rank = rank_error(m.messageText)
                        last_error_msg = m.messageText
            if not page_token:
                break

        if duration:
            passed_secs = time.time() - start_secs
            if passed_secs > duration_secs:
                logging.warning('Timing out on waiting for job %s after %d seconds',
                                job_id, passed_secs)
                break

    result._job = response
    runner.last_error_msg = last_error_msg


def local_is_in_terminal_state(self):
    logging.info("Current Dataflow job state: " + str(self.state))
    logging.info("Current has_job: " + str(self.has_job))
    if self.state in ('DONE', 'CANCELLED', 'DRAINED', 'FAILED'):
        return True
    else:
        return False


class DataflowRuntimeException(Exception):
    """Indicates an error has occurred in running this pipeline."""

    def __init__(self, msg, result):
        super(DataflowRuntimeException, self).__init__(msg)
        self.result = result


def local_wait_until_finish(self, duration=None):
    logging.info("!!!!!!!!!!!!!!!!You are in a Monkey Patch!!!!!!!!!!!!!!!!")
    if not local_is_in_terminal_state(self):  ### Changed here!!!
        if not self.has_job:
            raise IOError('Failed to get the Dataflow job id.')

        # DataflowRunner.poll_for_job_completion(self._runner, self, duration)
        thread = threading.Thread(
            target=local_poll_for_job_completion,  ### Changed here!!!
            args=(self._runner, self, duration))

        # Mark the thread as a daemon thread so a keyboard interrupt on the main
        # thread will terminate everything. This is also the reason we will not
        # use thread.join() to wait for the polling thread.
        thread.daemon = True
        thread.start()
        while thread.is_alive():
            time.sleep(5.0)

        terminated = local_is_in_terminal_state(self)  ### Changed here!!!
        logging.info("Terminated state: " + str(terminated))
        # logging.info("duration: " + str(duration))
        # assert duration or terminated, (  ### Changed here!!!
        #     'Job did not reach to a terminal state after waiting indefinitely.')  ### Changed here!!!

        assert terminated, "Timed out after duration: " + str(duration)  ### Changed here!!!

    else:  ### Changed here!!!
        assert False, "local_wait_until_finish failed at the start"  ### Changed here!!!

    if self.state != PipelineState.DONE:
        # TODO(BEAM-1290): Consider converting this to an error log based on
        # the resolution of the issue.
        raise DataflowRuntimeException(
            'Dataflow pipeline failed. State: %s, Error:\n%s' %
            (self.state, getattr(self._runner, 'last_error_msg', None)), self)

    return self.state
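Stripped of message handling, the core change in the patched polling loop above is the explicit check against a set of terminal job states. The following is only a reduced sketch of that idea, not the Beam implementation: get_state is a hypothetical callable standing in for runner.dataflow_client.get_job(job_id).currentState, and the sleep parameter exists so the loop can be exercised without real waiting.

```python
import time

# Job states treated as terminal, matching the list in the patched loop.
TERMINAL_STATES = frozenset([
    "JOB_STATE_DONE",
    "JOB_STATE_CANCELLED",
    "JOB_STATE_DRAINED",
    "JOB_STATE_FAILED",
])


def poll_until_terminal(get_state, sleep_secs=5.0, timeout_secs=None,
                        sleep=time.sleep):
    """Call get_state() until it returns a terminal state or time runs out.

    Returns the last state seen. get_state is a hypothetical stand-in for
    the Dataflow client call used in the patch above.
    """
    start = time.time()
    while True:
        state = str(get_state())
        if state in TERMINAL_STATES:
            return state
        if timeout_secs is not None and time.time() - start > timeout_secs:
            return state
        sleep(sleep_secs)
```

For example, poll_until_terminal(lambda: "JOB_STATE_DONE") returns immediately, while a callable that keeps reporting JOB_STATE_RUNNING only returns once timeout_secs has elapsed.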

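Then, when you start the pipeline, create it explicitly as shown here (not with the "with beam.Pipeline(options=pipeline_options) as p:" form, which runs the pipeline and waits for it on exit and so leaves no place to patch the result):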

p = beam.Pipeline(options=pipeline_options)

Finally, once your pipeline has been built, use the following:

result = p.run()
# Monkey patch to better handle termination
result.wait_until_finish = types.MethodType(local_wait_until_finish, result)
result.wait_until_finish()
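The patch relies on types.MethodType to bind a plain function to one specific object, so only that result is affected and the class itself is left alone. The sketch below shows the mechanism in isolation; FakeResult is a hypothetical stand-in for the object returned by p.run(), not a Beam class.

```python
import types


class FakeResult(object):
    """Hypothetical stand-in for the object returned by p.run()."""

    def __init__(self, state):
        self.state = state

    def wait_until_finish(self):
        return "original:" + self.state


def patched_wait_until_finish(self):
    # Bound via MethodType, so `self` is the patched instance.
    return "patched:" + self.state


first = FakeResult("DONE")
second = FakeResult("DONE")

# Replace the method on `first` only; `second` keeps the class method.
first.wait_until_finish = types.MethodType(patched_wait_until_finish, first)
```

Because the replacement lives on the instance, it has to be applied after every p.run() call whose result should use the patched behaviour.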

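Note: this fix still does not solve the problem if you are running Airflow server v1.9 with the 1.10 patch files, as I was. The patched _Dataflow.wait_for_done function does not return the job_id, and it needs to. The patch for the patch is worse than the one above. Upgrade if you can. If you cannot, paste the code from the latest versions of the following files into your DAG script as a header, and it should work: airflow/contrib/hooks/gcp_api_base_hook.py, airflow/contrib/hooks/gcp_dataflow_hook.py and airflow/contrib/operators/dataflow_operator.py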


airflow

dataflow
