AWS Glue job fails with a connection timeout error

2024-01-28

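I'm new to AWS Glue. I created a job that reads two Data Catalog tables and runs a simple SparkSQL query on top of them. The job fails at the transform step with this exception: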

pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to glue.us-east-1.amazonaws.com:443 [blah] failed: connect timed out;'

The VPC security group for the JDBC source (Redshift) has inbound and outbound rules configured.

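I saw another post on SO about configuring a VPC endpoint for Glue itself, but I don't quite understand what it should look like. Should it be an interface endpoint for glue.us-east-1.amazonaws.com:443, or something else? I'm confused.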

UPD: the auto-generated PySpark script

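import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [TempDir, JOB_NAME]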
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "redshift_catalog", redshift_tmp_dir = TempDir, table_name = "analytics_mongo_raw_conversations", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "redshift_catalog", redshift_tmp_dir = args["TempDir"], table_name = "analytics_mongo_raw_conversations", transformation_ctx = "DataSource0")
## @type: DataSource
## @args: [database = "redshift_catalog", redshift_tmp_dir = TempDir, table_name = "analytics_mongo_raw_messages", transformation_ctx = "DataSource1"]
## @return: DataSource1
## @inputs: []
DataSource1 = glueContext.create_dynamic_frame.from_catalog(database = "redshift_catalog", redshift_tmp_dir = args["TempDir"], table_name = "analytics_mongo_raw_messages", transformation_ctx = "DataSource1")
## @type: SqlCode
## @args: [sqlAliases = {"messages": DataSource1, "conversations": DataSource0}, sqlName = SqlQuery0, transformation_ctx = "Transform0"]
## @return: Transform0
## @inputs: [dfc = DataSource1,DataSource0]
# Note: sparkSqlQuery and SqlQuery0 are not defined in the posted excerpt.
Transform0 = sparkSqlQuery(glueContext, query = SqlQuery0, mapping = {"messages": DataSource1, "conversations": DataSource0}, transformation_ctx = "Transform0")
job.commit()

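I was able to solve the problem: a VPC endpoint was indeed required. In addition, the connection has to use a private subnet with a NAT gateway. My original subnet had no NAT.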
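To confirm whether a given subnet can reach the Glue API at all, a quick TCP check can help separate a networking problem from a script problem. The following is only a minimal sketch using the Python standard library; the function name `can_reach` is hypothetical, and it tests nothing beyond whether a TCP connection opens within the timeout.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False


# Hypothetical usage near the top of the Glue script, before the first catalog call:
# print(can_reach("glue.us-east-1.amazonaws.com"))
```

If this returns False from inside the job, the "connect timed out" error most likely comes from the subnet routing or endpoint setup and not from the SparkSQL step itself.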

Example VPC endpoint configuration in Terraform:

resource "aws_vpc_endpoint" "glue" {
  vpc_id            = var.vpc_id
  service_name      = var.glue_vpc_service_name
  vpc_endpoint_type = "Interface"

  security_group_ids = var.security_group_ids
  subnet_ids         = var.subnet_ids

  tags = { mytag = "mytag" }
}
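
For readers not using Terraform, roughly the same endpoint can be described with boto3. This is a sketch under assumptions, not the configuration from the answer: it assumes the Glue interface endpoint service name follows the pattern `com.amazonaws.<region>.glue` (presumably the value held by `var.glue_vpc_service_name`), the helper names are hypothetical, and the IDs are placeholders.

```python
def glue_endpoint_params(region, vpc_id, subnet_ids, security_group_ids):
    """Build keyword arguments for EC2 create_vpc_endpoint (Glue interface endpoint)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Assumed naming pattern for the Glue endpoint service in a region.
        "ServiceName": f"com.amazonaws.{region}.glue",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }


def create_glue_endpoint(region, **ids):
    """Create the endpoint; needs boto3 and AWS credentials, so it is not called here."""
    # boto3 is imported lazily so the parameter builder can be used without it.
    import boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_vpc_endpoint(**glue_endpoint_params(region, **ids))


# Placeholder IDs, for illustration only.
params = glue_endpoint_params(
    "us-east-1", "vpc-0123456789abcdef0", ["subnet-0aaa"], ["sg-0bbb"]
)
```

Keeping the parameter builder separate from the API call makes it possible to inspect the request before anything is created in the account.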