I'm new to AWS Glue. I created a job that reads two Data Catalog tables and runs a simple SparkSQL query on top of them. The job fails at the transform step with this exception:
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to glue.us-east-1.amazonaws.com:443 [blah] failed: connect timed out;'
The VPC security group for the JDBC source (Redshift) already has the inbound and outbound rules configured.
I saw another post on SO about configuring a VPC endpoint for Glue itself, but I don't quite understand what it should look like. Should it be an interface endpoint for glue.us-east-1.amazonaws.com:443, or something else? I'm confused.
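From what I've gathered in the AWS docs, the endpoint would be an *interface* endpoint for the Glue service in the job's VPC. Here is what I think the CLI call would look like (the VPC, subnet, and security-group IDs are placeholders for my own; the security group would need to allow HTTPS from the Glue connection's subnet) — please correct me if this is wrong:

```shell
# Sketch only: create an interface VPC endpoint for AWS Glue in us-east-1.
# vpc-xxxxxxxx / subnet-xxxxxxxx / sg-xxxxxxxx are placeholders.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxxxxx \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.glue \
  --subnet-ids subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx \
  --private-dns-enabled
```

With `--private-dns-enabled`, calls to glue.us-east-1.amazonaws.com:443 from inside the VPC should resolve to the endpoint's private IPs — is that the piece I'm missing?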
UPD: the auto-generated PySpark script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

# Helper generated by Glue Studio for SparkSQL transforms: registers each
# DynamicFrame as a temp view under its alias, runs the query, and wraps
# the resulting DataFrame back into a DynamicFrame.
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx):
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "redshift_catalog", redshift_tmp_dir = TempDir, table_name = "analytics_mongo_raw_conversations", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "redshift_catalog", redshift_tmp_dir = args["TempDir"], table_name = "analytics_mongo_raw_conversations", transformation_ctx = "DataSource0")
## @type: DataSource
## @args: [database = "redshift_catalog", redshift_tmp_dir = TempDir, table_name = "analytics_mongo_raw_messages", transformation_ctx = "DataSource1"]
## @return: DataSource1
## @inputs: []
DataSource1 = glueContext.create_dynamic_frame.from_catalog(database = "redshift_catalog", redshift_tmp_dir = args["TempDir"], table_name = "analytics_mongo_raw_messages", transformation_ctx = "DataSource1")
## @type: SqlCode
## @args: [sqlAliases = {"messages": DataSource1, "conversations": DataSource0}, sqlName = SqlQuery0, transformation_ctx = "Transform0"]
## @return: Transform0
## @inputs: [dfc = DataSource1,DataSource0]
# SqlQuery0 (the SQL string for the transform) is defined earlier in the full generated script; omitted here
Transform0 = sparkSqlQuery(glueContext, query = SqlQuery0, mapping = {"messages": DataSource1, "conversations": DataSource0}, transformation_ctx = "Transform0")
job.commit()