How to write a table to Hive from Spark without using the warehouse connector in HDP 3.1

2023-12-06

When trying to write to a Hive table with Spark 2.3 on HDP 3.1, directly into the Hive schema and without the warehouse connector, using:

spark-shell --driver-memory 16g --master local[3] --conf spark.hadoop.metastore.catalog.default=hive
val df = Seq(1,2,3,4).toDF
spark.sql("create database foo")
df.write.saveAsTable("foo.my_table_01")

fails with:

Table foo.my_table_01 failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional

but a:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col
val df = Seq(1,2,3,4).toDF.withColumn("part", col("value"))
df.write.partitionBy("part").option("compression", "zlib").mode(SaveMode.Overwrite).format("orc").saveAsTable("foo.my_table_02")

Spark with spark.sql("select * from foo.my_table_02").show works just fine. Now going to Hive/beeline:

0: jdbc:hive2://hostname:2181/> select * from my_table_02;
Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)

 describe extended my_table_02;

returns

+-----------------------------+----------------------------------------------------+----------+
|          col_name           |                     data_type                      | comment  |
+-----------------------------+----------------------------------------------------+----------+
| value                       | int                                                |          |
| part                        | int                                                |          |
|                             | NULL                                               | NULL     |
| # Partition Information     | NULL                                               | NULL     |
| # col_name                  | data_type                                          | comment  |
| part                        | int                                                |          |
|                             | NULL                                               | NULL     |
| Detailed Table Information  | Table(tableName:my_table_02, dbName:foo, owner:hive/[email protected], createTime:1571201905, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:value, type:int, comment:null), FieldSchema(name:part, type:int, comment:null)], location:hdfs://bd-sandbox.t-mobile.at:8020/warehouse/tablespace/external/hive/foo.db/my_table_02, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{path=hdfs://bd-sandbox.t-mobile.at:8020/warehouse/tablespace/external/hive/foo.db/my_table_02, compression=zlib, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:part, type:int, comment:null)], parameters:{numRows=0, rawDataSize=0, spark.sql.sources.schema.partCol.0=part, transient_lastDdlTime=1571201906, bucketing_version=2, spark.sql.create.version=2.3.2.3.1.0.0-78, totalSize=740, spark.sql.sources.schema.numPartCols=1, spark.sql.sources.schema.part.0={\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"part\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}, numFiles=4, numPartitions=4, spark.sql.partitionProvider=catalog, spark.sql.sources.schema.numParts=1, spark.sql.sources.provider=orc, transactional=true}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, rewriteEnabled:false, catName:hive, ownerType:USER, writeId:-1) |

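How can I use Spark to write to Hive without using the warehouse connector, but still writing to the same metastore, so that the table can later be read by Hive? To the best of my knowledge external tables should be possible (they are not managed, not ACID, not transactional), but I am not sure how to tell saveAsTable how to handle these.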
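One thing that might be worth trying here: when an explicit path is passed to the DataFrameWriter, Spark registers the table as external instead of managed. The following is only a sketch for a spark-shell session (where spark already exists). The table name foo.my_table_03 and the location are made up for illustration, and I have not verified that Hive on HDP 3.1 can read the result.

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col
import spark.implicits._

// Hypothetical location outside the managed warehouse directory; adjust to your cluster.
val externalPath = "/warehouse/tablespace/external/hive/foo.db/my_table_03"

val df = Seq(1, 2, 3, 4).toDF("value").withColumn("part", col("value"))

df.write
  .partitionBy("part")
  .format("orc")
  .option("compression", "zlib")
  .option("path", externalPath) // an explicit path makes Spark create an EXTERNAL table
  .mode(SaveMode.Overwrite)
  .saveAsTable("foo.my_table_03")

// Inspect what was registered in the metastore (table type, location, parameters).
spark.sql("describe formatted foo.my_table_03").show(100, false)
```

If the describe output still shows MANAGED_TABLE or transactional=true, the strict managed table checks on the Hive side would presumably still apply.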

edit

I just found https://community.cloudera.com/t5/Support-Questions/Spark-hive-warehouse-connector-not-loading-data-when-using/td-p/243613

And:

execute() vs executeQuery()

executeQuery() will always use Hiveserver2-interactive/LLAP, as it uses the fast ARROW protocol. Using it when the jdbc URL points to the non-LLAP Hiveserver2 will yield an error.

execute() uses JDBC and does not depend on LLAP, but has a built-in restriction to return at most 1.000 records. For most queries (INSERT INTO ... SELECT, count, sum, average) that is not a problem.
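For reference, the two calls described in the quote look roughly like this with the warehouse connector. This is a sketch only: it assumes the HWC jar is on the classpath and that spark.sql.hive.hiveserver2.jdbc.url is configured, and the queries against foo.my_table_02 are just placeholders.

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// Build a warehouse connector session on top of the existing SparkSession.
val hive = HiveWarehouseSession.session(spark).build()

// Goes through Hiveserver2-interactive/LLAP; expected to fail against a non-LLAP HS2.
val viaLlap = hive.executeQuery("select * from foo.my_table_02")
viaLlap.show()

// Goes through plain JDBC; the number of returned rows is capped.
val viaJdbc = hive.execute("select count(*) from foo.my_table_02")
viaJdbc.show()
```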

But doesn't this kill any high-performance interoperability between Hive and Spark? Especially if there are not enough LLAP nodes available for large-scale ETL.

In fact, this is true. The setting can be configured at https://github.com/hortonworks-spark/spark-llap/blob/26d164e62b45cfa1420d5d43cdef13d1d29bb877/src/main/java/com/hortonworks/spark/sql/hive/llap/HWConf.java#L39, though I am not sure about the performance impact of increasing this value.

Did you try

    data.write \
        .mode("append") \
        .insertInto("tableName")
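In Scala, to match the rest of the question, the same suggestion would look roughly like the sketch below. It assumes the target table already exists with the layout of foo.my_table_02 from above (value int, partitioned by part int). Note that insertInto matches columns by position, not by name, so the partition column has to come last. Whether Hive can read the result on HDP 3.1 is not something I have tested.

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._

val data = Seq(5, 6).toDF("value").withColumn("part", col("value"))

data
  .select("value", "part") // column order must match the table; partition column last
  .write
  .mode("append")
  .insertInto("foo.my_table_02")
```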
