Hive Transactions (Hive Transaction Management)

2023-11-17

Hive transactions were enhanced in Hive 3.

hive-site.xml configuration

<property>
   <name>hive.txn.manager</name>
   <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
   <description>
     Set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager as part of turning on Hive
     transactions, which also requires appropriate settings for hive.compactor.initiator.on,
     hive.compactor.worker.threads, hive.support.concurrency (true),
     and hive.exec.dynamic.partition.mode (nonstrict).
     The default DummyTxnManager replicates pre-Hive-0.13 behavior and provides
     no transactions.
   </description>
 </property>
 <property>
   <name>hive.support.concurrency</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.compactor.initiator.on</name>
   <value>true</value>
   <description>
     Whether to run the initiator and cleaner threads on this metastore instance or not.
     Set this to true on one instance of the Thrift metastore service as part of turning
     on Hive transactions. For a complete list of parameters required for turning on
     transactions, see hive.txn.manager.
   </description>
 </property>
 <property>
   <name>hive.compactor.worker.threads</name>
   <value>2</value>
   <description>
     How many compactor worker threads to run on this metastore instance. Set this to a
     positive number on one or more instances of the Thrift metastore service as part of
     turning on Hive transactions. For a complete list of parameters required for turning
     on transactions, see hive.txn.manager.
     Worker threads spawn MapReduce jobs to do compactions. They do not do the compactions
     themselves. Increasing the number of worker threads will decrease the time it takes
     tables or partitions to be compacted once they are determined to need compaction.
     It will also increase the background load on the Hadoop cluster as more MapReduce jobs
     will be running in the background.
   </description>
 </property>
 <property>
   <name>hive.exec.dynamic.partition.mode</name>
   <value>nonstrict</value>
   <description>
     In strict mode, the user must specify at least one static partition
     in case the user accidentally overwrites all partitions.
     In nonstrict mode all partitions are allowed to be dynamic.
   </description>
 </property>
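
The settings above can be checked mechanically. The following is a minimal sketch, not an official tool: it assumes the properties are wrapped in the usual `<configuration>` root element (the snippet above omits it), and the helper names `read_properties` and `check_transaction_settings` are made up for this example. It also treats all five settings as required in one file, whereas the compactor settings only matter on the metastore instance that runs compaction.

```python
import xml.etree.ElementTree as ET

# Expected values, copied from the hive-site.xml snippet in this article.
REQUIRED = {
    "hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
    "hive.support.concurrency": "true",
    "hive.compactor.initiator.on": "true",
    "hive.exec.dynamic.partition.mode": "nonstrict",
}


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def check_transaction_settings(props):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for name, expected in REQUIRED.items():
        actual = props.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected!r}, found {actual!r}")
    # The worker thread count only has to be a positive integer.
    workers = props.get("hive.compactor.worker.threads", "0")
    if not workers.isdigit() or int(workers) <= 0:
        problems.append("hive.compactor.worker.threads: expected a positive integer")
    return problems


# Hypothetical input that deliberately omits hive.exec.dynamic.partition.mode.
SAMPLE = """<configuration>
  <property><name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value></property>
  <property><name>hive.support.concurrency</name><value>true</value></property>
  <property><name>hive.compactor.initiator.on</name><value>true</value></property>
  <property><name>hive.compactor.worker.threads</name><value>2</value></property>
</configuration>"""

for problem in check_transaction_settings(read_properties(SAMPLE)):
    print(problem)
```

Running it against the sample reports the one missing property; pointing `read_properties` at the text of a real hive-site.xml works the same way.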

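Configuration that controls whether tables created by Hive are transactional by default. The values shown below are false; according to the property descriptions, setting metastore.create.as.acid to true creates eligible tables as full ACID, and setting hive.create.as.insert.only to true creates them as insert-only.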

<property>
   <name>metastore.strict.managed.tables</name>
   <value>false</value>
   <description>
     Whether strict managed tables mode is enabled. With this mode enabled, only transactional tables (both full and insert-only) are allowed to be created as managed tables
   </description>
 </property>
 <property>
   <name>hive.create.as.insert.only</name>
   <value>false</value>
   <description>
     Whether the eligible tables should be created as ACID insert-only by default. Does not apply to external tables, the ones using storage handlers, etc.
   </description>
 </property>
  <property>
   <name>metastore.create.as.acid</name>
   <value>false</value>
   <description>
     Whether the eligible tables should be created as full ACID by default. Does not apply to external tables, the ones using storage handlers, etc.
   </description>
 </property>

Testing

Create a transactional table

create table t1(c1 int,c2 int) stored as orc tblproperties('transactional'='true');

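Run the following statements. Afterwards, each operation has produced a corresponding directory under the table's directory: a delta directory for each insert and a delete_delta directory for each delete.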

insert into t1 values(1,1),(2,2),(3,3);
insert into t1 values(4,4);
insert into t1 values(5,5);
insert into t1 values(6,6);
insert into t1 values(7,7);
insert into t1 values(8,8);
insert into t1 values(9,9);
insert into t1 values(10,10);
insert into t1 values(11,11);
insert into t1 values(12,12);
insert into t1 values(13,13);

delete from t1 where c1=13;
insert into t1 values(13,14);
delete from t1 where c1=13;
insert into t1 values(13,15);
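
To make sense of what appears under the table directory, the directory names can be classified with a small script. This is a sketch under an assumption that is not stated in the article: that ACID directories are named `delta_<minWriteId>_<maxWriteId>[_<statementId>]`, `delete_delta_<minWriteId>_<maxWriteId>[_<statementId>]`, and `base_<writeId>[_v<visibilityTxnId>]`. The listing at the bottom is hypothetical, not captured from a real cluster.

```python
import re

# Assumed naming scheme for ACID directories (see the lead-in above).
DELTA_RE = re.compile(r"^(delete_delta|delta)_(\d+)_(\d+)(?:_(\d+))?$")
BASE_RE = re.compile(r"^base_(\d+)(?:_v(\d+))?$")


def classify(dirname):
    """Return (kind, min_write_id, max_write_id), or None for other names."""
    m = DELTA_RE.match(dirname)
    if m:
        return (m.group(1), int(m.group(2)), int(m.group(3)))
    m = BASE_RE.match(dirname)
    if m:
        write_id = int(m.group(1))
        return ("base", write_id, write_id)
    return None


def summarize(dirnames):
    """Count how many base, delta, and delete_delta directories are present."""
    counts = {"base": 0, "delta": 0, "delete_delta": 0}
    for name in dirnames:
        info = classify(name)
        if info is not None:
            counts[info[0]] += 1
    return counts


# Hypothetical directory listing for illustration only.
listing = [
    "delta_0000001_0000001_0000",
    "delta_0000002_0000002_0000",
    "delete_delta_0000012_0000012_0000",
    "delta_0000013_0000013_0000",
    "_tmp_space",
]
print(summarize(listing))
```

A growing count of delta and delete_delta directories is what the compactor settings in hive-site.xml are meant to keep in check.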

insert_only transactions

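An insert_only transactional table does not have to use the ORC format; any format, such as Parquet, works.
The table t2 created below supports only insert, not delete or update.
The insert statement succeeds and the delete statement fails.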

create table t2(c1 int,c2 int) stored as orc tblproperties('transactional'='true','transactional_properties'='insert_only');

insert into t2 values(1,1),(2,2),(3,3);
delete from t2 where c1=3;

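Enabling transactions on an existing table

Full transaction support

Table t3 is created without transaction support. After alter table enables it, insert and delete statements can be executed.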

create table t3(c1 int,c2 int) stored as orc;
alter table t3 set  tblproperties('transactional'='true','transactional_properties'='default');
insert into t3 values(1,1),(2,2),(3,3);
delete from t3 where c1=3;

insert_only transaction support

create table t4(c1 int,c2 int) stored as orc;
alter table t4 set  tblproperties('transactional'='true','transactional_properties'='insert_only');
insert into t4 values(1,1),(2,2),(3,3);

Official Hive documentation: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

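Common problems

Summary of causes:

Multiple SQL statements can run insert and select on the same table at the same time. Running multiple UPDATE, DELETE, or MERGE statements on the same table at the same time is not allowed, because each of them generates a delete_delta directory. For example, two statements that update different records at the same time still produce an error.
The write_id list of a table is obtained during SQL compilation, but locks are acquired only just before execution and released after execution, so concurrency errors can occur in the window between compilation and execution.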
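
The conflict behaviour summarized above can be pictured with a toy model of first-committer-wins detection. This is a deliberately simplified sketch and not Hive's DbTxnManager: the class `ToyTxnManager`, its methods, and the single shared counter for transaction ids and commit ids are all assumptions made for illustration, and conflicts are tracked per table only.

```python
class WriteConflict(Exception):
    """Raised when a transaction loses to an earlier committer."""


class ToyTxnManager:
    """Toy first-committer-wins model (illustration only, not Hive code)."""

    def __init__(self):
        self._counter = 0
        self._start = {}       # txn id -> counter value when the txn opened
        self._write_set = {}   # txn id -> tables touched by update-type writes
        self._committed = []   # (commit id, tables) for committed writers

    def open_txn(self):
        self._counter += 1
        txn = self._counter
        self._start[txn] = txn
        self._write_set[txn] = set()
        return txn

    def record_update(self, txn, table):
        # Called for update-type writes; plain inserts are not recorded.
        self._write_set[txn].add(table)

    def commit(self, txn):
        mine = self._write_set.pop(txn)
        started = self._start.pop(txn)
        # Conflict: another writer of the same table committed after we began.
        for commit_id, tables in self._committed:
            if commit_id > started and tables & mine:
                raise WriteConflict(
                    f"txn {txn} aborted: write conflict on {sorted(tables & mine)}"
                )
        self._counter += 1
        self._committed.append((self._counter, mine))
        return self._counter


mgr = ToyTxnManager()
slow = mgr.open_txn()   # plays the role of the long-running session
fast = mgr.open_txn()   # plays the role of the session that finishes first
mgr.record_update(slow, "test.t1")
mgr.record_update(fast, "test.t1")
mgr.commit(fast)        # first committer wins
try:
    mgr.commit(slow)
except WriteConflict as exc:
    print(exc)
```

In this model the check happens only at commit time, which is why the slower transaction does all of its work before it learns that it has lost.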

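Problem list

Job errors
If a compaction happens while a query is running and the compaction deletes the original files, the query throws FileNotFoundException: File does not exist:.
Two sessions running insert overwrite at the same time throw LockException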

Create the table

create table t1(c1 int) stored as orc tblproperties('transactional'='true');

Connect to HiveServer with beeline in two separate windows.

In session 1, run the following command. The function timesleep sleeps for 10 seconds and returns 10001.

insert overwrite table t1 select default.timesleep(10000);

In session 2, run the following command. The command in session 2 finishes before the one in session 1.

insert overwrite table t1 select 1;

Session 1 throws the following exception:

ERROR : FAILED: Hive Internal Error: org.apache.hadoop.hive.ql.lockmgr.LockException(Transaction manager has aborted the transaction txnid:126.  Reason: Aborting [txnid:126,128] due to a write conflict on test/t1 committed by [txnid:127,127] u/u)
org.apache.hadoop.hive.ql.lockmgr.LockException: Transaction manager has aborted the transaction txnid:126.  Reason: Aborting [txnid:126,128] due to a write conflict on test/t1 committed by [txnid:127,127] u/u
	at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.commitTxn(DbTxnManager.java:670)

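Because a function call with a constant argument is evaluated to a concrete value during compilation, the next test passes a column of t1 to the function instead. Table t1 holds the following rows: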

select * from t1;
+--------+
| t1.c1  |
+--------+
| 1      |
| 2      |
| 3      |
+--------+

Run the following two statements.
Session 1:

insert overwrite table t2 select c1,default.timesleep(10000 * c1) from t1;

Session 2:

insert overwrite table t2 select c1,default.timesleep(10 * c1) from t1;

Session 2 fails at commit time.

ERROR : FAILED: Hive Internal Error: org.apache.hadoop.hive.ql.lockmgr.LockException(Transaction manager has aborted the transaction txnid:163.  Reason: Aborting [txnid:163,163] due to a write conflict on test/t2 committed by [txnid:162,163] u/u)
org.apache.hadoop.hive.ql.lockmgr.LockException: Transaction manager has aborted the transaction txnid:163.  Reason: Aborting [txnid:163,163] due to a write conflict on test/t2 committed by [txnid:162,163] u/u
	at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.commitTxn(DbTxnManager.java:670)

If drop table is executed in another session while insert overwrite is running, the first session throws a table-not-found exception. For example:

0: jdbc:hive2://localhost:10000/default> insert overwrite table t1 select default.timesleep(10000);
Error: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:23 Table not found 't1' (state=42S02,code=10001)

show transactions also lists aborted transactions.

show transactions;
+-----------------+--------------------+----------------+----------------------+-------------+------------------------+
|      txnid      |       state        |  startedtime   |  lastheartbeattime   |    user     |          host          |
+-----------------+--------------------+----------------+----------------------+-------------+------------------------+
| Transaction ID  | Transaction State  | Started Time   | Last Heartbeat Time  | User        | Hostname               |
| 126             | ABORTED            | 1646189665000  | 1646189665000        | houzhizhen  | localhost.localdomain  |
| 130             | OPEN               | 1646189960000  | 1646189973000        | houzhizhen  | localhost.localdomain  |
+-----------------+--------------------+----------------+----------------------+-------------+------------------------+
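
The startedtime and lastheartbeattime columns are hard to read as raw numbers. The sketch below assumes they are Unix epoch timestamps in milliseconds (the article does not say so, but the magnitudes fit) and converts the two rows shown above to UTC. The helper name `epoch_millis_to_utc` is made up for this example.

```python
from datetime import datetime, timezone


def epoch_millis_to_utc(ms):
    """Format an epoch-milliseconds value as a UTC timestamp string."""
    moment = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# (txnid, state, startedtime, lastheartbeattime) copied from the output above.
rows = [
    (126, "ABORTED", 1646189665000, 1646189665000),
    (130, "OPEN", 1646189960000, 1646189973000),
]

for txnid, state, started, heartbeat in rows:
    # Seconds between the start of the transaction and its last heartbeat.
    alive_seconds = (heartbeat - started) // 1000
    print(
        txnid,
        state,
        epoch_millis_to_utc(started),
        epoch_millis_to_utc(heartbeat),
        f"{alive_seconds}s between start and last heartbeat",
    )
```

Comparing the last heartbeat with the current time is a simple way to spot transactions whose client has gone away.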

Two sessions running insert at the same time do not conflict.

References:
Official documentation: Hive+Transactions
Slides: Transactional Operations in Apache Hive: Present and Future
