III. Sorting
1. Order By: global sort, with only one Reducer
a. ASC (ascend): ascending order (the default); DESC (descend): descending order
b. The ORDER BY clause goes at the end of the SELECT statement
demo1: sort by salary in ascending order
hive (dyhtest)> select * from emp order by sal;
Query ID = atdyh_20220705230538_47b4d6c8-e5ba-426a-8924-c0807a85e328
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657033458921_0001, Tracking URL = http://hadoop103:8088/proxy/application_1657033458921_0001/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657033458921_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-05 23:05:51,612 Stage-1 map = 0%, reduce = 0%
2022-07-05 23:05:59,884 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.83 sec
2022-07-05 23:06:06,052 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.14 sec
MapReduce Total cumulative CPU time: 5 seconds 140 msec
Ended Job = job_1657033458921_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.14 sec HDFS Read: 13396 HDFS Write: 906 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 140 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
NULL CLERK 7902 NULL 800.00 NULL 20.0 NULL
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
Time taken: 28.548 seconds, Fetched: 14 row(s)
demo2: sort by salary in descending order
hive (dyhtest)> select * from emp order by sal desc;
Query ID = atdyh_20220705232041_06712168-6943-49ef-954d-67c02eeac0a5
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657033458921_0002, Tracking URL = http://hadoop103:8088/proxy/application_1657033458921_0002/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657033458921_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-05 23:20:48,938 Stage-1 map = 0%, reduce = 0%
2022-07-05 23:20:58,386 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.1 sec
2022-07-05 23:21:05,576 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.51 sec
MapReduce Total cumulative CPU time: 6 seconds 510 msec
Ended Job = job_1657033458921_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.51 sec HDFS Read: 13479 HDFS Write: 906 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 510 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
NULL CLERK 7902 NULL 800.00 NULL 20.0 NULL
Time taken: 24.705 seconds, Fetched: 14 row(s)
When Hive parses the SQL, it derives the number of map and reduce tasks from the query we wrote:
Notes:
a. The number of reducers can be set via a parameter, but a global sort (order by) always runs with exactly one reducer.
b. set mapreduce.job.reduces=-1; // -1 means "follow the job": Hive launches as many reduce tasks as the job needs; any other value is the actual number of reduce tasks started.
c. Settings made in the client (as opposed to a config file) last only for the current session; once the session ends, they must be set again.
For example:
hive (dyhtest)> set mapreduce.job.reduces=-3;
hive (dyhtest)> select * from emp order by sal desc;
Query ID = atdyh_20220705234313_7a1e822e-2798-4ad8-8d7c-969ed7bb98ab
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657033458921_0003, Tracking URL = http://hadoop103:8088/proxy/application_1657033458921_0003/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657033458921_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-05 23:43:22,017 Stage-1 map = 0%, reduce = 0%
2022-07-05 23:43:29,306 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.44 sec
2022-07-05 23:43:34,481 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.03 sec
MapReduce Total cumulative CPU time: 5 seconds 30 msec
Ended Job = job_1657033458921_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.03 sec HDFS Read: 13479 HDFS Write: 906 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 30 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
NULL CLERK 7902 NULL 800.00 NULL 20.0 NULL
Time taken: 22.062 seconds, Fetched: 14 row(s)
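Multiple sort keys, each with its own direction, can be combined in a single ORDER BY; a minimal sketch against the same emp table (shown for illustration, not part of the transcript above):
--- ascending by department, then descending by salary within each department
select * from emp order by deptno asc, sal desc;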
2. Sort By
a. For large datasets, order by is very inefficient. In many cases a full global sort is not needed, and sort by can be used instead.
b. Sort by produces one sorted file per reducer. Each reducer sorts its own data internally, but the overall result set is not sorted.
--- set the number of reduce tasks
hive (dyhtest)> set mapreduce.job.reduces=3;
--- check the configured number of reduce tasks
hive (dyhtest)> set mapreduce.job.reduces;
mapreduce.job.reduces=3
--- run a query to verify
hive (dyhtest)> select * from emp sort by deptno desc;
Query ID = atdyh_20220711172431_1e75f23e-3e66-414e-a52b-293c241febd6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0001, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0001/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657525216716_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2022-07-11 17:24:50,409 Stage-1 map = 0%, reduce = 0%
2022-07-11 17:24:58,815 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.09 sec
2022-07-11 17:25:09,084 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 10.43 sec
2022-07-11 17:25:14,357 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 18.82 sec
2022-07-11 17:25:15,385 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 28.02 sec
MapReduce Total cumulative CPU time: 28 seconds 20 msec
Ended Job = job_1657525216716_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 28.02 sec HDFS Read: 26550 HDFS Write: 1080 SUCCESS
Total MapReduce CPU Time Spent: 28 seconds 20 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
NULL CLERK 7902 NULL 800.00 NULL 20.0 NULL
Time taken: 44.645 seconds, Fetched: 14 row(s)
3. Distribute By
a. In some cases we need to control which reducer a particular row goes to, usually for a subsequent aggregation; the distribute by clause does exactly that.
b. distribute by is similar to partition (custom partitioning) in MR: it partitions the data, and it is used together with sort by.
c. The partitioning rule of distribute by: take the hash code of the distribute-by column modulo the number of reducers; rows with the same remainder go to the same partition (see the sketch after this list).
d. Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause.
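As a rough illustration of rule c (a sketch only, assuming 3 reducers; for an int column the hash code is the value itself, and the built-in hash() function can preview it):
--- hypothetical check: which reducer would each deptno land on with 3 reducers?
select distinct deptno, hash(deptno) % 3 as reducer_id from emp;
--- e.g. deptno=30 -> 0, deptno=10 -> 1, deptno=20 -> 2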
demo: distribute by department number, then sort by employee number in descending order within each partition.
--- create the output directory
[atdyh@hadoop102 datas]$ mkdir sort—by
[atdyh@hadoop102 sort—by]$ pwd
/opt/module/hive-3.1.2/datas/sort—by
--- write the partitioned results into the directory
hive (dyhtest)> insert overwrite local directory '/opt/module/hive-3.1.2/datas/sort—by' select * from emp distribute by deptno sort by empno desc;
Query ID = atdyh_20220711180613_efe25699-1bd3-49f7-876e-3a7d6149ebdc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0003, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0003/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657525216716_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2022-07-11 18:06:19,136 Stage-1 map = 0%, reduce = 0%
2022-07-11 18:06:26,307 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.04 sec
2022-07-11 18:06:36,916 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 9.75 sec
2022-07-11 18:06:37,943 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 25.68 sec
MapReduce Total cumulative CPU time: 25 seconds 680 msec
Ended Job = job_1657525216716_0003
Moving data to local directory /opt/module/hive-3.1.2/datas/sort—by
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 25.68 sec HDFS Read: 25601 HDFS Write: 651 SUCCESS
Total MapReduce CPU Time Spent: 25 seconds 680 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
Time taken: 25.687 seconds
4. Cluster By
a. When the distribute by and sort by columns are the same, cluster by can be used instead.
b. cluster by combines the functionality of distribute by with that of sort by, but the sort can only be ascending; ASC or DESC cannot be specified.
--- these two queries are equivalent
hive (default)> select * from emp cluster by deptno;
hive (default)> select * from emp distribute by deptno sort by deptno;
Note: distributing by department number does not pin each department to a fixed partition; for example, departments 20 and 30 may end up in the same partition.
IV. Partitioned Tables
a. Characteristics:
- A partitioned table corresponds to a separate directory on the HDFS file system; that directory holds all the data files of the partition.
- A partition in Hive is simply a subdirectory: a large dataset is split into smaller ones according to business needs.
- At query time, expressions in the WHERE clause select only the required partitions, which makes such queries much more efficient.
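For example, filtering on the partition column in the WHERE clause lets Hive scan only the matching partition directory; a minimal sketch using the dept_partition table created below:
--- partition pruning: only the day='2022-7-12' directory is read
select * from dept_partition where day = '2022-7-12';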
b. Partitioned table creation syntax (table name should follow your naming conventions):
create table table_name (
    col_name1 col_type1,
    col_name2 col_type2,
    ...
)
partitioned by (partition_col partition_col_type)
row format delimited fields terminated by '\t';
--- create the partitioned table
hive (dyhtest)> create table dept_partition (
> deptno int ,
> dname string,
> loc string
> )
> partitioned by (day string )
> row format delimited fields terminated by '\t' ;
OK
Time taken: 4.429 seconds
--- prepare the data
[atdyh@hadoop102 patitionedby]$ vim 2022-7-12
[atdyh@hadoop102 patitionedby]$ vim 2022-7-13
[atdyh@hadoop102 patitionedby]$ vim 2022-7-14
--- load data into the partitioned table
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-12' into table dept_partition partition(day='2022-7-12');
Loading data to table dyhtest.dept_partition partition (day=2022-7-12)
OK
Time taken: 1.989 seconds
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-13' into table dept_partition partition(day='2022-7-13');
Loading data to table dyhtest.dept_partition partition (day=2022-7-13)
OK
Time taken: 0.617 seconds
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-14' into table dept_partition partition(day='2022-7-14');
Loading data to table dyhtest.dept_partition partition (day=2022-7-14)
OK
Time taken: 0.548 seconds
Each partition of the table maps to its own HDFS directory, e.g. /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-12.
c. Partition operations:
Show partitions:
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
Time taken: 0.243 seconds, Fetched: 3 row(s)
Add a partition:
hive (dyhtest)> alter table dept_partition add partition(day ='2022-07-15');
OK
Time taken: 0.292 seconds
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-07-15
day=2022-7-12
day=2022-7-13
day=2022-7-14
Time taken: 0.149 seconds, Fetched: 4 row(s)
Drop a partition:
hive (dyhtest)> alter table dept_partition drop partition(day ='2022-07-15');
Dropped the partition day=2022-07-15
OK
Time taken: 1.35 seconds
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
Time taken: 0.113 seconds, Fetched: 3 row(s)
**Note:** when dropping multiple partitions in one statement, the partition specs must be separated by commas; when adding, no commas are used.
Adding partitions:
alter table dept_partition add partition(day='2020-04-04');
alter table dept_partition add partition(day='2020-04-05') partition(day='2020-04-06');
Dropping partitions:
alter table dept_partition drop partition(day='20200404');
alter table dept_partition drop partition(day='20200405'), partition(day='20200406');
d. Two-level (multi-level) partitions:
Create a two-level partitioned table:
hive (dyhtest)> create table dept_partition2(
> deptno int,
> dname string,
> loc string
> )
> partitioned by (day string ,hour string )
> row format delimited fields terminated by '\t' ;
OK
Time taken: 0.193 seconds
Load the data:
hive (dyhtest)>
> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-13'
> into table dept_partition2 partition(day='2022-7-12',hour='13') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-12, hour=13)
OK
Time taken: 0.42 seconds
hive (dyhtest)>
> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-14'
> into table dept_partition2 partition(day='2022-7-12',hour='14') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-12, hour=14)
OK
Time taken: 0.556 seconds
hive (dyhtest)>
> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-12'
> into table dept_partition2 partition(day='2022-7-13',hour='07') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-13, hour=07)
OK
Time taken: 0.462 seconds
hive (dyhtest)>
> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-13'
> into table dept_partition2 partition(day='2022-7-13',hour='08') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-13, hour=08)
OK
Time taken: 0.421 seconds
hive (dyhtest)>
> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-14'
> into table dept_partition2 partition(day='2022-7-13',hour='09') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-13, hour=09)
OK
Time taken: 0.441 seconds
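Querying a two-level partitioned table works the same way, with both partition columns usable in the WHERE clause; a minimal sketch (not part of the transcript above):
--- prune down to a single day/hour directory
select * from dept_partition2 where day = '2022-7-12' and hour = '13';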
e. Ways to associate data with a partitioned table
- Manually create the partition directory and upload the data by hand, then run a partition repair
--- create the HDFS directory
[atdyh@hadoop102 patitionedby]$ hadoop fs -mkdir /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-15
--- upload the data file into that directory
[atdyh@hadoop102 patitionedby]$ hadoop fs -put /opt/module/hive-3.1.2/datas/patitionedby/2022-7-12 /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-15
2022-07-13 01:26:27,861 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
--- repair the partitions
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
Time taken: 0.221 seconds, Fetched: 3 row(s)
hive (dyhtest)> msck repair table dept_partition;
OK
Partitions not in metastore: dept_partition:day=2022-7-15
Repair: Added partition to metastore dept_partition:day=2022-7-15
Time taken: 0.593 seconds, Fetched: 2 row(s)
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
day=2022-7-15
Time taken: 0.121 seconds, Fetched: 4 row(s)
Note: the partition-repair command comes up often: msck repair table <table_name>;
- Manually create the partition directory and upload the data by hand, then add the partition with alter table
--- create the HDFS directory
[atdyh@hadoop102 patitionedby]$ hadoop fs -mkdir /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-16
--- upload the data
[atdyh@hadoop102 patitionedby]$ hadoop fs -put /opt/module/hive-3.1.2/datas/patitionedby/2022-7-12 /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-16
2022-07-13 01:44:32,687 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
--- add the partition
hive (dyhtest)> alter table dept_partition add partition(day='2022-7-16');
OK
Time taken: 0.435 seconds
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
day=2022-7-15
day=2022-7-16
Time taken: 0.108 seconds, Fetched: 5 row(s)
Note: the add-partition syntax also comes up often: alter table <table_name> add partition(<partition_col>='<value>');
- Manually create the partition directory, then load the data with load data, which registers the partition automatically
--- create the HDFS directory
[atdyh@hadoop102 patitionedby]$ hadoop fs -mkdir /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-17
--- load the data
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-12'
> into table dept_partition partition(day='2022-7-17') ;
Loading data to table dyhtest.dept_partition partition (day=2022-7-17)
OK
Time taken: 2.476 seconds
f. Dynamic partitioning
- In a relational database, when inserting into a partitioned table, the database automatically routes each row to the proper partition based on the value of the partition column.
- Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition), but using it requires some configuration.
--- enabled by default
hive (dyhtest)> set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=true
--- set non-strict mode (this is the dynamic partition mode: the default, strict, requires at least one partition to be specified statically; nonstrict allows all partition columns to be dynamic)
--- check the default mode
hive (dyhtest)> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=strict
--- set non-strict mode (the standard value is nonstrict)
hive (dyhtest)> set hive.exec.dynamic.partition.mode=nostrict;
--- check the mode after setting it
hive (dyhtest)> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=nostrict
Additional parameters:
- Maximum total number of dynamic partitions that can be created across all nodes executing MR. Default 1000.
hive.exec.max.dynamic.partitions=1000
- Maximum number of dynamic partitions that can be created on each node executing MR. This must be set based on the actual data. For example, if the source data covers a full year, i.e. the day column has 365 values, the parameter must be set above 365; with the default of 100 the job fails.
hive.exec.max.dynamic.partitions.pernode=100
- Maximum number of HDFS files that can be created in the whole MR job. Default 100000.
hive.exec.max.created.files=100000
- Whether to throw an exception when an empty partition is generated. Usually does not need to be set. Default false.
hive.error.on.empty.partition=false
Tune these parameters as needed.
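The typical use of dynamic partitioning is an insert ... select where the partition value is taken from the trailing column(s) of the select list; a minimal sketch, assuming dept_partition_dy has the same columns as dept_partition and is partitioned by day:
--- the trailing day column feeds the dynamic partition
insert into table dept_partition_dy partition(day)
select deptno, dname, loc, day from dept_partition;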
demo:
--- prepare the data
[atdyh@hadoop102 datas]$ cp dept.txt dept_partition_dy.txt
[atdyh@hadoop102 datas]$ ll
total 52
-rw-rw-r--. 1 atdyh atdyh 69 Jul 13 10:29 dept_partition_dy.txt
--- load the data
hive (dyhtest)> load data local inpath 'opt/module/hive-3.1.2/datas/dept_partition_dy.txt' into table dept_partition_dy;
Query ID = atdyh_20220713104944_6f5aa5d6-4f15-4725-b75e-7fe273fc97a4
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0004, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0004/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657525216716_0004
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 1
2022-07-13 10:50:06,384 Stage-1 map = 0%, reduce = 0%
2022-07-13 10:50:17,880 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 5.36 sec
MapReduce Total cumulative CPU time: 5 seconds 360 msec
Ended Job = job_1657525216716_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://hadoop102:9820/user/hive/warehouse/dyhtest.db/dept_partition_dy/.hive-staging_hive_2022-07-13_10-49-45_694_2569470323021587088-1/-ext-10000
Loading data to table dyhtest.dept_partition_dy partition (day=null)
Time taken to load dynamic partitions: 0.043 seconds
Time taken for adding to write entity : 0.0 seconds
MapReduce Jobs Launched:
Stage-Stage-1: Reduce: 1 Cumulative CPU: 5.36 sec HDFS Read: 8981 HDFS Write: 87 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 360 msec
OK
dept_partition_dy__temp_table_for_load_data__.deptno dept_partition_dy__temp_table_for_load_data__.dname dept_partition_dy__temp_table_for_load_data__.loc dept_partition_dy__temp_table_for_load_data__.day
Time taken: 34.843 seconds
Note: when loading into a dynamically partitioned table, put the file on HDFS first; otherwise you may hit file-not-found errors.
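A sketch of the HDFS-based variant this note suggests (assuming the file was first uploaded to /dept_partition_dy.txt with hadoop fs -put; note there is no local keyword):
--- read the source file from HDFS instead of the local filesystem
load data inpath '/dept_partition_dy.txt' into table dept_partition_dy;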
V. Bucketed Tables
- Partitioning provides a convenient way to isolate data and optimize queries, but not every dataset can be partitioned sensibly.
- For a table or a partition, Hive can organize the data further into buckets, a finer-grained division of the data.
- Bucketing is another technique for breaking a dataset into more manageable parts.
- Partitioning targets the data's storage paths; bucketing targets the data files.
demo:
--- prepare the data
[atdyh@hadoop102 datas]$ hadoop fs -put student.txt /
2022-07-13 12:16:19,414 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
--- create the bucketed table
hive (dyhtest)> create table student_bucket(
> id int,
> name string
> )
> clustered by (id)
> into 4 buckets
> row format delimited fields terminated by '\t' ;
OK
Time taken: 0.111 seconds
--- inspect the table structure
--- load the data
hive (dyhtest)> load data inpath '/student.txt' into table student_bucket ;
Query ID = atdyh_20220713121702_9e78cb07-fdfe-4745-8c9e-02811b925b44
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0008, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0008/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657525216716_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2022-07-13 12:17:14,891 Stage-1 map = 0%, reduce = 0%
2022-07-13 12:17:22,143 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.61 sec
2022-07-13 12:17:31,862 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 12.09 sec
2022-07-13 12:17:36,310 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 21.05 sec
2022-07-13 12:17:37,396 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 49.33 sec
MapReduce Total cumulative CPU time: 49 seconds 330 msec
Ended Job = job_1657525216716_0008
Loading data to table dyhtest.student_bucket
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0009, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0009/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657525216716_0009
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2022-07-13 12:17:52,462 Stage-3 map = 0%, reduce = 0%
2022-07-13 12:17:57,937 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.65 sec
2022-07-13 12:18:04,327 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 5.67 sec
MapReduce Total cumulative CPU time: 5 seconds 670 msec
Ended Job = job_1657525216716_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 49.33 sec HDFS Read: 33366 HDFS Write: 1249 SUCCESS
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 5.67 sec HDFS Read: 12156 HDFS Write: 335 SUCCESS
Total MapReduce CPU Time Spent: 55 seconds 0 msec
OK
student_bucket__temp_table_for_load_data__.id student_bucket__temp_table_for_load_data__.name
Time taken: 62.976 seconds
Notes:
- Hive buckets rows by hashing the value of the bucketing column and taking it modulo the number of buckets; the remainder decides which bucket the record lands in (see the sketch at the end of this section).
- Set the number of reducers to -1 so the job decides how many it needs, or set it to a value greater than or equal to the bucket count of the table.
- Load data into a bucketed table from HDFS to avoid local file-not-found problems.
- Do not use local mode.
- Partitioning splits the storage directories; bucketing splits the data files themselves.
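As a rough illustration of the hashing rule (a sketch only: for an int id the hash code is the value itself, so with 4 buckets the bucket is roughly id % 4; hash() is a Hive built-in):
--- hypothetical check of bucket assignment in the 4-bucket table
select id, hash(id) % 4 as bucket_id from student_bucket limit 5;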