III. Sorting
1. Order By: global sort, with only one Reducer
a. ASC (ascend): ascending order (the default); DESC (descend): descending order
b. The ORDER BY clause goes at the end of the SELECT statement
demo1: sort by salary in ascending order
hive (dyhtest)> select * from emp order by sal;
Query ID = atdyh_20220705230538_47b4d6c8-e5ba-426a-8924-c0807a85e328
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657033458921_0001, Tracking URL = http://hadoop103:8088/proxy/application_1657033458921_0001/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657033458921_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-05 23:05:51,612 Stage-1 map = 0%, reduce = 0%
2022-07-05 23:05:59,884 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.83 sec
2022-07-05 23:06:06,052 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.14 sec
MapReduce Total cumulative CPU time: 5 seconds 140 msec
Ended Job = job_1657033458921_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.14 sec HDFS Read: 13396 HDFS Write: 906 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 140 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
NULL CLERK 7902 NULL 800.00 NULL 20.0 NULL
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
Time taken: 28.548 seconds, Fetched: 14 row(s)
demo2: sort by salary in descending order
hive (dyhtest)> select * from emp order by sal desc;
Query ID = atdyh_20220705232041_06712168-6943-49ef-954d-67c02eeac0a5
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657033458921_0002, Tracking URL = http://hadoop103:8088/proxy/application_1657033458921_0002/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657033458921_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-05 23:20:48,938 Stage-1 map = 0%, reduce = 0%
2022-07-05 23:20:58,386 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.1 sec
2022-07-05 23:21:05,576 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.51 sec
MapReduce Total cumulative CPU time: 6 seconds 510 msec
Ended Job = job_1657033458921_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.51 sec HDFS Read: 13479 HDFS Write: 906 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 510 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
NULL CLERK 7902 NULL 800.00 NULL 20.0 NULL
Time taken: 24.705 seconds, Fetched: 14 row(s)
When Hive parses the SQL, it derives the number of map and reduce tasks from the query we wrote:
Notes:
a. The number of reducers can be set via a parameter, but a global sort (order by) always runs with exactly one reducer.
b. set mapreduce.job.reduces=-1; // -1 means "follow the job": Hive launches as many reduce tasks as the job needs; any other value is the actual number of reduce tasks started.
c. Settings made in the client (as opposed to a config file) last only for the current session; once the session ends, they must be set again.
For example:
hive (dyhtest)> set mapreduce.job.reduces=-3;
hive (dyhtest)> select * from emp order by sal desc;
Query ID = atdyh_20220705234313_7a1e822e-2798-4ad8-8d7c-969ed7bb98ab
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657033458921_0003, Tracking URL = http://hadoop103:8088/proxy/application_1657033458921_0003/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657033458921_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-05 23:43:22,017 Stage-1 map = 0%, reduce = 0%
2022-07-05 23:43:29,306 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.44 sec
2022-07-05 23:43:34,481 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.03 sec
MapReduce Total cumulative CPU time: 5 seconds 30 msec
Ended Job = job_1657033458921_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.03 sec HDFS Read: 13479 HDFS Write: 906 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 30 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
NULL CLERK 7902 NULL 800.00 NULL 20.0 NULL
Time taken: 22.062 seconds, Fetched: 14 row(s)
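Multiple sort keys, each with its own direction, can be combined in a single ORDER BY; a minimal sketch against the same emp table (shown for illustration, not part of the transcript above):
--- ascending by department, then descending by salary within each department
select * from emp order by deptno asc, sal desc;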
2. Sort By
a. For large datasets, order by is very inefficient. In many cases a full global sort is not needed, and sort by can be used instead.
b. Sort by produces one sorted file per reducer. Each reducer sorts its own data internally, but the overall result set is not sorted.
--- set the number of reduce tasks
hive (dyhtest)> set mapreduce.job.reduces=3;
--- check the configured number of reduce tasks
hive (dyhtest)> set mapreduce.job.reduces;
mapreduce.job.reduces=3
--- run a query to verify
hive (dyhtest)> select * from emp sort by deptno desc;
Query ID = atdyh_20220711172431_1e75f23e-3e66-414e-a52b-293c241febd6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0001, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0001/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657525216716_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2022-07-11 17:24:50,409 Stage-1 map = 0%, reduce = 0%
2022-07-11 17:24:58,815 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.09 sec
2022-07-11 17:25:09,084 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 10.43 sec
2022-07-11 17:25:14,357 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 18.82 sec
2022-07-11 17:25:15,385 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 28.02 sec
MapReduce Total cumulative CPU time: 28 seconds 20 msec
Ended Job = job_1657525216716_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 28.02 sec HDFS Read: 26550 HDFS Write: 1080 SUCCESS
Total MapReduce CPU Time Spent: 28 seconds 20 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
NULL CLERK 7902 NULL 800.00 NULL 20.0 NULL
Time taken: 44.645 seconds, Fetched: 14 row(s)
3. Distribute By
a. In some cases we need to control which reducer a particular row goes to, usually for a subsequent aggregation; the distribute by clause does exactly that.
b. distribute by is similar to partition (custom partitioning) in MR: it partitions the data, and it is used together with sort by.
c. The partitioning rule of distribute by: take the hash code of the distribute-by column modulo the number of reducers; rows with the same remainder go to the same partition (see the sketch after this list).
d. Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause.
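As a rough illustration of rule c (a sketch only, assuming 3 reducers; for an int column the hash code is the value itself, and the built-in hash() function can preview it):
--- hypothetical check: which reducer would each deptno land on with 3 reducers?
select distinct deptno, hash(deptno) % 3 as reducer_id from emp;
--- e.g. deptno=30 -> 0, deptno=10 -> 1, deptno=20 -> 2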
demo: distribute by department number, then sort by employee number in descending order within each partition.
--- create the output directory
[atdyh@hadoop102 datas]$ mkdir sort—by
[atdyh@hadoop102 sort—by]$ pwd
/opt/module/hive-3.1.2/datas/sort—by
--- write the partitioned results into the directory
hive (dyhtest)> insert overwrite local directory '/opt/module/hive-3.1.2/datas/sort—by' select * from emp distribute by deptno sort by empno desc;
Query ID = atdyh_20220711180613_efe25699-1bd3-49f7-876e-3a7d6149ebdc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0003, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0003/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657525216716_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2022-07-11 18:06:19,136 Stage-1 map = 0%, reduce = 0%
2022-07-11 18:06:26,307 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.04 sec
2022-07-11 18:06:36,916 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 9.75 sec
2022-07-11 18:06:37,943 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 25.68 sec
MapReduce Total cumulative CPU time: 25 seconds 680 msec
Ended Job = job_1657525216716_0003
Moving data to local directory /opt/module/hive-3.1.2/datas/sort—by
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 25.68 sec HDFS Read: 25601 HDFS Write: 651 SUCCESS
Total MapReduce CPU Time Spent: 25 seconds 680 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
Time taken: 25.687 seconds
4. Cluster By
a. When the distribute by and sort by columns are the same, cluster by can be used instead.
b. cluster by combines the functionality of distribute by with that of sort by, but the sort can only be ascending; ASC or DESC cannot be specified.
--- these two queries are equivalent
hive (default)> select * from emp cluster by deptno;
hive (default)> select * from emp distribute by deptno sort by deptno;
Note: distributing by department number does not pin each department to a fixed partition; for example, departments 20 and 30 may end up in the same partition.
IV. Partitioned Tables
a. Characteristics:
- A partitioned table corresponds to a separate directory on the HDFS file system; that directory holds all the data files of the partition.
- A partition in Hive is simply a subdirectory: a large dataset is split into smaller ones according to business needs.
- At query time, expressions in the WHERE clause select only the required partitions, which makes such queries much more efficient.
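For example, filtering on the partition column in the WHERE clause lets Hive scan only the matching partition directory; a minimal sketch using the dept_partition table created below:
--- partition pruning: only the day='2022-7-12' directory is read
select * from dept_partition where day = '2022-7-12';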
b. Partitioned table creation syntax (table name should follow your naming conventions):
create table table_name (
    col_name1 col_type1,
    col_name2 col_type2,
    ...
)
partitioned by (partition_col partition_col_type)
row format delimited fields terminated by '\t';
--- create the partitioned table
hive (dyhtest)> create table dept_partition (
> deptno int ,
> dname string,
> loc string
> )
> partitioned by (day string )
> row format delimited fields terminated by '\t' ;
OK
Time taken: 4.429 seconds
--- prepare the data
[atdyh@hadoop102 patitionedby]$ vim 2022-7-12
[atdyh@hadoop102 patitionedby]$ vim 2022-7-13
[atdyh@hadoop102 patitionedby]$ vim 2022-7-14
--- load data into the partitioned table
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-12' into table dept_partition partition(day='2022-7-12');
Loading data to table dyhtest.dept_partition partition (day=2022-7-12)
OK
Time taken: 1.989 seconds
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-13' into table dept_partition partition(day='2022-7-13');
Loading data to table dyhtest.dept_partition partition (day=2022-7-13)
OK
Time taken: 0.617 seconds
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-14' into table dept_partition partition(day='2022-7-14');
Loading data to table dyhtest.dept_partition partition (day=2022-7-14)
OK
Time taken: 0.548 seconds
Each partition of the table maps to its own HDFS directory, e.g. /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-12.
c. Partition operations:
Show partitions:
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
Time taken: 0.243 seconds, Fetched: 3 row(s)
Add a partition:
hive (dyhtest)> alter table dept_partition add partition(day ='2022-07-15');
OK
Time taken: 0.292 seconds
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-07-15
day=2022-7-12
day=2022-7-13
day=2022-7-14
Time taken: 0.149 seconds, Fetched: 4 row(s)
Drop a partition:
hive (dyhtest)> alter table dept_partition drop partition(day ='2022-07-15');
Dropped the partition day=2022-07-15
OK
Time taken: 1.35 seconds
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
Time taken: 0.113 seconds, Fetched: 3 row(s)
**Note:** when dropping multiple partitions in one statement, the partition specs must be separated by commas; when adding, no commas are used.
Adding partitions:
alter table dept_partition add partition(day='2020-04-04');
alter table dept_partition add partition(day='2020-04-05') partition(day='2020-04-06');
Dropping partitions:
alter table dept_partition drop partition(day='20200404');
alter table dept_partition drop partition(day='20200405'), partition(day='20200406');
d. Two-level (multi-level) partitions:
Create a two-level partitioned table:
hive (dyhtest)> create table dept_partition2(
> deptno int,
> dname string,
> loc string
> )
> partitioned by (day string ,hour string )
> row format delimited fields terminated by '\t' ;
OK
Time taken: 0.193 seconds
Load the data:
hive (dyhtest)>
> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-13'
> into table dept_partition2 partition(day='2022-7-12',hour='13') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-12, hour=13)
OK
Time taken: 0.42 seconds
hive (dyhtest)>
> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-14'
> into table dept_partition2 partition(day='2022-7-12',hour='14') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-12, hour=14)
OK
Time taken: 0.556 seconds
hive (dyhtest)>
> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-12'
> into table dept_partition2 partition(day='2022-7-13',hour='07') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-13, hour=07)
OK
Time taken: 0.462 seconds
hive (dyhtest)>
> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-13'
> into table dept_partition2 partition(day='2022-7-13',hour='08') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-13, hour=08)
OK
Time taken: 0.421 seconds
hive (dyhtest)>
> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-14'
> into table dept_partition2 partition(day='2022-7-13',hour='09') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-13, hour=09)
OK
Time taken: 0.441 seconds
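Querying a two-level partitioned table works the same way, with both partition columns usable in the WHERE clause; a minimal sketch (not part of the transcript above):
--- prune down to a single day/hour directory
select * from dept_partition2 where day = '2022-7-12' and hour = '13';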
e. Ways to associate data with a partitioned table
- Manually create the partition directory and upload the data by hand, then run a partition repair
--- create the HDFS directory
[atdyh@hadoop102 patitionedby]$ hadoop fs -mkdir /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-15
--- upload the data file into that directory
[atdyh@hadoop102 patitionedby]$ hadoop fs -put /opt/module/hive-3.1.2/datas/patitionedby/2022-7-12 /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-15
2022-07-13 01:26:27,861 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
--- repair the partitions
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
Time taken: 0.221 seconds, Fetched: 3 row(s)
hive (dyhtest)> msck repair table dept_partition;
OK
Partitions not in metastore: dept_partition:day=2022-7-15
Repair: Added partition to metastore dept_partition:day=2022-7-15
Time taken: 0.593 seconds, Fetched: 2 row(s)
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
day=2022-7-15
Time taken: 0.121 seconds, Fetched: 4 row(s)
Note: the partition-repair command comes up often: msck repair table <table_name>;
- Manually create the partition directory and upload the data by hand, then add the partition with alter table
--- create the HDFS directory
[atdyh@hadoop102 patitionedby]$ hadoop fs -mkdir /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-16
--- upload the data
[atdyh@hadoop102 patitionedby]$ hadoop fs -put /opt/module/hive-3.1.2/datas/patitionedby/2022-7-12 /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-16
2022-07-13 01:44:32,687 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
--- add the partition
hive (dyhtest)> alter table dept_partition add partition(day='2022-7-16');
OK
Time taken: 0.435 seconds
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
day=2022-7-15
day=2022-7-16
Time taken: 0.108 seconds, Fetched: 5 row(s)
Note: the add-partition syntax also comes up often: alter table <table_name> add partition(<partition_col>='<value>');
- Manually create the partition directory, then load the data with load data, which registers the partition automatically
--- create the HDFS directory
[atdyh@hadoop102 patitionedby]$ hadoop fs -mkdir /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-17
--- load the data
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-12'
> into table dept_partition partition(day='2022-7-17') ;
Loading data to table dyhtest.dept_partition partition (day=2022-7-17)
OK
Time taken: 2.476 seconds
f. Dynamic partitioning
- In a relational database, when inserting into a partitioned table, the database automatically routes each row to the proper partition based on the value of the partition column.
- Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition), but using it requires some configuration.
--- enabled by default
hive (dyhtest)> set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=true
--- set non-strict mode (this is the dynamic partition mode: the default, strict, requires at least one partition to be specified statically; nonstrict allows all partition columns to be dynamic)
--- check the default mode
hive (dyhtest)> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=strict
--- set non-strict mode (the standard value is nonstrict)
hive (dyhtest)> set hive.exec.dynamic.partition.mode=nostrict;
--- check the mode after setting it
hive (dyhtest)> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=nostrict
Additional parameters:
- Maximum total number of dynamic partitions that can be created across all nodes executing MR. Default 1000.
hive.exec.max.dynamic.partitions=1000
- Maximum number of dynamic partitions that can be created on each node executing MR. This must be set based on the actual data. For example, if the source data covers a full year, i.e. the day column has 365 values, the parameter must be set above 365; with the default of 100 the job fails.
hive.exec.max.dynamic.partitions.pernode=100
- Maximum number of HDFS files that can be created in the whole MR job. Default 100000.
hive.exec.max.created.files=100000
- Whether to throw an exception when an empty partition is generated. Usually does not need to be set. Default false.
hive.error.on.empty.partition=false
Tune these parameters as needed.
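The typical use of dynamic partitioning is an insert ... select where the partition value is taken from the trailing column(s) of the select list; a minimal sketch, assuming dept_partition_dy has the same columns as dept_partition and is partitioned by day:
--- the trailing day column feeds the dynamic partition
insert into table dept_partition_dy partition(day)
select deptno, dname, loc, day from dept_partition;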
demo:
--- prepare the data
[atdyh@hadoop102 datas]$ cp dept.txt dept_partition_dy.txt
[atdyh@hadoop102 datas]$ ll
total 52
-rw-rw-r--. 1 atdyh atdyh 69 Jul 13 10:29 dept_partition_dy.txt
--- load the data
hive (dyhtest)> load data local inpath 'opt/module/hive-3.1.2/datas/dept_partition_dy.txt' into table dept_partition_dy;
Query ID = atdyh_20220713104944_6f5aa5d6-4f15-4725-b75e-7fe273fc97a4
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0004, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0004/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657525216716_0004
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 1
2022-07-13 10:50:06,384 Stage-1 map = 0%, reduce = 0%
2022-07-13 10:50:17,880 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 5.36 sec
MapReduce Total cumulative CPU time: 5 seconds 360 msec
Ended Job = job_1657525216716_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://hadoop102:9820/user/hive/warehouse/dyhtest.db/dept_partition_dy/.hive-staging_hive_2022-07-13_10-49-45_694_2569470323021587088-1/-ext-10000
Loading data to table dyhtest.dept_partition_dy partition (day=null)
Time taken to load dynamic partitions: 0.043 seconds
Time taken for adding to write entity : 0.0 seconds
MapReduce Jobs Launched:
Stage-Stage-1: Reduce: 1 Cumulative CPU: 5.36 sec HDFS Read: 8981 HDFS Write: 87 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 360 msec
OK
dept_partition_dy__temp_table_for_load_data__.deptno dept_partition_dy__temp_table_for_load_data__.dname dept_partition_dy__temp_table_for_load_data__.loc dept_partition_dy__temp_table_for_load_data__.day
Time taken: 34.843 seconds
Note: when loading into a dynamically partitioned table, put the file on HDFS first; otherwise you may hit file-not-found errors.
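A sketch of the HDFS-based variant this note suggests (assuming the file was first uploaded to /dept_partition_dy.txt with hadoop fs -put; note there is no local keyword):
--- read the source file from HDFS instead of the local filesystem
load data inpath '/dept_partition_dy.txt' into table dept_partition_dy;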
V. Bucketed Tables
- Partitioning provides a convenient way to isolate data and optimize queries, but not every dataset can be partitioned sensibly.
- For a table or a partition, Hive can organize the data further into buckets, a finer-grained division of the data.
- Bucketing is another technique for breaking a dataset into more manageable parts.
- Partitioning targets the data's storage paths; bucketing targets the data files.
demo:
--- prepare the data
[atdyh@hadoop102 datas]$ hadoop fs -put student.txt /
2022-07-13 12:16:19,414 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
--- create the bucketed table
hive (dyhtest)> create table student_bucket(
> id int,
> name string
> )
> clustered by (id)
> into 4 buckets
> row format delimited fields terminated by '\t' ;
OK
Time taken: 0.111 seconds
--- inspect the table structure
--- load the data
hive (dyhtest)> load data inpath '/student.txt' into table student_bucket ;
Query ID = atdyh_20220713121702_9e78cb07-fdfe-4745-8c9e-02811b925b44
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0008, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0008/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657525216716_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2022-07-13 12:17:14,891 Stage-1 map = 0%, reduce = 0%
2022-07-13 12:17:22,143 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.61 sec
2022-07-13 12:17:31,862 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 12.09 sec
2022-07-13 12:17:36,310 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 21.05 sec
2022-07-13 12:17:37,396 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 49.33 sec
MapReduce Total cumulative CPU time: 49 seconds 330 msec
Ended Job = job_1657525216716_0008
Loading data to table dyhtest.student_bucket
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0009, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0009/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1657525216716_0009
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2022-07-13 12:17:52,462 Stage-3 map = 0%, reduce = 0%
2022-07-13 12:17:57,937 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.65 sec
2022-07-13 12:18:04,327 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 5.67 sec
MapReduce Total cumulative CPU time: 5 seconds 670 msec
Ended Job = job_1657525216716_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 49.33 sec HDFS Read: 33366 HDFS Write: 1249 SUCCESS
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 5.67 sec HDFS Read: 12156 HDFS Write: 335 SUCCESS
Total MapReduce CPU Time Spent: 55 seconds 0 msec
OK
student_bucket__temp_table_for_load_data__.id student_bucket__temp_table_for_load_data__.name
Time taken: 62.976 seconds
Notes:
- Hive buckets rows by hashing the value of the bucketing column and taking it modulo the number of buckets; the remainder decides which bucket the record lands in (see the sketch at the end of this section).
- Set the number of reducers to -1 so the job decides how many it needs, or set it to a value greater than or equal to the bucket count of the table.
- Load data into a bucketed table from HDFS to avoid local file-not-found problems.
- Do not use local mode.
- Partitioning splits the storage directories; bucketing splits the data files themselves.
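As a rough illustration of the hashing rule (a sketch only: for an int id the hash code is the value itself, so with 4 buckets the bucket is roughly id % 4; hash() is a Hive built-in):
--- hypothetical check of bucket assignment in the 4-bucket table
select id, hash(id) % 4 as bucket_id from student_bucket limit 5;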