Hive Basics (5)

2023-11-19

III. Sorting

1. Order By: global ordering, carried out by a single reducer


a. ASC (ascend): ascending order (the default)
   DESC (descend): descending order

b. The ORDER BY clause is placed at the end of the SELECT statement.
demo1: sort by salary in ascending order

hive (dyhtest)> select * from emp order by sal;
Query ID = atdyh_20220705230538_47b4d6c8-e5ba-426a-8924-c0807a85e328
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1657033458921_0001, Tracking URL = http://hadoop103:8088/proxy/application_1657033458921_0001/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1657033458921_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-05 23:05:51,612 Stage-1 map = 0%,  reduce = 0%
2022-07-05 23:05:59,884 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.83 sec
2022-07-05 23:06:06,052 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.14 sec
MapReduce Total cumulative CPU time: 5 seconds 140 msec
Ended Job = job_1657033458921_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.14 sec   HDFS Read: 13396 HDFS Write: 906 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 140 msec
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
NULL	CLERK	7902	NULL	800.00	NULL	20.0	NULL
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
Time taken: 28.548 seconds, Fetched: 14 row(s)

demo2: sort by salary in descending order

hive (dyhtest)> select * from emp order by sal desc;
Query ID = atdyh_20220705232041_06712168-6943-49ef-954d-67c02eeac0a5
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1657033458921_0002, Tracking URL = http://hadoop103:8088/proxy/application_1657033458921_0002/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1657033458921_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-05 23:20:48,938 Stage-1 map = 0%,  reduce = 0%
2022-07-05 23:20:58,386 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.1 sec
2022-07-05 23:21:05,576 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.51 sec
MapReduce Total cumulative CPU time: 6 seconds 510 msec
Ended Job = job_1657033458921_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.51 sec   HDFS Read: 13479 HDFS Write: 906 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 510 msec
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
NULL	CLERK	7902	NULL	800.00	NULL	20.0	NULL
Time taken: 24.705 seconds, Fetched: 14 row(s)

When Hive parses the SQL, it derives the number of map and reduce tasks from the statement itself (see the mapper/reducer counts reported in the job log above).
Note:
a. The number of reducers can be set with a parameter, but a global ORDER BY still runs with exactly one reducer.
b. set mapreduce.job.reduces=-1; // -1 means "decide per job": the engine starts as many reduce tasks as the job needs; any other value is the number of reduce tasks actually launched.
c. Settings made in the client (rather than in a config file) only last for the current session; once the session ends they have to be set again.

For example:

hive (dyhtest)> set mapreduce.job.reduces=-3;
hive (dyhtest)> select * from emp order by sal desc;
Query ID = atdyh_20220705234313_7a1e822e-2798-4ad8-8d7c-969ed7bb98ab
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1657033458921_0003, Tracking URL = http://hadoop103:8088/proxy/application_1657033458921_0003/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1657033458921_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-05 23:43:22,017 Stage-1 map = 0%,  reduce = 0%
2022-07-05 23:43:29,306 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.44 sec
2022-07-05 23:43:34,481 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.03 sec
MapReduce Total cumulative CPU time: 5 seconds 30 msec
Ended Job = job_1657033458921_0003
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.03 sec   HDFS Read: 13479 HDFS Write: 906 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 30 msec
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
NULL	CLERK	7902	NULL	800.00	NULL	20.0	NULL
Time taken: 22.062 seconds, Fetched: 14 row(s)
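Since ORDER BY funnels all rows through a single reducer, a common way to keep it usable on larger tables is to combine it with LIMIT, so the reducer only has to emit the top rows. A minimal sketch against the emp table above (not from the original post):

--- top 3 salaries only; the single reducer still sorts all rows, but returns just 3
select * from emp order by sal desc limit 3;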

2. Sort By
a. ORDER BY is very inefficient on large data sets. In many cases a global ordering is not required, and SORT BY can be used instead.
b. SORT BY produces one sorted file per reducer: each reducer's output is sorted internally, but the overall result set is not globally sorted.

--- set the number of reduce tasks
hive (dyhtest)> set mapreduce.job.reduces=3;
--- check the configured number of reduce tasks
hive (dyhtest)> set mapreduce.job.reduces;
mapreduce.job.reduces=3

--- run a query to verify
hive (dyhtest)> select * from emp sort by deptno desc;
Query ID = atdyh_20220711172431_1e75f23e-3e66-414e-a52b-293c241febd6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0001, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0001/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1657525216716_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2022-07-11 17:24:50,409 Stage-1 map = 0%,  reduce = 0%
2022-07-11 17:24:58,815 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.09 sec
2022-07-11 17:25:09,084 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 10.43 sec
2022-07-11 17:25:14,357 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 18.82 sec
2022-07-11 17:25:15,385 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 28.02 sec
MapReduce Total cumulative CPU time: 28 seconds 20 msec
Ended Job = job_1657525216716_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 3   Cumulative CPU: 28.02 sec   HDFS Read: 26550 HDFS Write: 1080 SUCCESS
Total MapReduce CPU Time Spent: 28 seconds 20 msec
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
NULL	CLERK	7902	NULL	800.00	NULL	20.0	NULL
Time taken: 44.645 seconds, Fetched: 14 row(s)
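To see that SORT BY only orders rows within each reducer, the result can be written out to a local directory: with mapreduce.job.reduces=3 you get three files, each sorted on its own but not globally. A hedged sketch (the output path below is made up for illustration):

--- one output file per reducer, each internally sorted by deptno
insert overwrite local directory '/opt/module/hive-3.1.2/datas/sortby-demo'
select * from emp sort by deptno desc;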

3. Distribute By
a. In some cases we need to control which reducer a particular row goes to, usually so that a later aggregation can be done per group; the DISTRIBUTE BY clause does exactly this.
b. DISTRIBUTE BY is similar to a custom partitioner in MapReduce: it partitions the rows, and is typically combined with SORT BY.
c. The partitioning rule: take the hash code of the distribute-by column and mod it by the number of reducers; rows with the same remainder go to the same reducer (see the small illustration below).
d. Hive requires the DISTRIBUTE BY clause to appear before the SORT BY clause.
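As a quick illustration of rule c (a hedged sketch, not from the original post), Hive's built-in hash() function can preview roughly which of the 3 reducers configured above a given deptno would map to:

--- roughly how rows are assigned to the 3 reducers set earlier
select deptno, hash(deptno) % 3 as reducer_id from emp;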

demo: first distribute by department number, then sort by employee number in descending order within each reducer.

--- create the output directory
[atdyh@hadoop102 datas]$ mkdir sort—by
[atdyh@hadoop102 sort—by]$ pwd
/opt/module/hive-3.1.2/datas/sort—by

--- write the distributed and sorted result to the directory
hive (dyhtest)> insert overwrite local directory '/opt/module/hive-3.1.2/datas/sort—by' select * from emp distribute by deptno sort by empno desc;
Query ID = atdyh_20220711180613_efe25699-1bd3-49f7-876e-3a7d6149ebdc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0003, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0003/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1657525216716_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2022-07-11 18:06:19,136 Stage-1 map = 0%,  reduce = 0%
2022-07-11 18:06:26,307 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.04 sec
2022-07-11 18:06:36,916 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 9.75 sec
2022-07-11 18:06:37,943 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 25.68 sec
MapReduce Total cumulative CPU time: 25 seconds 680 msec
Ended Job = job_1657525216716_0003
Moving data to local directory /opt/module/hive-3.1.2/datas/sort—by
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 3   Cumulative CPU: 25.68 sec   HDFS Read: 25601 HDFS Write: 651 SUCCESS
Total MapReduce CPU Time Spent: 25 seconds 680 msec
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
Time taken: 25.687 seconds

4. Cluster By
a. When the DISTRIBUTE BY and SORT BY columns are the same, CLUSTER BY can be used instead.
b. CLUSTER BY combines the behaviour of DISTRIBUTE BY and SORT BY, but the sort is ascending only; ASC or DESC cannot be specified.

--- these two statements are equivalent
hive (default)> select * from emp cluster by deptno;
hive (default)> select * from emp distribute by deptno sort by deptno;

Note: distributing by deptno does not give each department number its own fixed partition; for example, departments 20 and 30 may land in the same reducer partition.

IV. Partitioned Tables

a. Characteristics:

  • A partitioned table corresponds to a separate directory on the HDFS file system; that directory holds all the data files of the partition.
  • A Hive partition is simply a subdirectory: a large data set is split into smaller data sets according to business needs.
  • At query time, an expression in the WHERE clause selects only the partitions needed, which makes such queries much more efficient (see the query example further below).

b. Partitioned table creation syntax (table names should follow your team's naming conventions):
create table table_name (
  column_name1 column_type,
  column_name2 column_type,
  ...
)
partitioned by (partition_column partition_column_type)
row format delimited fields terminated by '\t';

--- create a partitioned table
hive (dyhtest)> create table dept_partition (
              >   deptno int ,
              >   dname string,
              >   loc string 
              > ) 
              > partitioned by (day string ) 
              > row format delimited fields terminated by '\t' ;
OK
Time taken: 4.429 seconds

--- prepare the data
[atdyh@hadoop102 patitionedby]$ vim 2022-7-12
[atdyh@hadoop102 patitionedby]$ vim 2022-7-13
[atdyh@hadoop102 patitionedby]$ vim 2022-7-14

--- load data into the partitioned table
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-12' into table dept_partition partition(day='2022-7-12');
Loading data to table dyhtest.dept_partition partition (day=2022-7-12)
OK
Time taken: 1.989 seconds
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-13' into table dept_partition partition(day='2022-7-13');
Loading data to table dyhtest.dept_partition partition (day=2022-7-13)
OK
Time taken: 0.617 seconds
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-14' into table dept_partition partition(day='2022-7-14');
Loading data to table dyhtest.dept_partition partition (day=2022-7-14)
OK
Time taken: 0.548 seconds

On HDFS, the partitioned table maps to one subdirectory per partition (screenshot omitted), e.g. /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-12.
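With the three partitions loaded, a WHERE predicate on the partition column reads only the matching directories instead of scanning the whole table. A small sketch against the dept_partition table created above:

--- single-partition query (only day=2022-7-12 is scanned)
select * from dept_partition where day='2022-7-12';
--- multi-partition query
select * from dept_partition where day='2022-7-12' or day='2022-7-13';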
c. Partition operations:
View partitions:

hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
Time taken: 0.243 seconds, Fetched: 3 row(s)

Add a partition:

hive (dyhtest)> alter table dept_partition add partition(day ='2022-07-15');
OK
Time taken: 0.292 seconds
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-07-15
day=2022-7-12
day=2022-7-13
day=2022-7-14
Time taken: 0.149 seconds, Fetched: 4 row(s)

Drop a partition:

hive (dyhtest)> alter table dept_partition drop partition(day ='2022-07-15');
Dropped the partition day=2022-07-15
OK
Time taken: 1.35 seconds
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
Time taken: 0.113 seconds, Fetched: 3 row(s)

Note: when dropping several partitions in one statement, the partition clauses must be separated by commas; when adding several, no commas are used.

Add partitions:
alter table dept_partition add partition(day='2020-04-04');
alter table dept_partition add partition(day='2020-04-05') partition(day='2020-04-06');

Drop partitions:
alter table dept_partition drop partition(day='20200404');
alter table dept_partition drop partition(day='20200405'), partition(day='20200406');

d. Second-level (multi-level) partitions:
Create a two-level partitioned table:

hive (dyhtest)> create table dept_partition2(
              >   deptno int, 
              >   dname string, 
              >   loc string 
              > )
              > partitioned by (day string ,hour string )
              > row format delimited fields terminated by '\t' ;
OK
Time taken: 0.193 seconds

Load the data:

hive (dyhtest)> 
              > load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-13'  
              >                    into table dept_partition2 partition(day='2022-7-12',hour='13') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-12, hour=13)
OK
Time taken: 0.42 seconds
hive (dyhtest)> 
              > load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-14'  
              >                    into table dept_partition2 partition(day='2022-7-12',hour='14') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-12, hour=14)
OK
Time taken: 0.556 seconds
hive (dyhtest)> 
              > load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-12'  
              >                    into table dept_partition2 partition(day='2022-7-13',hour='07') ; 
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-13, hour=07)
OK
Time taken: 0.462 seconds
hive (dyhtest)> 
              > load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-13'  
              >                    into table dept_partition2 partition(day='2022-7-13',hour='08') ;
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-13, hour=08)
OK
Time taken: 0.421 seconds
hive (dyhtest)> 
              > load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-14'  
              >                    into table dept_partition2 partition(day='2022-7-13',hour='09') ;       
Loading data to table dyhtest.dept_partition2 partition (day=2022-7-13, hour=09)
OK
Time taken: 0.441 seconds
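Querying a two-level partitioned table works the same way, with both partition columns in the WHERE clause. A small sketch against dept_partition2 as loaded above:

--- reads only the single directory day=2022-7-13/hour=08
select * from dept_partition2 where day='2022-7-13' and hour='08';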

e. Ways to associate data with a partitioned table

  • Manually create the partition directory, manually upload the data, then run a partition repair
--- create the HDFS directory
[atdyh@hadoop102 patitionedby]$ hadoop fs -mkdir /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-15

--- upload data into that directory
[atdyh@hadoop102 patitionedby]$ hadoop fs -put /opt/module/hive-3.1.2/datas/patitionedby/2022-7-12  /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-15
2022-07-13 01:26:27,861 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

--- repair the partitions
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
Time taken: 0.221 seconds, Fetched: 3 row(s)
hive (dyhtest)> msck repair table dept_partition;
OK
Partitions not in metastore:	dept_partition:day=2022-7-15
Repair: Added partition to metastore dept_partition:day=2022-7-15
Time taken: 0.593 seconds, Fetched: 2 row(s)
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
day=2022-7-15
Time taken: 0.121 seconds, Fetched: 4 row(s)


Note: the partition repair command is used frequently (msck repair table table_name;).

  • Manually create the partition directory, manually upload the data, then manually add the partition
--- create the HDFS directory
[atdyh@hadoop102 patitionedby]$ hadoop fs -mkdir /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-16

--- upload the data
[atdyh@hadoop102 patitionedby]$ hadoop fs -put /opt/module/hive-3.1.2/datas/patitionedby/2022-7-12  /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-16
2022-07-13 01:44:32,687 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

--- add the partition
hive (dyhtest)> alter table dept_partition add partition(day='2022-7-16');
OK
Time taken: 0.435 seconds
hive (dyhtest)> show partitions dept_partition;
OK
partition
day=2022-7-12
day=2022-7-13
day=2022-7-14
day=2022-7-15
day=2022-7-16
Time taken: 0.108 seconds, Fetched: 5 row(s)


Note: the add-partition syntax is also used frequently (alter table table_name add partition(partition_column='partition_value');).

  • Manually create the partition directory, then load data into the partition
--- create the HDFS directory
[atdyh@hadoop102 patitionedby]$ hadoop fs -mkdir /user/hive/warehouse/dyhtest.db/dept_partition/day=2022-7-17

--- load the data
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/patitionedby/2022-7-12'  
              >                    into table dept_partition partition(day='2022-7-17') ; 
Loading data to table dyhtest.dept_partition partition (day=2022-7-17)
OK
Time taken: 2.476 seconds

f. Dynamic partitions

  • In a relational database, when inserting into a partitioned table, the database automatically routes each row to the right partition based on the value of the partition column.
  • Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition); however, using it requires some configuration.
--- dynamic partitioning is enabled by default
hive (dyhtest)> set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=true

--- switch to non-strict mode (dynamic partition mode defaults to strict, which requires at least one partition column to be static; nonstrict allows all partition columns to be dynamic)

--- check the default mode
hive (dyhtest)> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=strict
--- set non-strict mode
hive (dyhtest)> set hive.exec.dynamic.partition.mode=nonstrict;
--- check the mode after setting it
hive (dyhtest)> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=nonstrict

Additional parameters:

  • Maximum total number of dynamic partitions that can be created across all nodes running the MR job. Default 1000.
    hive.exec.max.dynamic.partitions=1000
  • Maximum number of dynamic partitions that can be created on each node running the MR job. Set this based on the actual data: if the source data covers a whole year, the day column has 365 distinct values, so this parameter must be set higher than 365; with the default of 100 the job will fail.
    hive.exec.max.dynamic.partitions.pernode=100
  • Maximum number of HDFS files the whole MR job may create. Default 100000.
    hive.exec.max.created.files=100000
  • Whether to throw an exception when an empty partition is generated. Usually does not need to be set. Default false.
    hive.error.on.empty.partition=false

Adjust these parameters as needed; a small example follows.
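For instance, following the day-column scenario above, the per-node limit could be raised before running a dynamic-partition insert (a hedged sketch; the value 400 is only an illustration):

--- allow up to 400 dynamic partitions per node, enough for 365 day values
set hive.exec.max.dynamic.partitions.pernode=400;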

demo:

--- prepare the data
[atdyh@hadoop102 datas]$ cp dept.txt dept_partition_dy.txt 
[atdyh@hadoop102 datas]$ ll
总用量 52
-rw-rw-r--. 1 atdyh atdyh   69 7月  13 10:29 dept_partition_dy.txt
	
--- load the data
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/dept_partition_dy.txt' into table dept_partition_dy;
Query ID = atdyh_20220713104944_6f5aa5d6-4f15-4725-b75e-7fe273fc97a4
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0004, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0004/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1657525216716_0004
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 1
2022-07-13 10:50:06,384 Stage-1 map = 0%,  reduce = 0%
2022-07-13 10:50:17,880 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU 5.36 sec
MapReduce Total cumulative CPU time: 5 seconds 360 msec
Ended Job = job_1657525216716_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://hadoop102:9820/user/hive/warehouse/dyhtest.db/dept_partition_dy/.hive-staging_hive_2022-07-13_10-49-45_694_2569470323021587088-1/-ext-10000
Loading data to table dyhtest.dept_partition_dy partition (day=null)


	 Time taken to load dynamic partitions: 0.043 seconds
	 Time taken for adding to write entity : 0.0 seconds
MapReduce Jobs Launched: 
Stage-Stage-1: Reduce: 1   Cumulative CPU: 5.36 sec   HDFS Read: 8981 HDFS Write: 87 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 360 msec
OK
dept_partition_dy__temp_table_for_load_data__.deptno	dept_partition_dy__temp_table_for_load_data__.dname	dept_partition_dy__temp_table_for_load_data__.loc	dept_partition_dy__temp_table_for_load_data__.day
Time taken: 34.843 seconds

Note: when loading data this way, the file needs to be on HDFS first, otherwise you may run into a file-not-found problem.
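The more common way to use dynamic partitioning is an INSERT ... SELECT in which the partition column is left unvalued and simply listed as the last column of the SELECT. A hedged sketch, assuming dept_partition_dy has columns (deptno, dname, loc) partitioned by (day) as the log above suggests, and a hypothetical source table emp_log that already contains a day column:

--- the last select column feeds the dynamic partition column "day"
insert into table dept_partition_dy partition(day)
select deptno, dname, loc, day from emp_log;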

V. Bucketed Tables

  • Partitioning is a convenient way to isolate data and optimise queries, but not every data set can be partitioned in a sensible way.
  • For a table or a partition, Hive can further organise the data into buckets, i.e. a finer-grained division of the data.
  • Bucketing is another technique for breaking a data set into more manageable pieces.
  • Partitioning works on the storage path (directories); bucketing works on the data files themselves.

demo:
--- prepare the data
[atdyh@hadoop102 datas]$ hadoop fs -put student.txt /
2022-07-13 12:16:19,414 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

--- create a bucketed table
hive (dyhtest)> create table student_bucket(
              >   id int, 
              >   name string 
              > )
              > clustered by (id)
              > into 4 buckets 
              > row format delimited fields terminated by '\t' ;   
OK
Time taken: 0.111 seconds

--- view the table structure (output omitted; desc formatted student_bucket; would show the bucket column and bucket count)

--- load the data
hive (dyhtest)> load  data  inpath '/student.txt' into table student_bucket ;
Query ID = atdyh_20220713121702_9e78cb07-fdfe-4745-8c9e-02811b925b44
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0008, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0008/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1657525216716_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2022-07-13 12:17:14,891 Stage-1 map = 0%,  reduce = 0%
2022-07-13 12:17:22,143 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.61 sec
2022-07-13 12:17:31,862 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 12.09 sec
2022-07-13 12:17:36,310 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 21.05 sec
2022-07-13 12:17:37,396 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 49.33 sec
MapReduce Total cumulative CPU time: 49 seconds 330 msec
Ended Job = job_1657525216716_0008
Loading data to table dyhtest.student_bucket
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1657525216716_0009, Tracking URL = http://hadoop103:8088/proxy/application_1657525216716_0009/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1657525216716_0009
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2022-07-13 12:17:52,462 Stage-3 map = 0%,  reduce = 0%
2022-07-13 12:17:57,937 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.65 sec
2022-07-13 12:18:04,327 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 5.67 sec
MapReduce Total cumulative CPU time: 5 seconds 670 msec
Ended Job = job_1657525216716_0009
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 49.33 sec   HDFS Read: 33366 HDFS Write: 1249 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 5.67 sec   HDFS Read: 12156 HDFS Write: 335 SUCCESS
Total MapReduce CPU Time Spent: 55 seconds 0 msec
OK
student_bucket__temp_table_for_load_data__.id	student_bucket__temp_table_for_load_data__.name
Time taken: 62.976 seconds

Note:

  1. Hive buckets a row by hashing the value of the bucketing column and taking the remainder of dividing by the number of buckets; the remainder decides which bucket the record is stored in.
  2. Either set the number of reducers to -1 and let the job decide how many reduce tasks it needs, or set it to a value greater than or equal to the number of buckets.
  3. Load data into a bucketed table from HDFS, to avoid local-file-not-found problems.
  4. Do not use local mode.
  5. Partitioning splits data into directories; bucketing splits the data files themselves (see the sampling example below).
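Once a table is bucketed, Hive can sample it by bucket, which is one of the main practical benefits. A hedged sketch against the student_bucket table above:

--- read only the 1st of the 4 buckets (roughly a quarter of the rows)
select * from student_bucket tablesample(bucket 1 out of 4 on id);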