Big Data Review Notes: Hive

2023-11-20

This post walks through Hive, a data warehouse tool that comes up often in day-to-day work.

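I. Hive

1. Introduction to Hive

Hive provides a SQL dialect, HiveQL, for querying data stored in a Hadoop cluster, and it translates most queries into MapReduce jobs.
Hive is best suited to data warehouse workloads: analysis of relatively static data that does not change frequently, where results are not needed with low latency.

Data warehouse
A data warehouse is a central repository of information that can be analyzed to make better-informed decisions. Data typically flows into it on a regular basis from transactional systems, relational databases, and other sources. Business analysts, data scientists, and decision makers access the data through business intelligence (BI) tools, SQL clients, and other analytics applications.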

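2. Hive deployment modes

a) Embedded Derby mode

Uses the default embedded metastore database, Derby, which only supports access from a single process.

Note: with Derby storage, running hive creates a derby.log file and a metastore_db directory in the current working directory. The drawback is that only one Hive client at a time can use the database from a given directory; a second client gets an error.

b) Local mode

Uses a separate relational database (for example MySQL) that runs on the same node as Hive.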

Note that the MySQL account Hive connects with must have the privileges Hive needs on the metastore database.

c) Remote mode

Uses a separate relational database that runs on a different node from Hive.

3. Hive database and table operations

Database operations

a) Create a database

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, …)];

hive> create database mydb1 comment 'my db two' location '/user/hive/mymydbdb2' with dbproperties ('key1'='value1', 'key2' = 'value2');
OK
Time taken: 0.077 seconds

b) Drop a database

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
RESTRICT: the drop fails if the database still contains tables
CASCADE: the tables in the database are dropped along with it
RESTRICT is the default

hive> drop database if exists mydb1;
OK
Time taken: 0.011 seconds

c) Alter a database

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, …);
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;

hive> desc database extended mydb1;
OK
mydb1	my db one	hdfs://mycluster/user/hive/mydb1	root	USER	
Time taken: 0.045 seconds, Fetched: 1 row(s)
hive> alter database mydb1 set dbproperties('key1'='value1','key2'='value2');
OK
Time taken: 0.051 seconds
hive> desc database extended mydb1;
OK
mydb1	my db one	hdfs://mycluster/user/hive/mydb1	root	USER	{key1=value1, key2=value2}
Time taken: 0.035 seconds, Fetched: 1 row(s)

d) Switch to a database

USE database_name;
USE DEFAULT;

Creating tables
Example

hive> create table tb_user5 (id int, name string, age int, likes array<string>, addrs map<string, string>) row format delimited fields terminated by ',' collection items terminated by '-' map keys terminated by ':' lines terminated by '\n' stored as SEQUENCEFILE location '/user/hive/u5';
OK
Time taken: 0.096 seconds
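
To make the delimiters in the statement above concrete, here is a minimal Python sketch of how one text row would map onto the columns of tb_user5. It only illustrates the delimiter rules (',' between fields, '-' between array items and between map entries, ':' between a map key and its value); it is not Hive's actual SerDe code, and the function name parse_row and the sample row are made up for this example.

```python
def parse_row(line):
    """Split one text row using the delimiters from the tb_user5 DDL."""
    # fields terminated by ','
    id_, name, age, likes, addrs = line.rstrip("\n").split(",")
    return {
        "id": int(id_),
        "name": name,
        "age": int(age),
        # collection items terminated by '-'
        "likes": likes.split("-"),
        # map entries are also separated by '-', and each key from its value by ':'
        "addrs": dict(entry.split(":", 1) for entry in addrs.split("-")),
    }


# Hypothetical sample row, not taken from the original data file
row = parse_row("1,tom,20,reading-hiking,home:beijing-work:shanghai\n")
print(row)
```

A row that does not follow these delimiters would fail to unpack in this sketch, whereas Hive itself is more lenient and typically yields NULL for values it cannot parse.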

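Partitioned tables (to speed up queries)
a. Creating a table with a single partition column:
① create table day_table (id int, content string) partitioned by (dt string);
② A single-partition table partitioned by day; the table schema has three columns: id, content, and dt.
③ Data is split into one folder per dt value.
b. Creating a table with two partition columns:
① create table day_hour_table (id int, content string) partitioned by (dt string, hour string);
② A two-level partitioned table partitioned by day and hour; dt and hour appear as two additional columns in the table schema.
③ Data is split first into dt folders, then into hour subfolders.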
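
The folder layout described above can be sketched in a few lines of Python. Hive names each partition directory column=value, nested in the order the partition columns are declared. The table directory below is a hypothetical location (the real one depends on the cluster configuration), and partition_path and prune are illustrative helper names, not Hive functions.

```python
def partition_path(table_dir, spec):
    """Build the directory for one partition; spec is an ordered list of (column, value)."""
    parts = [f"{col}={val}" for col, val in spec]
    return "/".join([table_dir.rstrip("/")] + parts)


def prune(paths, column, value):
    """Keep only the partition directories that match column=value."""
    return [p for p in paths if f"{column}={value}" in p.split("/")]


# Hypothetical table location, assumed for this example
table_dir = "/user/hive/warehouse/day_hour_table"
paths = [
    partition_path(table_dir, [("dt", "2023-11-19"), ("hour", "23")]),
    partition_path(table_dir, [("dt", "2023-11-20"), ("hour", "08")]),
    partition_path(table_dir, [("dt", "2023-11-20"), ("hour", "09")]),
]
print(paths[1])
print(prune(paths, "dt", "2023-11-20"))
```

This is why partitioning speeds up queries: a filter on dt lets the engine skip whole directories instead of scanning every file in the table.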

Syntax for adding partitions

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], …];

partition_spec:
:(partition_column = partition_col_value, partition_column = partition_col_value, …)

hive> alter table tb_user1 add partition (age=10) location '/user/hive';
OK
Time taken: 0.239 seconds
hive> load data local inpath '/root/users.txt' into table tb_user1 partition (age=10);
Loading data to table mydb2.tb_user1 partition (age=10)
Partition mydb2.tb_user1{age=10} stats: [numFiles=3, numRows=0, totalSize=1388, rawDataSize=0]
OK
Time taken: 1.836 seconds

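4. Hive SerDe

① A SerDe handles serialization and deserialization.
② It sits between the data storage layer and the execution engine, decoupling the two.
③ Hive reads and writes row content according to the ROW FORMAT DELIMITED or SERDE clause of the table definition.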
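
The decoupling in point ② can be illustrated with a toy Python sketch: the "engine" function below only sees row lists, so the storage format can be swapped by changing the SerDe object. The two classes are made-up stand-ins for this example and are unrelated to Hive's real SerDe classes.

```python
import json


class DelimitedSerDe:
    """Toy SerDe: a row is stored as one delimited text line."""

    def __init__(self, sep=","):
        self.sep = sep

    def serialize(self, row):
        return self.sep.join(str(v) for v in row)

    def deserialize(self, data):
        return data.split(self.sep)


class JsonSerDe:
    """Toy SerDe: a row is stored as a JSON array."""

    def serialize(self, row):
        return json.dumps(row)

    def deserialize(self, data):
        return json.loads(data)


def count_rows_named(stored, serde, name):
    """'Engine' code: works on row lists and is unaware of the storage format."""
    return sum(1 for rec in stored if serde.deserialize(rec)[1] == name)


rows = [["1", "tom"], ["2", "amy"], ["3", "tom"]]
counts = {}
for serde in (DelimitedSerDe(), JsonSerDe()):
    stored = [serde.serialize(r) for r in rows]
    counts[type(serde).__name__] = count_rows_named(stored, serde, "tom")
print(counts)
```

Both formats give the same answer because count_rows_named never touches the stored representation directly.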

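5. Hive indexes

Purpose: speed up queries and lookups. Note that index support was removed in Hive 3.0, so the statements below apply to earlier versions.

a) Create an index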

create index t1_index on table tb_user2(name)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild
in table t1_index_table;
as: specifies the index handler;
in table: specifies the index table; if omitted, the index is stored by default in the table default__tb_user2_t1_index__
create index t2_index on table tb_user2(name)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;
with deferred rebuild means the index starts out empty; when alter index xxx_index on xxx rebuild is executed, Hive calls generateIndexBuildTaskList to obtain the MapReduce tasks for the index and runs them to populate it.
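
As a rough conceptual picture of what the rebuild step fills in: as far as I understand it, a compact index records, for each value of the indexed column, which data files contain it and at which offsets. The Python sketch below mimics that idea with an in-memory dictionary. The file names, offsets, and the function build_compact_index are all invented for illustration and do not reflect Hive's internal code.

```python
from collections import defaultdict


def build_compact_index(rows):
    """rows: iterable of (file_name, offset, name). Returns name -> {file_name: [offsets]}."""
    index = defaultdict(lambda: defaultdict(list))
    for file_name, offset, name in rows:
        index[name][file_name].append(offset)
    return index


# Hypothetical (file, offset, name) triples standing in for the rows of tb_user2
rows = [
    ("part-0", 0, "tom"),
    ("part-0", 128, "amy"),
    ("part-1", 0, "tom"),
]
index = build_compact_index(rows)
print(dict(index["tom"]))
```

A lookup on name can then go straight to the listed files and offsets instead of scanning the whole table, which is the performance benefit the index is meant to provide.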

b) Show indexes

show index on tb_user2;

c) Rebuild an index

(After an index is created, it must be rebuilt before it takes effect.)
ALTER INDEX t1_index ON tb_user2 REBUILD;

d) Drop an index

DROP INDEX IF EXISTS t1_index ON tb_user2;

6. Ways to run Hive

Command line (CLI): interactive console mode
Script execution (the most common in real production environments)
JDBC: through hiveserver2
Web GUI (hwi, hue, etc.)

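There is more to Hive, such as bucketing, views, and dynamic partitioning. I have not studied these in depth or tried them hands-on, so I will not embarrass myself by writing them up here. If you need them, have a look at blogs by more experienced authors, who cover them in much greater detail.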
