Basic operations
$>cat test.txt
12,23,23,34 what,are,this
34,45,34,23,12 who,am,i,are
hive> create table t_afan_test
> (
> info1 array<int>,
> info2 array<string>
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> COLLECTION ITEMS TERMINATED BY ','
> ;
hive> LOAD DATA LOCAL INPATH 'test.txt' OVERWRITE INTO TABLE t_afan_test;
hive> select * from t_afan_test;
OK
[12,23,23,34] ["what","are","this"]
[34,45,34,23,12] ["who","am","i","are"]
Time taken: 0.429 seconds
hive> select size(info1), size(info2) from t_afan_test;
OK
4 3
5 4
Time taken: 20.171 seconds
hive> select info1[2], info2[0] from t_afan_test;
23 what
34 who
Time taken: 10.88 seconds
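As the info1[2] and info2[0] query shows, array indexes are zero-based. Beyond size and indexing, a few other built-in array functions are often useful. The following is a minimal sketch, assuming the same t_afan_test table; the aliases t, e and item are made up for illustration, and no output is shown because these queries were not part of the original session.

```sql
-- Sketch only: assumes the t_afan_test table created above.

-- Check whether an array contains a given value.
SELECT array_contains(info2, 'are') FROM t_afan_test;

-- Return a copy of the array sorted in ascending order.
SELECT sort_array(info1) FROM t_afan_test;

-- Turn each element of info1 into its own row, keeping info2 alongside it.
SELECT t.info2, e.item
FROM t_afan_test t
LATERAL VIEW explode(t.info1) e AS item;
```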
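A pitfall when computing array length with Hive's size
A Hive table has a column p_9 whose values are comma-separated strings. The query below counts how many elements p_9 contains after splitting on commas, summed per rg.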
select rg,sum(size(split(p_9,","))) from ttengine_api_data where dt='2017-08-07' group by rg;
OK
0 137683
1 150155
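When p_9 is non-empty, the count is correct. When it is empty ("" or null), the count is wrong. A closer look showed that size(split(p_9,",")) is the culprit:
if p_9 is an empty string, split returns an array holding a single empty string, so size reports 1 instead of 0 (a null input is not counted as 0 either). Once the cause is known the fix is simple; counting as follows gives the expected result: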
select rg,sum(if(length(p_9)==0,0,size(split(p_9,",")))) from ttengine_api_data where dt='2017-08-07' group by rg;
OK
0 0
1 6373
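The behavior can be checked in isolation, and the fix can be made null-safe. The following is a minimal sketch, not the original author's query: it assumes a Hive version that accepts SELECT without a FROM clause, and the -1 result for a null input reflects how Hive's size is commonly documented, so verify it on your own version. Note that length(null) is null, so the if(...) in the query above takes the size(...) branch for null rows; the variant below tests for null explicitly.

```sql
-- Sketch only: check the edge cases directly.

-- An empty string splits into an array holding one empty string, so size is 1.
SELECT size(split('', ','));

-- A NULL input: split returns NULL, and size of NULL is expected to be -1, not 0.
SELECT size(split(CAST(NULL AS STRING), ','));

-- NULL-safe variant of the fix: count 0 for both NULL and empty strings.
SELECT rg,
       sum(if(p_9 IS NULL OR length(p_9) = 0, 0, size(split(p_9, ','))))
FROM ttengine_api_data
WHERE dt = '2017-08-07'
GROUP BY rg;
```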