Impala advanced SET options: APPX_COUNT_DISTINCT

2023-11-06

Official documentation

https://impala.apache.org/docs/build/html/topics/impala_appx_count_distinct.html

When the APPX_COUNT_DISTINCT query option is set to TRUE, Impala implicitly converts COUNT(DISTINCT) operations to NDV() function calls. The resulting count is approximate rather than precise. Enable the query option when a tolerable amount of error is acceptable in order to obtain faster query results than with COUNT(DISTINCT) queries.

Type: Boolean; recognized values are 1 and 0, or true and false; any other value interpreted as false

Default: false (shown as 0 in output of SET statement)

Put simply: after setting APPX_COUNT_DISTINCT=1 (that is, true), a select count(distinct column) query runs faster, but the result is not guaranteed to be exact.

Hands-on test

select count(distinct company_id) from ia_fdw_b_profile_product_info; -- 4.97s, result: 27664540

set APPX_COUNT_DISTINCT=1;

select count(distinct company_id) from ia_fdw_b_profile_product_info; -- 1.38s, result: 29165564

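With the option enabled the query ran several times faster (4.97s down to 1.38s), but the error is not small either: the two counts differ by about 1.5 million. So use it with caution.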
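To put numbers on that trade-off, here is a minimal sketch that computes the speedup and the relative error from the two runs above. The counts and timings are copied from the test; the variable and function names are only illustrative.

```python
# Counts and timings copied from the two queries above.
exact_count, exact_secs = 27664540, 4.97      # COUNT(DISTINCT), option off
approx_count, approx_secs = 29165564, 1.38    # same query with APPX_COUNT_DISTINCT=1


def relative_error(true_value, estimate):
    """Return abs(true - estimate) / true as a fraction."""
    return abs(true_value - estimate) / true_value


speedup = exact_secs / approx_secs
error = relative_error(exact_count, approx_count)

print(f"speedup: {speedup:.1f}x")
print(f"relative error: {error:.2%}")
```

On this table the approximate query is roughly 3.6 times faster, at the price of an error of a little over 5%.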

NDV()

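The documentation mentions an ndv function here, but I did not see it in the output of show functions; I only came across it later on the official site by chance.

Let's take a look at it as well.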

https://impala.apache.org/docs/build/html/topics/impala_ndv.html#ndv

Overview

An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col), the "number of distinct values". It is much faster than the combination of COUNT and DISTINCT, and uses a constant amount of memory and thus is less memory-intensive for columns with high cardinality.

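In short: NDV() is an aggregate function that returns roughly the same value as COUNT(DISTINCT col), i.e. the number of values after deduplication. It is much faster than count(distinct) and uses a fixed amount of memory, so it needs less memory on high-cardinality columns.

Syntax: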

NDV([DISTINCT | ALL] expression [,scale])

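To be honest, I do not fully understand this syntax yet. As a usage example, select ndv(column) from table is the approximate counterpart of select count(distinct column) from table. The scale argument is a precision setting, but every time I passed it the query failed with an error.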

Note: The optional argument scale must be an integer and can be in the range from 1 to 10 and maps to a precision used by the HyperLogLog (HLL) algorithm with the following mapping formula:

precision = scale + 8

Therefore a scale of 1 is mapped to a precision of 9 and a scale of 10 is mapped to a precision of 18.

Without the optional argument, the precision, which determines the total number of different estimators in the HLL algorithm, will still be 10.

A large precision value generally produces a better estimation with less error than a small precision value, because more estimators are involved. The cost is extra memory: for a given precision p, the amount of memory used by the HLL algorithm is on the order of 2^p bytes.
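The mapping from scale to precision to memory described in the note can be tabulated with a few lines of Python. This is only a sketch of the formulas quoted above (precision = scale + 8, default precision 10, memory on the order of 2^p bytes); the function names are made up for illustration and are not part of Impala.

```python
DEFAULT_PRECISION = 10  # precision used when NDV() is called without a scale


def ndv_precision(scale=None):
    """Map the optional NDV() scale argument to the HLL precision."""
    if scale is None:
        return DEFAULT_PRECISION
    if not isinstance(scale, int) or not 1 <= scale <= 10:
        raise ValueError("scale must be an integer from 1 to 10")
    return scale + 8


def hll_memory_bytes(precision):
    """Order-of-magnitude memory for a given precision p: 2^p bytes."""
    return 2 ** precision


for scale in (None, 1, 2, 10):
    p = ndv_precision(scale)
    kib = hll_memory_bytes(p) / 1024
    print(f"scale={scale}: precision={p}, about {kib:g} KiB")
```

One consequence of the formula is that a scale of 2 gives the same precision as omitting the argument, and a scale of 1 is actually coarser than the default.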

When provided a scale of 10 against a total of 22 distinct data sets loaded into external Impala tables, the error will be computed as abs(<true_unique_value> - <estimated_unique_value>) / <true_unique_value>

The scale of 10, mapped to the precision of 18, yielded the worst estimation error at 0.42% (for one set of 10 million integers), and average error no more than 0.17%. This was at the cost of 256 KB (2^18 bytes) of memory for the internal data structure per evaluation of the HLL algorithm.
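To get a feel for why more estimators reduce the error, here is a small textbook-style HyperLogLog sketch in Python. It is not Impala's implementation: the SHA-1 based hash, the function name hll_estimate and the test data are all assumptions made for this illustration, and the error it reports uses the formula quoted above.

```python
import hashlib
import math


def hll_estimate(values, precision=10):
    """Estimate the number of distinct values with a basic HyperLogLog sketch."""
    m = 1 << precision                  # number of registers (estimators)
    tail_bits = 64 - precision          # hash bits left after picking a register
    registers = [0] * m
    for v in values:
        # 64-bit hash derived from SHA-1 of the value's string form (an assumption)
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> tail_bits            # leading bits select the register
        rest = h & ((1 << tail_bits) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = tail_bits - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)    # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:        # small-range correction (linear counting)
        return m * math.log(m / zeros)
    return raw


true_count = 100_000
for p in (9, 10, 14):
    est = hll_estimate(range(true_count), precision=p)
    err = abs(true_count - est) / true_count
    print(f"precision={p}: estimate={round(est)}, relative error={err:.2%}")
```

The expected relative error of HyperLogLog is roughly 1.04 / sqrt(2^p), so each extra bit of precision doubles the memory and shrinks the typical error by a factor of about 1.4.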

Official examples

Examples:

The following example queries a billion-row table to illustrate the relative performance of COUNT(DISTINCT) and NDV(). It shows how COUNT(DISTINCT) gives a precise answer, but is inefficient for large-scale data where an approximate result is sufficient. The NDV() function gives an approximate result but is much faster.

select count(distinct col1) from sample_data;
+---------------------+
| count(distinct col1)|
+---------------------+
| 100000              |
+---------------------+
Fetched 1 row(s) in 20.13s

select cast(ndv(col1) as bigint) as col1 from sample_data;
+----------+
| col1     |
+----------+
| 139017   |
+----------+
Fetched 1 row(s) in 8.91s

The following example shows how you can code multiple NDV() calls in a single query, to easily learn which columns have substantially more or fewer distinct values. This technique is faster than running a sequence of queries with COUNT(DISTINCT) calls.

select cast(ndv(col1) as bigint) as col1, cast(ndv(col2) as bigint) as col2,
    cast(ndv(col3) as bigint) as col3, cast(ndv(col4) as bigint) as col4
  from sample_data;
+----------+-----------+------------+-----------+
| col1     | col2      | col3       | col4      |
+----------+-----------+------------+-----------+
| 139017   | 282       | 46         | 145636240 |
+----------+-----------+------------+-----------+
Fetched 1 row(s) in 34.97s

select count(distinct col1) from sample_data;
+---------------------+
| count(distinct col1)|
+---------------------+
| 100000              |
+---------------------+
Fetched 1 row(s) in 20.13s

select count(distinct col2) from sample_data;
+----------------------+
| count(distinct col2) |
+----------------------+
| 278                  |
+----------------------+
Fetched 1 row(s) in 20.09s

select count(distinct col3) from sample_data;
+-----------------------+
| count(distinct col3)  |
+-----------------------+
| 46                    |
+-----------------------+
Fetched 1 row(s) in 19.12s

select count(distinct col4) from sample_data;
+----------------------+
| count(distinct col4) |
+----------------------+
| 147135880            |
+----------------------+
Fetched 1 row(s) in 266.95s
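The official output above can be summarized numerically. The sketch below only reuses the counts and timings printed in the examples; the dictionary layout and variable names are illustrative.

```python
# (exact COUNT(DISTINCT), NDV() estimate, COUNT(DISTINCT) seconds) per column,
# copied from the example output above.
results = {
    "col1": (100000, 139017, 20.13),
    "col2": (278, 282, 20.09),
    "col3": (46, 46, 19.12),
    "col4": (147135880, 145636240, 266.95),
}
ndv_single_query_secs = 34.97  # the one query that computes all four NDV() values

errors = {}
for col, (exact, estimate, _) in results.items():
    errors[col] = abs(exact - estimate) / exact
    print(f"{col}: relative error {errors[col]:.2%}")

count_distinct_total = sum(secs for _, _, secs in results.values())
print(f"four COUNT(DISTINCT) queries: {count_distinct_total:.2f}s, "
      f"one NDV() query: {ndv_single_query_secs}s")
```

In this sample the error ranges from zero (col3) to about 39% (col1), while the single NDV() query takes about a tenth of the combined time of the four COUNT(DISTINCT) queries.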

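To be honest, even at this point I do not quite see what else this function is good for. Its only use seems to be making distinct counts a bit faster.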
