Elasticsearch deduplicated queries / filtering duplicate data (aggregations)

2023-05-16

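Duplicate data was imported into ES in one of our company's environments, so queries started returning duplicate records. A few of us developers were asked to come up with solutions. After talking it over, we had the following options:
1. Fix it at the source: enforce a uniqueness check when importing data
2. Fix the data: clean it up by finding the duplicates, removing them, and re-importing
3. Fix it at query time: filter out duplicates when querying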
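
As a rough illustration of option 1, one common way to get a uniqueness check at import time is to derive the document `_id` from the fields that define a duplicate, since indexing a document under an existing `_id` overwrites it instead of adding a copy. The sketch below is plain Python with no ES client; the helper name `dedup_id`, the choice of key fields, and the dict standing in for the index are all assumptions for illustration.

```python
import hashlib
import json

def dedup_id(doc, key_fields):
    """Derive a deterministic document ID from the fields that define uniqueness."""
    # Serialize only the key fields, in a fixed order, so equal records hash equally.
    payload = json.dumps([doc.get(f) for f in key_fields], ensure_ascii=False)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

docs = [
    {"user": "双榆树-张三", "age": 20, "city": "北京"},
    {"user": "双榆树-张三", "age": 20, "city": "北京"},  # duplicate import
    {"user": "虹桥-老吴", "age": 90, "city": "上海"},
]

# A dict keyed by ID stands in for the index: writing the same ID twice overwrites.
index = {dedup_id(d, ["user", "age", "city"]): d for d in docs}
print(len(index))  # 2
```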

I started from the query side and found an approach based on aggregation queries.

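Aggregations

Aggregations give ES statistical analysis capabilities, similar to GROUP BY, AVG, SUM and so on in SQL.

Buckets: a set of documents that match a criterion, equivalent to GROUP BY in SQL.

The bucket concept shows up in many places, for example bucket sort, or the array in a HashMap implementation, which can also be viewed as a set of buckets.

Example:
Group the documents of the twitter index by city.
aggs: the aggregations
my: a user-defined name for the aggregation
terms: bucket documents by the values of a field
field: the field to bucket on
city: the name of the field being grouped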

GET /twitter/doc/_search
{
  "from": 0,
  "size": 0,
  "aggs": {
    "my": {
      "terms": {
        "field": "city"
      }
    }
  }
}

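The aggregation part of the result:
Each bucket shows a distinct value (key) and the number of documents that hit it (doc_count).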

"aggregations": {
    "my": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "北京",
          "doc_count": 105
        },
        {
          "key": "上海",
          "doc_count": 1
        }
      ]
    }
  }
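
To make the bucket idea concrete, here is a small sketch in plain Python (no ES involved) that mimics what this terms aggregation computes: one bucket per distinct city with a document count, largest bucket first. The sample documents are made up to match the counts in the result.

```python
from collections import Counter

# Made-up sample data: 105 documents for 北京 and 1 for 上海.
docs = [{"city": "北京"}] * 105 + [{"city": "上海"}]

# Count documents per distinct city, like GROUP BY city with COUNT(*).
counts = Counter(d["city"] for d in docs)

# most_common() orders by count descending, like the default bucket order.
buckets = [{"key": k, "doc_count": n} for k, n in counts.most_common()]
print(buckets)
# [{'key': '北京', 'doc_count': 105}, {'key': '上海', 'doc_count': 1}]
```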

But this is only the statistics. What I want is the filtered data itself.

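The top_hits metric aggregation

The top_hits metric aggregation keeps track of the most relevant documents being aggregated. Used as a sub-aggregation of a bucket aggregation, it can effectively group the result set by a field.
Options:
from - the offset of the first result to fetch.
size - the maximum number of top matching hits to return per bucket. By default, the top three hits are returned.
sort - how the top matching hits are sorted. By default, hits are sorted by the score of the main query.

Example:
Group the documents of the twitter index by city, sort by age, include only user, age and city in the results, and show one document per group.
aggs: the aggregations
my: a user-defined name for the aggregation
terms: bucket documents by the values of a field
field: the field to bucket on
city: the name of the field being grouped
sort: the sort definition
age: the field to sort by
order: the sort direction
desc: descending
_source includes: the fields to include in the results
size: the number of documents to show per group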

{
  "from": 0,
  "size": 0,
  "aggs": {
    "my": {
      "terms": {
        "field": "city"
      },
      "aggs": {
        "my_top_hits": {
          "top_hits": {
            "sort": [
              {
                "age": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "user",
                "age",
                "city"
              ]
            },
            "size": 1
          }
        }
      }
    }
  }
}

The aggregation part of the result:

"aggregations": {
    "my": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "北京",
          "doc_count": 105,
          "my_top_hits": {
            "hits": {
              "total": 105,
              "max_score": null,
              "hits": [
                {
                  "_index": "twitter",
                  "_type": "doc",
                  "_id": "AW5jwgirrweXGTc7-cPA",
                  "_score": null,
                  "_source": {
                    "city": "北京",
                    "user": "朝阳区-老王",
                    "age": 50
                  },
                  "sort": [
                    50
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "上海",
          "doc_count": 1,
          "my_top_hits": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits": [
                {
                  "_index": "twitter",
                  "_type": "doc",
                  "_id": "AW5jwiM1rweXGTc7-cPB",
                  "_score": null,
                  "_source": {
                    "city": "上海",
                    "user": "虹桥-老吴",
                    "age": 90
                  },
                  "sort": [
                    90
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
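
The behavior of terms plus top_hits can be pictured with a short plain-Python sketch: group by one field, sort each group by another field in descending order, and keep the first `size` documents. The function name `top_hits_per_bucket` and the sample documents are assumptions for illustration; this is not how ES implements it internally.

```python
from collections import defaultdict

def top_hits_per_bucket(docs, group_field, sort_field, size=1):
    """Group docs by group_field and keep the top `size` docs per group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc)
    # Sort each group by sort_field descending, then truncate to `size`.
    return {
        key: sorted(members, key=lambda d: d[sort_field], reverse=True)[:size]
        for key, members in groups.items()
    }

docs = [
    {"user": "双榆树-张三", "age": 20, "city": "北京"},
    {"user": "朝阳区-老王", "age": 50, "city": "北京"},
    {"user": "朝阳区-老贾", "age": 35, "city": "北京"},
    {"user": "虹桥-老吴", "age": 90, "city": "上海"},
]

result = top_hits_per_bucket(docs, "city", "age", size=1)
print(result["北京"][0]["user"])  # 朝阳区-老王
```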

But with terms alone, once I added more fields the query returned nothing at all. Is this still not enough?

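Aggregating with a script

A plain terms aggregation cannot perform complex operations inside the aggregation, so we bring in a script.
Example:
Change the contents of terms to the following, which concatenates the three fields into a single key: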

"terms":{
	      "script": "doc['user.keyword'].value + '#' + doc['age'].value + '#' +doc['city'].value"
	    },

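Query result:
key: the concatenated field values
doc_count: the number of documents in each group, i.e. how many times that record occurs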

"aggregations": {
    "my": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "双榆树-张三#20#北京",
          "doc_count": 101,
          "my_top_hits": {
            "hits": {
              "total": 101,
              "max_score": null,
              "hits": [
                {
                  "_index": "twitter",
                  "_type": "doc",
                  "_id": "AW9lr8sBP5iHlpen8GYt",
                  "_score": null,
                  "_source": {
                    "city": "北京",
                    "user": "双榆树-张三",
                    "age": 20
                  },
                  "sort": [
                    20
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "东城区-李四#30#北京",
          "doc_count": 1,
          "my_top_hits": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits": [
                {
                  "_index": "twitter",
                  "_type": "doc",
                  "_id": "AW5jwaOIrweXGTc7-cO-",
                  "_score": null,
                  "_source": {
                    "city": "北京",
                    "user": "东城区-李四",
                    "age": 30
                  },
                  "sort": [
                    30
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "东城区-老刘#30#北京",
          "doc_count": 1,
          "my_top_hits": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits": [
                {
                  "_index": "twitter",
                  "_type": "doc",
                  "_id": "AW5jwXhcrweXGTc7-cO9",
                  "_score": null,
                  "_source": {
                    "city": "北京",
                    "user": "东城区-老刘",
                    "age": 30
                  },
                  "sort": [
                    30
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "朝阳区-老王#50#北京",
          "doc_count": 1,
          "my_top_hits": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits": [
                {
                  "_index": "twitter",
                  "_type": "doc",
                  "_id": "AW5jwgirrweXGTc7-cPA",
                  "_score": null,
                  "_source": {
                    "city": "北京",
                    "user": "朝阳区-老王",
                    "age": 50
                  },
                  "sort": [
                    50
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "朝阳区-老贾#35#北京",
          "doc_count": 1,
          "my_top_hits": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits": [
                {
                  "_index": "twitter",
                  "_type": "doc",
                  "_id": "AW5jwcvBrweXGTc7-cO_",
                  "_score": null,
                  "_source": {
                    "city": "北京",
                    "user": "朝阳区-老贾",
                    "age": 35
                  },
                  "sort": [
                    35
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "虹桥-老吴#90#上海",
          "doc_count": 1,
          "my_top_hits": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits": [
                {
                  "_index": "twitter",
                  "_type": "doc",
                  "_id": "AW5jwiM1rweXGTc7-cPB",
                  "_score": null,
                  "_source": {
                    "city": "上海",
                    "user": "虹桥-老吴",
                    "age": 90
                  },
                  "sort": [
                    90
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
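
In plain Python, the effect of the script is roughly the following: build one composite key per document by joining the field values with '#', then keep a single document per key. The helper names `composite_key` and `dedupe` and the sample data are assumptions for illustration.

```python
def composite_key(doc, fields, sep="#"):
    """Join the chosen field values into one key, as the script does."""
    return sep.join(str(doc[f]) for f in fields)

def dedupe(docs, fields):
    """Keep the first document seen for each composite key."""
    seen = {}
    for doc in docs:
        seen.setdefault(composite_key(doc, fields), doc)
    return list(seen.values())

fields = ["user", "age", "city"]
docs = [
    {"user": "双榆树-张三", "age": 20, "city": "北京"},
    {"user": "双榆树-张三", "age": 20, "city": "北京"},  # duplicate
    {"user": "双榆树-张三", "age": 20, "city": "北京"},  # duplicate
    {"user": "东城区-李四", "age": 30, "city": "北京"},
]

unique = dedupe(docs, fields)
print(composite_key(docs[0], fields))  # 双榆树-张三#20#北京
print(len(unique))  # 2
```

One thing to watch with this kind of key: if a field value can itself contain the separator, two different records could produce the same key, so pick a separator that does not occur in the data.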

As you can see, every group is now distinct. Scripts really are powerful.

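Java implementation

Use the utility classes in the elasticsearch package to concatenate all fields of the index into a script, then pass it to the query as the aggregation parameter.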
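
The original Java code is not shown here, so the following is only a sketch of the same idea, written in Python rather than Java: given a list of field names, build the concatenation script and wrap it in the aggs + top_hits request body used earlier. The function name `build_dedup_query` is hypothetical, and the code only builds the JSON body; it does not send anything to ES.

```python
import json

def build_dedup_query(fields, sort_field, sep="#"):
    """Build a request body that buckets on a script-concatenated key."""
    # One doc-value accessor per field, joined with the separator literal.
    parts = [f"doc['{f}'].value" for f in fields]
    script = f" + '{sep}' + ".join(parts)
    return {
        "from": 0,
        "size": 0,
        "aggs": {
            "my": {
                "terms": {"script": script},
                "aggs": {
                    "my_top_hits": {
                        "top_hits": {
                            "sort": [{sort_field: {"order": "desc"}}],
                            "size": 1,
                        }
                    }
                },
            }
        },
    }

body = build_dedup_query(["user.keyword", "age", "city"], "age")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Note that a terms aggregation returns only the top 10 buckets by default, so a real query over many distinct records would also need a `size` on the terms aggregation or some form of pagination.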

Summary

This post introduced the aggregation features of ES: aggs + top_hits + script is enough to filter out duplicate data and get unique results.

–02020728
Addendum: pagination
Paginating after an Elasticsearch aggregation

I'll cover pagination some other time when I get the chance.
