Elasticsearch Series (4): Integrating the IK Analyzer into ES and How to Use It

2023-10-29


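An ES analyzer splits the query string into tokens and also splits the target data being searched into tokens; the two sets of tokens are then matched against each other.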
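To make the idea of "tokenize both sides, then match" concrete, here is a deliberately simplified sketch. The `tokenize` function is a hypothetical stand-in for an analyzer (it only lowercases and splits on whitespace); it does not reflect how Elasticsearch implements analysis or its inverted index.

```python
# Toy illustration only: real Elasticsearch analysis and matching
# are far more involved than this.
def tokenize(text):
    """Hypothetical stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def shared_terms(query, document):
    """Return the terms that appear in both the query and the document."""
    return set(tokenize(query)) & set(tokenize(document))


print(shared_terms("Learning Elastic", "I am learning elastic search"))
```

A document matches when at least one term is shared, which is why the way text is split into tokens directly affects what a search can find.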

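I. Standard Analyzer

The analyzer that ships with ES by default is the standard analyzer. It works well for English, but for Chinese it can only split text into individual characters, which does not meet the needs of Chinese search queries.

Let's test the standard analyzer and see how it handles the string “正在学习elastic search” ("currently learning elastic search"):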

{
    "tokens": [
        {
            "token": "正",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "在",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "学",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "习",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "elastic",
            "start_offset": 4,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "search",
            "start_offset": 12,
            "end_offset": 18,
            "type": "<ALPHANUM>",
            "position": 5
        }
    ]
}

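As the output shows, the standard analyzer handles English reasonably well, but the result for Chinese is poor because every character becomes its own token. That is why we need the IK analyzer.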
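A request like the one behind the output above can be sketched with Python's standard library. This is a minimal sketch, not the tool used in the article: it assumes an unsecured node listening at `http://localhost:9200`, and the helper names `build_analyze_request` and `analyze` are made up for illustration. The `_analyze` endpoint and its `analyzer` and `text` fields are part of the Elasticsearch REST API.

```python
import json
import urllib.request

# Assumption: a local node without authentication; adjust as needed.
ES_URL = "http://localhost:9200"


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build (but do not send) a POST request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url=ES_URL):
    """Send the request and return only the token strings."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload["tokens"]]
```

With a node running, `analyze("standard", "正在学习elastic search")` should return the token strings shown in the response above; swapping in a different analyzer name is the only change needed for other analyzers.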

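II. IK Analyzer

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. ES does not bundle the IK analyzer by default, so it must be downloaded separately. Download address: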
https://github.com/medcl/elasticsearch-analysis-ik/releases

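2.1 Download and Installation

1. Download the IK zip package. The IK version must match the ES version: this article uses ES 7.3.0, so the IK package downloaded is also 7.3.0;
2. Unzip it into a subdirectory of the plugins directory under the ES installation directory. The subdirectory name is arbitrary (ik is used here);
3. Restart ES.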
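The unzip step can also be scripted. The following is a minimal sketch under stated assumptions: `install_ik_plugin` is a hypothetical helper name, and the archive path and ES home directory are whatever applies to your machine. It only unpacks the archive; restarting ES is still a separate manual step.

```python
import zipfile
from pathlib import Path


def install_ik_plugin(zip_path, es_home, plugin_dir_name="ik"):
    """Unpack the IK release archive into <es_home>/plugins/<plugin_dir_name>."""
    target = Path(es_home) / "plugins" / plugin_dir_name
    # Create the plugin subdirectory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, `install_ik_plugin("elasticsearch-analysis-ik-7.3.0.zip", "/opt/elasticsearch-7.3.0")` would place the files under `/opt/elasticsearch-7.3.0/plugins/ik`; both paths here are illustrative, not taken from the article.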

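2.2 Testing the Result

The IK analyzer provides two segmentation modes:

ik_smart: coarsest-grained segmentation (fewest splits)
ik_max_word: finest-grained segmentation

Let's look at the results, again using the string “正在学习elastic search”.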

1.ik_smart

{
    "tokens": [
        {
            "token": "正在",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "学习",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "elastic",
            "start_offset": 4,
            "end_offset": 11,
            "type": "ENGLISH",
            "position": 2
        },
        {
            "token": "search",
            "start_offset": 12,
            "end_offset": 18,
            "type": "ENGLISH",
            "position": 3
        }
    ]
}

2.ik_max_word

{
    "tokens": [
        {
            "token": "正在",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "在学",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "学习",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "elastic",
            "start_offset": 4,
            "end_offset": 11,
            "type": "ENGLISH",
            "position": 3
        },
        {
            "token": "search",
            "start_offset": 12,
            "end_offset": 18,
            "type": "ENGLISH",
            "position": 4
        }
    ]
}

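Comparing the two results shows the difference between ik_max_word and ik_smart: ik_max_word produces more tokens at a finer granularity. Here it additionally emits the overlapping token “在学”.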
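The comparison can be done mechanically by diffing the token lists of the two responses. This is a small sketch; the helper names `token_strings` and `extra_tokens` are hypothetical, and the two dictionaries are trimmed copies of the responses above (only the `token` field is kept).

```python
def token_strings(analyze_response):
    """Extract the token strings from an _analyze response body."""
    return [entry["token"] for entry in analyze_response["tokens"]]


def extra_tokens(fine, coarse):
    """Tokens present in the finer-grained response but not in the coarser one."""
    coarse_set = set(token_strings(coarse))
    return [token for token in token_strings(fine) if token not in coarse_set]


# Trimmed copies of the two responses for the same input string.
smart = {"tokens": [{"token": "正在"}, {"token": "学习"},
                    {"token": "elastic"}, {"token": "search"}]}
max_word = {"tokens": [{"token": "正在"}, {"token": "在学"}, {"token": "学习"},
                       {"token": "elastic"}, {"token": "search"}]}

print(extra_tokens(max_word, smart))  # ['在学']
```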

2.3 Custom Dictionary

We analyze the string “最好听的歌” ("the best-sounding song"). The request and result are as follows:

{
	"analyzer":"ik_smart",
	"text":"最好听的歌"	
}
Output:
{
    "tokens": [
        {
            "token": "最好",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "听的歌",
            "start_offset": 2,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

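We want “最好听的歌” to be treated as a single word, but the analyzer does not produce it. In this case we need to add the word to the dictionary.

1. The IK analyzer was added under the ES plugins directory. Inside the analyzer's directory there is a config directory:

/plugins/ik/config

In this config directory, add a file named mydic.dic. The name is arbitrary, but the extension is .dic;

2. Add the word to mydic.dic: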

最好听的歌

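3. After saving, edit the IKAnalyzer.cfg.xml file in the ik/config directory so that its ext_dict entry references mydic.dic;
4. Restart ES. In an ES cluster, every node needs the same change.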
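The dictionary and configuration steps above can be sketched as a script. This is an assumption-laden sketch rather than the article's procedure: `add_custom_dictionary` is a hypothetical helper, it assumes the config file contains an `<entry key="ext_dict">` element as in the stock IKAnalyzer.cfg.xml, and it assumes multiple dictionaries are separated by semicolons. The dictionary is written as UTF-8 with one word per line.

```python
import re
from pathlib import Path

# Matches the ext_dict entry and captures its current value.
EXT_DICT_PATTERN = re.compile(r'(<entry key="ext_dict">)(.*?)(</entry>)', re.DOTALL)


def add_custom_dictionary(config_dir, words, dict_name="mydic.dic"):
    """Write words to <config_dir>/<dict_name> and register it under ext_dict."""
    config_dir = Path(config_dir)
    # One word per line, saved as UTF-8.
    (config_dir / dict_name).write_text("\n".join(words) + "\n", encoding="utf-8")

    cfg_path = config_dir / "IKAnalyzer.cfg.xml"
    xml = cfg_path.read_text(encoding="utf-8")

    def register(match):
        # Keep dictionaries that are already listed; append ours if missing.
        existing = [name.strip() for name in match.group(2).split(";") if name.strip()]
        if dict_name not in existing:
            existing.append(dict_name)
        return match.group(1) + ";".join(existing) + match.group(3)

    updated, count = EXT_DICT_PATTERN.subn(register, xml, count=1)
    if count == 0:
        raise ValueError("no ext_dict entry found in IKAnalyzer.cfg.xml")
    cfg_path.write_text(updated, encoding="utf-8")
    return updated
```

Editing the file as text rather than re-serializing it with an XML library leaves the rest of the file, including its DOCTYPE line and comments, untouched. ES still has to be restarted afterwards, on every node in a cluster.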

Test it:

ik_smart:

{
	"analyzer":"ik_smart",
	"text":"最好听的歌"	
}
Output:
{
    "tokens": [
        {
            "token": "最好听的歌",
            "start_offset": 0,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}

ik_max_word:

{
	"analyzer":"ik_max_word",
	"text":"最好听的歌"	
}
Output:
{
    "tokens": [
        {
            "token": "最好听的歌",
            "start_offset": 0,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "最好",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "好听",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "听的歌",
            "start_offset": 2,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 3
        }
    ]
}

Both analyzers now produce the token “最好听的歌”.

That's all for this article.
