ES ships with eight built-in analyzers by default; different scenarios call for different analyzers.
1. keyword analyzer
1.1 keyword type and tokenization behavior
The keyword analyzer treats the input string as a single unit and performs no tokenization.
// Test the default tokenization of the keyword analyzer
// Request
POST _analyze
{
  "analyzer": "keyword",
  "text": "The aggregations framework helps provide aggregated data based on a search query"
}
// Response
{
  "tokens" : [
    {
      "token" : "The aggregations framework helps provide aggregated data based on a search query",
      "start_offset" : 0,
      "end_offset" : 80,
      "type" : "word",
      "position" : 0
    }
  ]
}
The sentence above is analyzed into a single term:
[The aggregations framework helps provide aggregated data based on a search query]
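This behavior is easy to model: the analyzer emits its input unchanged as one token. A minimal Python sketch (the function name is hypothetical, not part of any ES client):

```python
def keyword_analyze(text: str) -> list[str]:
    """Model the keyword analyzer: the whole input becomes a single token."""
    return [text]

tokens = keyword_analyze(
    "The aggregations framework helps provide aggregated data based on a search query"
)
print(tokens)  # one token, identical to the input
```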
1.2 Composition of the keyword analyzer
No. | Subcomponent | Description
1 | Tokenizer | keyword tokenizer
To define a custom analyzer that behaves like keyword, simply use the keyword tokenizer in your custom analyzer; char filters and token filters can then be configured as needed, for example:
// Custom keyword analyzer
PUT custom_rebuild_keyword_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_keyword_analyzer": {
          "tokenizer": "keyword",
          "filter": []
        }
      }
    }
  }
}
2. pattern analyzer
2.1 pattern type and tokenization behavior
The pattern analyzer splits text on a regular expression: the regex matches the token separators, not the tokens themselves, so escape any regex metacharacters that should be treated literally. The default pattern is \W+ (one or more non-word characters).
// Test the default tokenization of the pattern analyzer
// Request
POST _analyze
{
  "analyzer": "pattern",
  "text": "It's a nice day"
}
// Response
{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "nice",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    }
  ]
}
The sentence above is analyzed into the following terms:
[it, s, a, nice, day]
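The default behavior can be approximated in Python with `re.split`: split on the separator regex, drop empty pieces, and lowercase the rest. This is a sketch of the analyzer's logic, not the actual Lucene implementation (Java and Python regex semantics differ in corner cases):

```python
import re

def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> list[str]:
    """Approximate the pattern analyzer: the regex matches separators, not tokens."""
    pieces = re.split(pattern, text)
    return [p.lower() if lowercase else p for p in pieces if p]

print(pattern_analyze("It's a nice day"))  # ['it', 's', 'a', 'nice', 'day']
```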
2.2 Configurable parameters of the pattern type
No. | Parameter | Description
1 | pattern | A Java regular expression; defaults to \W+
2 | flags | Java regex flags; multiple flags are pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS"
3 | lowercase | Whether the resulting terms are lowercased; defaults to true
4 | stopwords | A pre-defined stop word list such as _english_, or an array of stop words; defaults to _none_
5 | stopwords_path | Path to a file containing stop words
1) The following example splits on non-word characters and on '_', and lowercases the resulting terms:
// Custom analyzer for splitting email addresses
PUT custom_rebuild_pattern_email_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}
// Request
POST custom_rebuild_pattern_email_index/_analyze
{
  "analyzer": "email_analyzer",
  "text": "Ruyin_Zh@foo-bar.com"
}
// Tokenization result
{
  "tokens" : [
    {
      "token" : "ruyin",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zh",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "foo",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "bar",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "com",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "word",
      "position" : 4
    }
  ]
}
The address above is analyzed into the following terms:
[ruyin, zh, foo, bar, com]
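The same split can be checked with Python's `re.split` (a sketch of the email_analyzer above, not the actual Lucene implementation): \W already excludes '_', so the pattern adds it explicitly as an alternative.

```python
import re

# Split on any non-word character or underscore, drop empty pieces, lowercase.
tokens = [t.lower() for t in re.split(r"\W|_", "Ruyin_Zh@foo-bar.com") if t]
print(tokens)  # ['ruyin', 'zh', 'foo', 'bar', 'com']
```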
2) Splitting camelCase text
// Custom camelCase analyzer
PUT custom_rebuild_pattern_camel_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": """([^\p{L}\d]+)|(?<=\D)(?=\d)|(?<=\d)(?=\D)|(?<=[\p{L}&&[^\p{Lu}]])(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}[\p{L}&&[^\p{Lu}]])"""
        }
      }
    }
  }
}
// Request
POST custom_rebuild_pattern_camel_index/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}
// Tokenization result
{
  "tokens" : [
    {
      "token" : "moose",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "x",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ftp",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "class",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "2",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "beta",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "word",
      "position" : 5
    }
  ]
}
The text above is analyzed into the following terms:
[moose, x, ftp, class, 2, beta]
Explanation of the regular expression above:
([^\p{L}\d]+)                              # swallow non-letter, non-digit characters,
| (?<=\D)(?=\d)                            # or split where a non-digit is followed by a digit,
| (?<=\d)(?=\D)                            # or where a digit is followed by a non-digit,
| (?<=[\p{L}&&[^\p{Lu}]])(?=\p{Lu})        # or where a lowercase letter is followed by an uppercase letter,
| (?<=\p{Lu})(?=\p{Lu}[\p{L}&&[^\p{Lu}]])  # or where an uppercase letter is followed by an uppercase letter and then a lowercase letter
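Python's `re` module does not support Java's character-class intersection syntax ([\p{L}&&[^\p{Lu}]]), but for ASCII input the same five rules can be approximated with [a-z]/[A-Z] classes. A rough sketch, not the Lucene implementation:

```python
import re

# ASCII-only approximation of the camelCase pattern: Java's
# [\p{L}&&[^\p{Lu}]] (lowercase letters) becomes [a-z], \p{Lu} becomes [A-Z].
CAMEL = (
    r"[^a-zA-Z\d]+"               # swallow non-letter, non-digit characters
    r"|(?<=\D)(?=\d)"             # split between a non-digit and a digit
    r"|(?<=\d)(?=\D)"             # split between a digit and a non-digit
    r"|(?<=[a-z])(?=[A-Z])"       # split between lowercase and uppercase
    r"|(?<=[A-Z])(?=[A-Z][a-z])"  # split before the last capital of an acronym
)

tokens = [t.lower() for t in re.split(CAMEL, "MooseX::FTPClass2_beta") if t]
print(tokens)  # ['moose', 'x', 'ftp', 'class', '2', 'beta']
```

Note that `re.split` on zero-width matches (the lookaround alternatives) requires Python 3.7 or later.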
2.3 Composition of the pattern analyzer
No. | Subcomponent | Description
1 | Tokenizer | pattern tokenizer
2 | Token Filters | lowercase token filter, stop token filter (disabled by default)
To define a custom analyzer that behaves like pattern, configure a pattern tokenizer with the parameters above and keep the rest of the built-in configuration unchanged, for example:
// Custom analyzer
PUT custom_redefine_pattern_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      }
    }
  }
}
// Request
POST custom_redefine_pattern_index/_analyze
{
  "analyzer": "rebuilt_pattern",
  "text": "It's a nice day"
}
// Tokenization result
{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "nice",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    }
  ]
}
The sentence above is analyzed into the following terms:
[it, s, a, nice, day]