ES ships with eight built-in analyzers by default; different scenarios call for different analyzers.
1. keyword analyzer
1.1 keyword type and tokenization behavior
The keyword analyzer treats the input string as a single unit and performs no tokenization.
// Test the default tokenization of the keyword analyzer
// Request
POST _analyze
{
  "analyzer": "keyword",
  "text": "The aggregations framework helps provide aggregated data based on a search query"
}
// Response
{
  "tokens" : [
    {
      "token" : "The aggregations framework helps provide aggregated data based on a search query",
      "start_offset" : 0,
      "end_offset" : 80,
      "type" : "word",
      "position" : 0
    }
  ]
}
The sentence above is analyzed into a single term:
[The aggregations framework helps provide aggregated data based on a search query]
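This behavior is easy to model: the analyzer emits its input unchanged as one token. A minimal Python sketch (the function name is hypothetical, not part of any ES client):

```python
def keyword_analyze(text: str) -> list[str]:
    """Model the keyword analyzer: the whole input becomes a single token."""
    return [text]

tokens = keyword_analyze(
    "The aggregations framework helps provide aggregated data based on a search query"
)
print(tokens)  # one token, identical to the input
```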
1.2 Composition of the keyword analyzer
No. | Subcomponent | Description
1 | Tokenizer | keyword tokenizer
To define a custom analyzer that behaves like keyword, simply use the keyword tokenizer in your custom analyzer; char filters and token filters can then be configured as needed, for example:
// Custom keyword analyzer
PUT custom_rebuild_keyword_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_keyword_analyzer": {
          "tokenizer": "keyword",
          "filter": []
        }
      }
    }
  }
}
2. pattern analyzer
2.1 pattern type and tokenization behavior
The pattern analyzer splits text on a regular expression: the regex matches the token separators, not the tokens themselves, so escape any regex metacharacters that should be treated literally. The default pattern is \W+ (one or more non-word characters).
// Test the default tokenization of the pattern analyzer
// Request
POST _analyze
{
  "analyzer": "pattern",
  "text": "It's a nice day"
}
// Response
{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "nice",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    }
  ]
}
The sentence above is analyzed into the following terms:
[it, s, a, nice, day]
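The default behavior can be approximated in Python with `re.split`: split on the separator regex, drop empty pieces, and lowercase the rest. This is a sketch of the analyzer's logic, not the actual Lucene implementation (Java and Python regex semantics differ in corner cases):

```python
import re

def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> list[str]:
    """Approximate the pattern analyzer: the regex matches separators, not tokens."""
    pieces = re.split(pattern, text)
    return [p.lower() if lowercase else p for p in pieces if p]

print(pattern_analyze("It's a nice day"))  # ['it', 's', 'a', 'nice', 'day']
```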
2.2 Configurable parameters of the pattern type
No. | Parameter | Description
1 | pattern | A Java regular expression; defaults to \W+
2 | flags | Java regex flags; multiple flags are pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS"
3 | lowercase | Whether the resulting terms are lowercased; defaults to true
4 | stopwords | A pre-defined stop word list such as _english_, or an array of stop words; defaults to _none_
5 | stopwords_path | Path to a file containing stop words
1) The following example splits on non-word characters and on '_', and lowercases the resulting terms:
// Custom analyzer for splitting email addresses
PUT custom_rebuild_pattern_email_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}
// Request
POST custom_rebuild_pattern_email_index/_analyze
{
  "analyzer": "email_analyzer",
  "text": "Ruyin_Zh@foo-bar.com"
}
// Tokenization result
{
  "tokens" : [
    {
      "token" : "ruyin",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zh",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "foo",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "bar",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "com",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "word",
      "position" : 4
    }
  ]
}
The address above is analyzed into the following terms:
[ruyin, zh, foo, bar, com]
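The same split can be checked with Python's `re.split` (a sketch of the email_analyzer above, not the actual Lucene implementation): \W already excludes '_', so the pattern adds it explicitly as an alternative.

```python
import re

# Split on any non-word character or underscore, drop empty pieces, lowercase.
tokens = [t.lower() for t in re.split(r"\W|_", "Ruyin_Zh@foo-bar.com") if t]
print(tokens)  # ['ruyin', 'zh', 'foo', 'bar', 'com']
```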
2) Splitting camelCase text
// Custom camelCase analyzer
PUT custom_rebuild_pattern_camel_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": """([^\p{L}\d]+)|(?<=\D)(?=\d)|(?<=\d)(?=\D)|(?<=[\p{L}&&[^\p{Lu}]])(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}[\p{L}&&[^\p{Lu}]])"""
        }
      }
    }
  }
}
// Request
POST custom_rebuild_pattern_camel_index/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}
// Tokenization result
{
  "tokens" : [
    {
      "token" : "moose",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "x",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ftp",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "class",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "2",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "beta",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "word",
      "position" : 5
    }
  ]
}
The text above is analyzed into the following terms:
[moose, x, ftp, class, 2, beta]
Explanation of the regular expression above:
([^\p{L}\d]+)                              # swallow non-letter, non-digit characters,
| (?<=\D)(?=\d)                            # or split where a non-digit is followed by a digit,
| (?<=\d)(?=\D)                            # or where a digit is followed by a non-digit,
| (?<=[\p{L}&&[^\p{Lu}]])(?=\p{Lu})        # or where a lowercase letter is followed by an uppercase letter,
| (?<=\p{Lu})(?=\p{Lu}[\p{L}&&[^\p{Lu}]])  # or where an uppercase letter is followed by an uppercase letter and then a lowercase letter
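Python's `re` module does not support Java's character-class intersection syntax ([\p{L}&&[^\p{Lu}]]), but for ASCII input the same five rules can be approximated with [a-z]/[A-Z] classes. A rough sketch, not the Lucene implementation:

```python
import re

# ASCII-only approximation of the camelCase pattern: Java's
# [\p{L}&&[^\p{Lu}]] (lowercase letters) becomes [a-z], \p{Lu} becomes [A-Z].
CAMEL = (
    r"[^a-zA-Z\d]+"               # swallow non-letter, non-digit characters
    r"|(?<=\D)(?=\d)"             # split between a non-digit and a digit
    r"|(?<=\d)(?=\D)"             # split between a digit and a non-digit
    r"|(?<=[a-z])(?=[A-Z])"       # split between lowercase and uppercase
    r"|(?<=[A-Z])(?=[A-Z][a-z])"  # split before the last capital of an acronym
)

tokens = [t.lower() for t in re.split(CAMEL, "MooseX::FTPClass2_beta") if t]
print(tokens)  # ['moose', 'x', 'ftp', 'class', '2', 'beta']
```

Note that `re.split` on zero-width matches (the lookaround alternatives) requires Python 3.7 or later.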
2.3 Composition of the pattern analyzer
No. | Subcomponent | Description
1 | Tokenizer | pattern tokenizer
2 | Token Filters | lowercase token filter, stop token filter (disabled by default)
To define a custom analyzer that behaves like pattern, configure a pattern tokenizer with the parameters above and keep the rest of the built-in configuration unchanged, for example:
// Custom analyzer
PUT custom_redefine_pattern_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      }
    }
  }
}
// Request
POST custom_redefine_pattern_index/_analyze
{
  "analyzer": "rebuilt_pattern",
  "text": "It's a nice day"
}
// Tokenization result
{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "nice",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    }
  ]
}
The sentence above is analyzed into the following terms:
[it, s, a, nice, day]