Ngram
This tokenizer works well for searching Chinese person names, drug names, and similar text: it mechanically splits the input into overlapping runs of characters. First, add the custom my_tokenizer to the index my-index, as shown below:
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
Here my_tokenizer takes two key parameters: min_gram sets the minimum length of each generated gram, and max_gram sets the maximum length.
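The splitting rule can be sketched in a few lines of Python (a simplified illustration only: it slides a window of every length from min_gram to max_gram over the text, and ignores the token_chars filtering; the function name ngrams is ours, not part of Elasticsearch):

```python
def ngrams(text, min_gram=2, max_gram=3):
    """Emit all substrings of length min_gram..max_gram, in the
    same order the ngram tokenizer does: by start position, then
    by increasing length."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens

print(ngrams("梁朝伟"))  # ['梁朝', '梁朝伟', '朝伟']
```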
Next, let's test it:
POST my-index/_analyze
{
"analyzer": "my_analyzer",
"text": "梁朝伟"
}
The result is:
{
"tokens" : [
{
"token" : "梁朝",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "梁朝伟",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "朝伟",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 2
}
]
}
This approach is particularly effective when searching by name.
POST my-index/_analyze
{
"analyzer": "my_analyzer",
"text": "阿莫西林克拉维酸钾"
}
The result:
{
"tokens" : [
{
"token" : "阿莫",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "阿莫西",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "莫西",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 2
},
{
"token" : "莫西林",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 3
},
{
"token" : "西林",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 4
},
{
"token" : "西林克",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 5
},
{
"token" : "林克",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 6
},
{
"token" : "林克拉",
"start_offset" : 3,
"end_offset" : 6,
"type" : "word",
"position" : 7
},
{
"token" : "克拉",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 8
},
{
"token" : "克拉维",
"start_offset" : 4,
"end_offset" : 7,
"type" : "word",
"position" : 9
},
{
"token" : "拉维",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 10
},
{
"token" : "拉维酸",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 11
},
{
"token" : "维酸",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 12
},
{
"token" : "维酸钾",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 13
},
{
"token" : "酸钾",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 14
}
]
}
With the text indexed as overlapping grams like these, searching becomes very straightforward.
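For example, assuming a text field (here called name, a hypothetical field not defined in the mapping above) were mapped with my_analyzer, a plain match query on any in-word fragment would hit the document, because the fragment itself was indexed as a gram:

GET my-index/_search
{
  "query": {
    "match": {
      "name": "朝伟"
    }
  }
}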