
【ELK】 The Ngram Tokenizer

Overview

The ngram tokenizer is remarkably effective for searching Chinese person names, drug names, and similar text: it mechanically slices the input into overlapping runs of a few consecutive characters, with no linguistic analysis at all.

First, add the custom tokenizer my_tokenizer to the settings of an index called my-index:

PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

my_tokenizer takes two size parameters: min_gram, the minimum length of a generated token, and max_gram, the maximum. With min_gram: 2 and max_gram: 3, every 2- and 3-character substring of the input becomes a token. The token_chars setting keeps only letters and digits (CJK characters count as letters), so any other character acts as a gram boundary.
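
One caveat worth flagging (not from the original article, but true of Elasticsearch 7 and later): the index-level setting index.max_ngram_diff caps the difference between max_gram and min_gram at 1 by default. The 2/3 configuration above just fits; to widen the range, raise the cap first. A sketch, with the hypothetical index name my-index-wide:

# allow max_gram - min_gram up to 3 (the default cap is 1)
PUT my-index-wide
{
  "settings": {
    "index": {
      "max_ngram_diff": 3
    },
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}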

Chinese person names

Now let's test it with the _analyze API:

POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "梁朝伟"
}

The result:

{
  "tokens" : [
    {
      "token" : "梁朝",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "梁朝伟",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "朝伟",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    }
  ]
}

This approach is particularly effective when searching for names: every 2- and 3-character fragment of 梁朝伟 is indexed as its own token.
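
To move from _analyze experiments to real queries, the analyzer has to be attached to a field in the mapping. A minimal sketch; the field name name and the sample document are invented for illustration:

# map a field to the custom analyzer
PUT my-index/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

# index a document (refresh so it is searchable immediately)
PUT my-index/_doc/1?refresh
{
  "name": "梁朝伟"
}

# search by a fragment of the name
GET my-index/_search
{
  "query": {
    "match": {
      "name": "朝伟"
    }
  }
}

Because 朝伟 was indexed as its own 2-gram, the match query finds the document even though the query contains only part of the full name.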

Drug names

POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "阿莫西林克拉维酸钾"
}

The result:

{
  "tokens" : [
    {
      "token" : "阿莫",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "阿莫西",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "莫西",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "莫西林",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "西林",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "西林克",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "林克",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "林克拉",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "克拉",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "克拉维",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "拉维",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "拉维酸",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "维酸",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "维酸钾",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "word",
      "position" : 13
    },
    {
      "token" : "酸钾",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 14
    }
  ]
}

With every 2- and 3-character fragment of 阿莫西林克拉维酸钾 indexed, searching for any part of the drug name becomes trivial.
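
One refinement, offered as a suggestion rather than something from the article: if the ngram analyzer also runs at query time, the query string itself is shredded into grams, which can surface noisy partial matches. A common pattern is to keep the ngram analyzer for indexing and use a simpler one at search time via search_analyzer. A sketch, with the hypothetical field drug_name:

# ngram at index time, exact token at query time
PUT my-index/_mapping
{
  "properties": {
    "drug_name": {
      "type": "text",
      "analyzer": "my_analyzer",
      "search_analyzer": "keyword"
    }
  }
}

With search_analyzer: keyword the whole query string is matched as a single token, so a query must be between min_gram and max_gram characters long (here 2 or 3) to hit an indexed gram, e.g. 西林 or 克拉维.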
