Mapping character filter #

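The mapping character filter accepts a map of key-value pairs for character replacement. Whenever the filter encounters a string of characters that matches a key, it replaces those characters with the corresponding value. A replacement value can be an empty string, in which case the matched characters are removed.

The filter uses greedy matching: when several keys match at the same position, the longest one wins.

The mapping character filter is useful when specific text replacements must be made before tokenization.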
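The greedy rule can be illustrated with a small Python sketch. It is a simplified model, not the filter's actual implementation: the function name `apply_mappings` is hypothetical, and rule parsing is reduced to splitting on `=>` and trimming whitespace.

```python
def apply_mappings(text, mappings):
    """Replace keys with values, preferring the longest key at each position."""
    # Parse "key => value" rules into a dict (simplified parsing, an assumption).
    rules = {}
    for rule in mappings:
        key, value = rule.split("=>", 1)
        key = key.strip()
        if not key:
            raise ValueError(f"empty key in rule: {rule!r}")
        rules[key] = value.strip()

    # Try longer keys first so that "III" wins over "II" and "I".
    keys = sorted(rules, key=len, reverse=True)

    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No key matches here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


ROMAN = ["I => 1", "II => 2", "III => 3", "IV => 4", "V => 5"]
print(apply_mappings("I have III apples and IV oranges", ROMAN))
# prints: 1 have 3 apples and 4 oranges
```

Without the longest-first ordering, `III` would be consumed one `I` at a time and come out as `111`.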

Example #

The following request configures a mapping character filter that converts the Roman numerals I through V into the corresponding Arabic numerals (1 through 5):

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "I => 1",
        "II => 2",
        "III => 3",
        "IV => 4",
        "V => 5"
      ]
    }
  ],
  "text": "I have III apples and IV oranges"
}

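Because the keyword tokenizer emits the whole input as one token, the response contains a single token in which the Roman numerals are replaced with Arabic numerals. The pronoun "I" is replaced as well, because the filter matches characters, not words: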

{
  "tokens": [
    {
      "token": "1 have 3 apples and 4 oranges",
      "start_offset": 0,
      "end_offset": 32,
      "type": "word",
      "position": 0
    }
  ]
}
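Note that the token is shorter than the input, yet end_offset is 32. This is consistent with offsets being reported against the original text rather than the filtered text. A quick check of the two lengths, using the strings from the example above:

```python
original = "I have III apples and IV oranges"
filtered = "1 have 3 apples and 4 oranges"

# end_offset in the response matches the length of the original input.
print(len(original))  # 32
# The filtered token itself is three characters shorter.
print(len(filtered))  # 29
```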

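Parameters #

Use either of the following parameters to configure the key-value mappings.

| Parameter | Required/Optional | Data type | Description |
| :--- | :--- | :--- | :--- |
| mappings | Optional | Array | An array of key-value pairs in the format key => value. Each key found in the input text is replaced with its corresponding value. |
| mappings_path | Optional | String | The path to a UTF-8 encoded file containing the key-value mappings. Each mapping must appear on its own line in the format key => value. The path can be absolute or relative to the Easysearch config directory. |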
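As a rough sketch of how a mappings file might be prepared, the Python snippet below writes one `key => value` rule per line in UTF-8 and prints a character filter definition that points at it. The file name `abbreviations.txt`, the `analysis/` subdirectory, and the helper `write_mappings_file` are assumptions made for illustration; the file would still need to be copied to the location that `mappings_path` refers to.

```python
import json
from pathlib import Path


def write_mappings_file(rules, path):
    """Write one `key => value` rule per line, UTF-8 encoded."""
    lines = [f"{key} => {value}" for key, value in rules.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Illustrative rules, reusing the abbreviations from this page.
rules = {
    "BTW": "By the way",
    "IDK": "I don't know",
    "FYI": "For your information",
}
write_mappings_file(rules, "abbreviations.txt")

# Hypothetical path, relative to the Easysearch config directory.
char_filter = {
    "type": "mapping",
    "mappings_path": "analysis/abbreviations.txt",
}
print(json.dumps(char_filter, indent=2))
```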

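Using a custom mapping character filter #

You can create a custom mapping character filter by defining your own set of mappings. The following request creates an index with a custom character filter, custom_abbr_filter, that expands common abbreviations, and an analyzer, custom_abbr_analyzer, that applies the filter before the standard tokenizer: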

PUT /test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_abbr_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "custom_abbr_filter"
          ]
        }
      },
      "char_filter": {
        "custom_abbr_filter": {
          "type": "mapping",
          "mappings": [
            "BTW => By the way",
            "IDK => I don't know",
            "FYI => For your information"
          ]
        }
      }
    }
  }
}

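Use the following request to examine the token produced by the character filter. The request applies custom_abbr_filter together with the keyword tokenizer, rather than custom_abbr_analyzer, so the entire text is returned as a single token: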

GET /test-index/_analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "custom_abbr_filter" ],
  "text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday."
}

The response shows that the abbreviations have been replaced:

{
  "tokens": [
    {
      "token": "For your information, updates to the workout schedule are posted. I don't know when it takes effect, but we have some details. By the way, the finalized schedule will be released Monday.",
      "start_offset": 0,
      "end_offset": 153,
      "type": "word",
      "position": 0
    }
  ]
}
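To exercise custom_abbr_analyzer itself, the request body would name the analyzer instead of a tokenizer and character filter. In that case the standard tokenizer splits the expanded text into individual word tokens. The sketch below only builds the request body in Python; sending it to `GET /test-index/_analyze` is left to whichever HTTP client you use, and the sample text is an assumption.

```python
import json

# Request body for GET /test-index/_analyze, using the analyzer defined above.
analyze_body = {
    "analyzer": "custom_abbr_analyzer",
    "text": "FYI, updates to the workout schedule are posted.",
}
print(json.dumps(analyze_body, indent=2))
```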