映射字符过滤器 #

映射（mapping）字符过滤器接受一个用于字符替换的键值对映射。每当该过滤器遇到与某个键匹配的字符串时，它就会用相应的值来替换这些字符。替换值可以是空字符串。

该过滤器采用贪婪匹配方式，这意味着会匹配最长的匹配结果。

在分词过程之前，需要进行特定文本替换的场景下，映射字符过滤器会很有帮助。

参考样例 #

以下请求配置了一个映射字符过滤器，该过滤器可将罗马数字（如 I、II 或 III）转换为对应的阿拉伯数字（1、2 和 3）：

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "I => 1",
        "II => 2",
        "III => 3",
        "IV => 4",
        "V => 5"
      ]
    }
  ],
  "text": "I have III apples and IV oranges"
}

返回内容中包含一个词元，其中罗马数字已被替换为阿拉伯数字：

{
  "tokens": [
    {
      "token": "1 have 3 apples and 4 oranges",
      "start_offset": 0,
      "end_offset": 32,
      "type": "word",
      "position": 0
    }
  ]
}

参数说明 #

你可以使用以下任意一个参数来配置键值映射。

参数	必需/可选	数据类型	描述
`mappings`	可选	数组	格式为 `key => value` 的键值对数组。在输入文本中找到的每个键都将被其对应的值替换。
`mappings_path`	可选	字符串	包含键值映射的 UTF-8 编码文件的路径。每个映射应在新的一行中以 `key => value` 的格式呈现。该路径可以是绝对路径，也可以是相对于 Easysearch 配置目录的相对路径。

使用自定义映射字符过滤器 #

你可以通过定义自己的映射集来创建自定义映射字符过滤器。以下请求将创建一个自定义字符过滤器，用于替换文本中的常见缩写：

PUT /test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_abbr_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "custom_abbr_filter"
          ]
        }
      },
      "char_filter": {
        "custom_abbr_filter": {
          "type": "mapping",
          "mappings": [
            "BTW => By the way",
            "IDK => I don't know",
            "FYI => For your information"
          ]
        }
      }
    }
  }
}

使用以下请求来检查使用该分词器生成的词元：

GET /test-index/_analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "custom_abbr_filter" ],
  "text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday."
}

返回内容显示这些缩写已被替换：

{
  "tokens": [
    {
      "token": "For your information, updates to the workout schedule are posted. I don't know when it takes effect, but we have some details. By the way, the finalized schedule will be released Monday.",
      "start_offset": 0,
      "end_offset": 153,
      "type": "word",
      "position": 0
    }
  ]
}