HanLP Tokenizer #

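HanLP is a Chinese natural language processing library. It is integrated into Easysearch through the analysis-hanlp plugin, which provides seven tokenization modes that cover needs ranging from high speed to high accuracy.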

Prerequisites #

bin/easysearch-plugin install analysis-hanlp

Tokenization Modes at a Glance #

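Tokenizer	Mode	Speed	Accuracy	Typical use
`hanlp_standard`	Standard	★★★	★★★★	General-purpose Chinese tokenization
`hanlp_index`	Index	★★★	★★★	Maximizing recall at index time
`hanlp_nlp`	NLP	★★	★★★★★	Named entity recognition
`hanlp_crf`	CRF	★★	★★★★★	New word discovery
`hanlp_n_short`	N-shortest path	★★	★★★★	Ambiguity resolution
`hanlp_dijkstra`	Shortest path	★★★	★★★	Fast, precise tokenization
`hanlp_speed`	Speed	★★★★★	★★	High-throughput processing of large data volumes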

Recommended Index/Search Pairing #

PUT my-hanlp-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "hanlp_index_analyzer": {
          "type": "custom",
          "tokenizer": "hanlp_index"
        },
        "hanlp_search_analyzer": {
          "type": "custom",
          "tokenizer": "hanlp_standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "hanlp_index_analyzer",
        "search_analyzer": "hanlp_search_analyzer"
      }
    }
  }
}
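
To try other tokenizer pairings, the request body above can be generated rather than edited by hand. The following is a minimal Python sketch, not part of the plugin: the helper name `build_index_body` and its `field` parameter are hypothetical, while the analyzer names and the structure mirror the request above.

```python
import json


def build_index_body(index_tokenizer: str, search_tokenizer: str,
                     field: str = "content") -> dict:
    """Build a create-index body that uses one HanLP tokenizer at index time
    and another at search time (hypothetical helper, for illustration only)."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Analyzer applied when documents are indexed.
                    "hanlp_index_analyzer": {
                        "type": "custom",
                        "tokenizer": index_tokenizer,
                    },
                    # Analyzer applied to query text at search time.
                    "hanlp_search_analyzer": {
                        "type": "custom",
                        "tokenizer": search_tokenizer,
                    },
                }
            }
        },
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "analyzer": "hanlp_index_analyzer",
                    "search_analyzer": "hanlp_search_analyzer",
                }
            }
        },
    }


# Reproduce the pairing shown above: hanlp_index for indexing,
# hanlp_standard for searching.
body = build_index_body("hanlp_index", "hanlp_standard")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of `PUT my-hanlp-index`. Swapping the arguments for other tokenizer names from the modes table gives a quick way to compare pairings on a test index.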

Testing Tokenization #

GET /_analyze
{
  "tokenizer": "hanlp_standard",
  "text": "中华人民共和国国歌"
}
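
To see how the modes differ, the same text can be run through several tokenizers. The sketch below uses only the Python standard library and rests on several assumptions that are not stated in this page: an unsecured test node at `http://localhost:9200`, an `_analyze` endpoint that also accepts `POST`, and a response containing a `tokens` array whose entries have a `token` field. The function names are hypothetical. A secured cluster would additionally need TLS and credentials.

```python
import json
import urllib.request

# Tokenizer names taken from the modes table in this page.
HANLP_TOKENIZERS = [
    "hanlp_standard",
    "hanlp_index",
    "hanlp_nlp",
    "hanlp_crf",
    "hanlp_n_short",
    "hanlp_dijkstra",
    "hanlp_speed",
]


def build_analyze_request(base_url: str, tokenizer: str, text: str) -> urllib.request.Request:
    """Build (but do not send) an _analyze request for one tokenizer."""
    payload = {"tokenizer": tokenizer, "text": text}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        # Keep Chinese characters readable in the body instead of \u escapes.
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response: dict) -> list:
    """Pull token strings out of a response, assuming a 'tokens' array."""
    return [entry["token"] for entry in response.get("tokens", [])]


def compare_tokenizers(base_url: str, text: str) -> dict:
    """Send the text to every HanLP tokenizer and collect the tokens per mode.

    Performs network calls, so it needs a reachable node.
    """
    results = {}
    for name in HANLP_TOKENIZERS:
        request = build_analyze_request(base_url, name, text)
        with urllib.request.urlopen(request) as reply:
            results[name] = extract_tokens(json.load(reply))
    return results
```

Calling `compare_tokenizers("http://localhost:9200", "中华人民共和国国歌")` against a node with the plugin installed would return one token list per mode, which makes it easier to choose between, for example, `hanlp_index` and `hanlp_standard`.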

Choosing a Mode #

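Requirement	Recommended mode
General use	`hanlp_standard`
Maximum recall at index time	`hanlp_index`
Recognizing person, place, and organization names	`hanlp_nlp`
Recognizing new words (words outside the training corpus)	`hanlp_crf`
Maximum throughput	`hanlp_speed`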
