Whitespace analyzer #
The whitespace analyzer splits text into tokens based only on whitespace characters (for example, spaces and tabs). It does not apply transformations such as lowercasing or stopword removal, so the original case of the text is preserved and punctuation is included as part of the tokens.
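To see this behavior directly, you can send a quick _analyze request using the built-in whitespace analyzer (the sample text here is only illustrative; any string works):
POST /_analyze
{
  "analyzer": "whitespace",
  "text": "The SLOW turtle swims away! 123"
}
Because no lowercasing or punctuation stripping is applied, the resulting tokens are The, SLOW, turtle, swims, away!, and 123.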
Example #
The following command creates an index named my_whitespace_index that uses the whitespace analyzer:
PUT /my_whitespace_index
{
"mappings": {
"properties": {
"my_field": {
"type": "text",
"analyzer": "whitespace"
}
}
}
}
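To confirm which analyzer a field resolves to, the _analyze API can also be called with a field parameter instead of an analyzer name; the sample text below is just an illustration:
GET /my_whitespace_index/_analyze
{
  "field": "my_field",
  "text": "Hello World!"
}
This analyzes the text with the analyzer mapped to my_field, in this case the whitespace analyzer, so the tokens come back as Hello and World! with case and punctuation intact.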
Configuring a custom analyzer #
The following command configures an index with a custom analyzer that is equivalent to the whitespace analyzer with an added lowercase token filter:
PUT /my_custom_whitespace_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"properties": {
"my_field": {
"type": "text",
"analyzer": "my_custom_whitespace_analyzer"
}
}
}
}
Generated tokens #
Use the following request to examine the tokens generated by the analyzer:
POST /my_custom_whitespace_index/_analyze
{
"analyzer": "my_custom_whitespace_analyzer",
"text": "The SLOW turtle swims away! 123"
}
The response contains the generated tokens:
{
"tokens": [
{"token": "the","start_offset": 0,"end_offset": 3,"type": "word","position": 0},
{"token": "slow","start_offset": 4,"end_offset": 8,"type": "word","position": 1},
{"token": "turtle","start_offset": 9,"end_offset": 15,"type": "word","position": 2},
{"token": "swims","start_offset": 16,"end_offset": 21,"type": "word","position": 3},
{"token": "away!","start_offset": 22,"end_offset": 27,"type": "word","position": 4},
{"token": "123","start_offset": 28,"end_offset": 31,"type": "word","position": 5}
]
}
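Because the lowercase filter runs at both index and search time, queries match case-insensitively. As a rough sketch (the document ID 1 and its content are hypothetical), indexing a document and then searching for an upper-case term should still return it:
PUT /my_custom_whitespace_index/_doc/1
{
  "my_field": "The SLOW turtle swims away! 123"
}

POST /my_custom_whitespace_index/_search
{
  "query": {
    "match": {
      "my_field": "SLOW"
    }
  }
}
The match query analyzes the query string with my_custom_whitespace_analyzer, so SLOW becomes the token slow and matches the token stored at index time.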