Analysis is the process of converting text into individual terms (tokens). Out of the box, ES only tokenizes English properly; Chinese is not supported, and every Chinese character is split into a separate token.
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"standard",
"text":"good good study"
}
# Response
{
"tokens": [
{
"token": "good",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "good",
"start_offset": 5,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "study",
"start_offset": 10,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 2
}
]
}
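With Chinese text, the standard analyzer falls back to one token per character. A quick sketch to confirm this (same endpoint as above; the text 中华人民 is just an arbitrary sample):
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"standard",
"text":"中华人民"
}
# Each character comes back as its own token (中 / 华 / 人 / 民),
# each with type "<IDEOGRAPHIC>" and consecutive positions.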
To analyze text under a specific index:
POST /my_doc/_analyze
{
"analyzer": "standard",
"field": "name",
"text": "text文本"
}
standard: the default analyzer; splits text into terms and lowercases them.
simple: splits on non-letter characters; lowercases terms.
whitespace: splits on whitespace only; case is left unchanged.
stop: removes meaningless stop words such as the/a/an/is…
keyword: no tokenization; the whole text is kept as one single term.
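A quick sketch to see the differences, reusing the request above: the whitespace analyzer keeps the original case, while standard lowercases everything:
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"whitespace",
"text":"Good Good study"
}
# Tokens: "Good", "Good", "study" (case preserved, split only on spaces);
# with "analyzer": "standard" the same text yields "good", "good", "study".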
For proper Chinese word segmentation, install the IK analyzer plugin. GitHub: https://github.com/medcl/elasticsearch-analysis-ik
Pick the IK release that matches your ES version; mine is 7.5.1.
[root@localhost software]# ls
elasticsearch-7.5.1-linux-x86_64.tar.gz elasticsearch-analysis-ik-7.5.1.zip
[root@localhost software]# unzip elasticsearch-analysis-ik-7.5.1.zip -d /usr/local/elasticsearch-7.5.1/plugins/ik
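After unzipping, restart ES so the plugin is loaded. As a sanity check (a sketch, assuming the install path used above), the plugin should appear in the plugin list:
[root@localhost software]# /usr/local/elasticsearch-7.5.1/bin/elasticsearch-plugin list
analysis-ik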
ik_max_word: the finest-grained segmentation. For example, “中华人民共和国国歌” is split into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”, exhausting every possible combination; suited to Term queries.
ik_smart: the coarsest-grained segmentation. For example, “中华人民共和国国歌” is split into just “中华人民共和国, 国歌”; suited to Phrase queries.
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"ik_max_word",
"text":"上下班做公交"
}
# Response
{
"tokens": [
{
"token": "上下班",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "上下",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "下班",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 2
},
{
"token": "做",
"start_offset": 3,
"end_offset": 4,
"type": "CN_CHAR",
"position": 3
},
{
"token": "公交",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 4
}
]
}
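For contrast, a sketch of the same text run through ik_smart (expected output based on the ik_smart description above, not verified against this exact dictionary build):
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"ik_smart",
"text":"上下班做公交"
}
# Expected tokens: 上下班 / 做 / 公交, with no overlapping sub-words.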
1. Open IKAnalyzer.cfg.xml (in the plugin's config directory) and configure it as follows:
<!-- users can configure their own extension dictionaries here -->
<entry key="ext_dict">custom.dic</entry>
2. After saving, create custom.dic in the same directory:
[esuser@localhost config]$ cat custom.dic
崔神
牛皮
3. Restart ES.
4. Test:
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"ik_smart",
"text":"崔神牛皮"
}
# Response
{
"tokens": [
{
"token": "崔神",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "牛皮",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
}
]
}
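For comparison (a sketch, not verified here): without the custom.dic entries, IK has no entry for 崔神 and would likely fall back to single-character tokens:
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"ik_smart",
"text":"崔神牛皮"
}
# Likely tokens without the custom dictionary: 崔 (CN_CHAR), 神 (CN_CHAR), 牛皮 (CN_WORD);
# 牛皮 is a common word and may already be in IK's built-in dictionary.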
Original article: https://www.cnblogs.com/zhenghengbin/p/12286400.html