Analysis is the process of converting text into individual terms (tokens). Out of the box, ES only tokenizes English properly; Chinese is not supported, and every Chinese character is split into a separate token.
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"standard",
"text":"good good study"
}
# Response
{
"tokens": [
{
"token": "good",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "good",
"start_offset": 5,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "study",
"start_offset": 10,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 2
}
]
}
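With Chinese text, the standard analyzer falls back to one token per character. A quick sketch to confirm this (same endpoint as above; the text 中华人民 is just an arbitrary sample):
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"standard",
"text":"中华人民"
}
# Each character comes back as its own token (中 / 华 / 人 / 民),
# each with type "<IDEOGRAPHIC>" and consecutive positions.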
To analyze text under a specific index:
POST /my_doc/_analyze
{
"analyzer": "standard",
"field": "name",
"text": "text文本"
}
standard: the default analyzer; splits text into terms and lowercases them.
simple: splits on non-letter characters; lowercases terms.
whitespace: splits on whitespace only; case is left unchanged.
stop: removes meaningless stop words such as the/a/an/is…
keyword: no tokenization; the whole text is kept as one single term.
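A quick sketch to see the differences, reusing the request above: the whitespace analyzer keeps the original case, while standard lowercases everything:
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"whitespace",
"text":"Good Good study"
}
# Tokens: "Good", "Good", "study" (case preserved, split only on spaces);
# with "analyzer": "standard" the same text yields "good", "good", "study".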
For proper Chinese word segmentation, install the IK analyzer plugin. GitHub: https://github.com/medcl/elasticsearch-analysis-ik
Pick the IK release that matches your ES version; mine is 7.5.1.
[root@localhost software]# ls
elasticsearch-7.5.1-linux-x86_64.tar.gz elasticsearch-analysis-ik-7.5.1.zip
[root@localhost software]# unzip elasticsearch-analysis-ik-7.5.1.zip -d /usr/local/elasticsearch-7.5.1/plugins/ik
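After unzipping, restart ES so the plugin is loaded. As a sanity check (a sketch, assuming the install path used above), the plugin should appear in the plugin list:
[root@localhost software]# /usr/local/elasticsearch-7.5.1/bin/elasticsearch-plugin list
analysis-ik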
ik_max_word: the finest-grained segmentation. For example, “中华人民共和国国歌” is split into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”, exhausting every possible combination; suited to Term queries.
ik_smart: the coarsest-grained segmentation. For example, “中华人民共和国国歌” is split into just “中华人民共和国, 国歌”; suited to Phrase queries.
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"ik_max_word",
"text":"上下班做公交"
}
# Response
{
"tokens": [
{
"token": "上下班",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "上下",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "下班",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 2
},
{
"token": "做",
"start_offset": 3,
"end_offset": 4,
"type": "CN_CHAR",
"position": 3
},
{
"token": "公交",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 4
}
]
}
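For contrast, a sketch of the same text run through ik_smart (expected output based on the ik_smart description above, not verified against this exact dictionary build):
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"ik_smart",
"text":"上下班做公交"
}
# Expected tokens: 上下班 / 做 / 公交, with no overlapping sub-words.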
1. Open IKAnalyzer.cfg.xml (in the plugin's config directory) and configure it as follows:
<!-- users can configure their own extension dictionaries here -->
<entry key="ext_dict">custom.dic</entry>
2. After saving, create custom.dic in the same directory:
[esuser@localhost config]$ cat custom.dic
崔神
牛皮
3. Restart ES.
4. Test:
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"ik_smart",
"text":"崔神牛皮"
}
# Response
{
"tokens": [
{
"token": "崔神",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "牛皮",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
}
]
}
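For comparison (a sketch, not verified here): without the custom.dic entries, IK has no entry for 崔神 and would likely fall back to single-character tokens:
POST http://192.168.247.8:9200/_analyze
{
"analyzer":"ik_smart",
"text":"崔神牛皮"
}
# Likely tokens without the custom dictionary: 崔 (CN_CHAR), 神 (CN_CHAR), 牛皮 (CN_WORD);
# 牛皮 is a common word and may already be in IK's built-in dictionary.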
Original article: https://www.cnblogs.com/zhenghengbin/p/12286400.html