41. A Brief Introduction to Analyzers

Date: 2018-02-25 19:18:16


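Key points

1. What is an analyzer

An analyzer splits a document into individual terms and normalizes them. This is the normalization step that ES performs, and it improves recall.

Recall: at search time, increasing the number of relevant results a query is able to find.

Only after text has passed through an analyzer can ES build the inverted index.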
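To make the link between analysis and the inverted index concrete, here is a minimal sketch in Python. It is not how ES is implemented: the `analyze` function is a toy stand-in (lowercasing plus splitting on non-alphanumeric characters), and the two sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (an assumption, not ES code): lowercase, then keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents, used only to show the effect of normalization.
docs = {
    1: "Set the Shape",
    2: "shape of the set",
}

index = build_inverted_index(docs)
print(index["shape"])  # [1, 2]
```

Because both "Shape" and "shape" are normalized to the same term, a search for "shape" finds both documents. That is the recall improvement described above.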

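2. Built-in analyzers

ES ships with several built-in analyzers, including the standard analyzer, the simple analyzer, the whitespace analyzer, and the language analyzers. For Chinese text, the developer still has to install a Chinese analyzer manually.

Suppose we have the following text: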

Set the shape to semi-transparent by calling set_trans(5)

standard analyzer (the default): the result is set, the, shape, to, semi, transparent, by, calling, set_trans, 5
simple analyzer: the result is set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: the result is Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (an analyzer for a specific language, for example english, the English analyzer): the result is set, shape, semi, transpar, call, set_tran, 5
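The differences between the first three analyzers can be imitated with a few lines of Python. These are rough approximations that happen to reproduce the results for this one sentence. The real analyzers run inside ES, and the standard analyzer follows Unicode text segmentation rules rather than the simple regular expression assumed here. The function names are hypothetical.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def whitespace_tokens(text):
    """Approximates the whitespace analyzer: split on whitespace, change nothing else."""
    return text.split()


def simple_tokens(text):
    """Approximates the simple analyzer: lowercase, then keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())


def standard_like_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase, then keep runs of
    letters, digits, and underscores. Only an approximation for ASCII text."""
    return re.findall(r"\w+", text.lower())


for name, fn in [
    ("standard (approx.)", standard_like_tokens),
    ("simple", simple_tokens),
    ("whitespace", whitespace_tokens),
]:
    print(f"{name}: {', '.join(fn(TEXT))}")
```

The language analyzer is left out on purpose: its output depends on stemming and stopword removal, which a one-line regular expression cannot imitate.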

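3. Other notes

Elasticsearch has many built-in analyzers, for example standard, english, and chinese. The standard analyzer simply splits text word by word (character by character for Chinese), so it applies broadly but is imprecise. The english analyzer is smarter about English: it handles singular and plural forms and letter case, and it filters out stopwords (such as "the"). The chinese analyzer performs poorly, as will be demonstrated later. The main topics this time are installing the Chinese analyzer ik, comparing the output of different analyzers, and arriving at a reasonably good configuration. Two other useful articles about Elasticsearch, worth reading if needed: "Installing, running, and basic configuration of Elasticsearch" and "Backup and restore".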
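One way to compare analyzers is the ES `_analyze` API, which takes an analyzer name and a text and returns the resulting tokens. The sketch below only builds the request bodies and shows how to read tokens out of a parsed response; it sends nothing. The sample text and helper names are made up for illustration, and `ik_max_word` is assumed to be available only after the ik plugin has been installed.

```python
import json


def analyze_request(analyzer, text):
    """Return (path, JSON body) for an _analyze call. Nothing is sent here."""
    body = {"analyzer": analyzer, "text": text}
    return "/_analyze", json.dumps(body, ensure_ascii=False)


def extract_tokens(response):
    """Pull the token strings out of a parsed _analyze response dict."""
    return [t["token"] for t in response.get("tokens", [])]


# Hypothetical sample text; ik_max_word assumes the ik plugin is installed.
sample = "中华人民共和国"
for name in ["standard", "english", "ik_max_word"]:
    path, body = analyze_request(name, sample)
    print("POST", path, body)
```

Each printed body can be sent to a running ES node with any HTTP client, and the parsed JSON response passed to `extract_tokens` to compare the token lists side by side.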

Original article: https://www.cnblogs.com/liuqianli/p/8469682.html
