码迷,mamicode.com
首页 > 其他好文 > 详细

Elasticsearch之中文分词器插件es-ik

时间:2017-02-24 22:12:56      阅读:787      评论:0      收藏:0      [点我收藏+]

标签:logs   compiler   color   lease   target   offset   目录   hadoop   分词   

 

前提

什么是倒排索引?

Elasticsearch之分词器的作用

Elasticsearch之分词器的工作流程

Elasticsearch之停用

Elasticsearch之中文分词器

Elasticsearch之几个重要的分词器

 

 

 

 

 

 

 

elasticsearch官方默认的分词插件

  1、elasticsearch官方默认的分词插件,对中文分词效果不理想。

  比如,我现在,拿个具体实例来展现下,验证为什么,es官网提供的分词插件对中文分词而言,效果差。

技术分享

技术分享

[hadoop@HadoopMaster elasticsearch-2.4.3]$ jps
2044 Jps
1979 Elasticsearch
[hadoop@HadoopMaster elasticsearch-2.4.3]$ pwd
/home/hadoop/app/elasticsearch-2.4.3
[hadoop@HadoopMaster elasticsearch-2.4.3]$ curl ‘http://192.168.80.10:9200/zhouls/_analyze?pretty=true‘ -d ‘{"text":"这里是好记性不如烂笔头感叹号的博客园"}‘
{
"tokens" : [ {
"token" : "",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
}, {
"token" : "",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
}, {
"token" : "",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
}, {
"token" : "",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
}, {
"token" : "",

 

"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
}, {
"token" : "",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 5
}, {
"token" : "",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 6
}, {
"token" : "",
"start_offset" : 7,
"end_offset" : 8,
"type" : "<IDEOGRAPHIC>",
"position" : 7
}, {
"token" : "",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<IDEOGRAPHIC>",
"position" : 8
}, {
"token" : "",

"start_offset" : 9,
"end_offset" : 10,
"type" : "<IDEOGRAPHIC>",
"position" : 9
}, {
"token" : "",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<IDEOGRAPHIC>",
"position" : 10
}, {
"token" : "",
"start_offset" : 11,
"end_offset" : 12,
"type" : "<IDEOGRAPHIC>",
"position" : 11
}, {
"token" : "",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<IDEOGRAPHIC>",
"position" : 12
}, {
"token" : "",
"start_offset" : 13,
"end_offset" : 14,
"type" : "<IDEOGRAPHIC>",
"position" : 13
}, {
"token" : "",

"start_offset" : 14,
"end_offset" : 15,
"type" : "<IDEOGRAPHIC>",
"position" : 14
}, {
"token" : "",
"start_offset" : 15,
"end_offset" : 16,
"type" : "<IDEOGRAPHIC>",
"position" : 15
}, {
"token" : "",
"start_offset" : 16,
"end_offset" : 17,
"type" : "<IDEOGRAPHIC>",
"position" : 16
}, {
"token" : "",
"start_offset" : 17,
"end_offset" : 18,
"type" : "<IDEOGRAPHIC>",
"position" : 17
} ]
}
[hadoop@HadoopMaster elasticsearch-2.4.3]$

 

总结

     如果直接使用Elasticsearch的朋友在处理中文内容的搜索时,肯定会遇到很尴尬的问题——中文词语被分成了一个一个的汉字,当用Kibana作图的时候,按照term来分组,结果一个汉字被分成了一组。

     这是因为使用了Elasticsearch中默认的标准分词器,这个分词器在处理中文的时候会把中文单词切分成一个一个的汉字,因此引入es之中文的分词器插件es-ik就能解决这个问题

 

 

 

 

 

 

 

 

 

如何集成IK分词工具

   总的流程如下:

1:下载es的IK插件https://github.com/medcl/elasticsearch-analysis-ik/tree/2.x

2:使用maven对下载的es-ik源码进行编译(mvn clean package -DskipTests)

3:把编译后的target/releases下的elasticsearch-analysis-ik-1.10.3.zip文件拷贝到ES_HOME/plugins/ik目录下面,然后使用unzip命令解压

如果unzip命令不存在,则安装:yum install -y unzip

4:重启es服务

5:测试分词效果: curl ‘http://your ip:9200/zhouls/_analyze?analyzer=ik_max_word&pretty=true‘ -d ‘{"text":"这里是好记性不如烂笔头感叹号的博客们"}‘

   注意:若你是单节点的es集群的话,则只需在一台部署es-ik。若比如像我这里的话,是3台,则需在三台都部署es-ik,且配置要一样。

 

 

 

  第一步:在浏览器里,输入https://github.com/

技术分享

 

 

 

 

  第二步:https://github.com/search?utf8=%E2%9C%93&q=elasticsearch-ik

技术分享

 

 

 

  

  第三步:https://github.com/medcl/elasticsearch-analysis-ik  ,点击2.x

技术分享

 技术分享

 

 

  第四步:https://github.com/medcl/elasticsearch-analysis-ik/tree/2.x 得到

技术分享

 

 

  第五步:找到之后,点击,下载,这里选择离线安装。

技术分享

  技术分享

 

 

 

  第六步:将Elasticsearch之中文分词器插件es-ik的压缩包解压下,初步认识下其目录结构,比如我这里放到D盘下来认识下。并为后续的maven编译做基础。

技术分享

 

  

 

 

  第七步:用本地安装好的maven来编译

 技术分享

Microsoft Windows [版本 6.1.7601]
版权所有 (c) 2009 Microsoft Corporation。保留所有权利。

C:\Users\Administrator>cd D:\elasticsearch-analysis-ik-2.x

C:\Users\Administrator>d:

D:\elasticsearch-analysis-ik-2.x>mvn

 

 

   得到,

 技术分享

 

 

 

技术分享

D:\elasticsearch-analysis-ik-2.x>mvn clean package -DskipTests
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building elasticsearch-analysis-ik 1.10.4
[INFO] ------------------------------------------------------------------------
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/plugins/maven-enforcer-plugin/1.0/maven-enforcer-plugin-1.0.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/plugins/maven-enforcer-plugin/1.0/maven-enforcer-plugin-1.0.pom (7 KB at
2.5 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/enforcer/enforcer/1.0/enforcer-1.0.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/enforcer/enforcer/1.0/enforcer-1.0.pom (12 KB at 19.5 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/maven-parent/17/maven-parent-17.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/maven-parent/17/maven-parent-17.pom (25 KB at 41.9 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/plugins/maven-enforcer-plugin/1.0/maven-enforcer-plugin-1.0.jar
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/plugins/maven-enforcer-plugin/1.0/maven-enforcer-plugin-1.0.jar (22 KB a
t 44.2 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/plugins/maven-compiler-plugin/3.5.1/maven-compiler-plugin-3.5.1.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/plugins/maven-compiler-plugin/3.5.1/maven-compiler-plugin-3.5.1.pom (10
KB at 35.3 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/plugins/maven-plugins/28/maven-plugins-28.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/plugins/maven-plugins/28/maven-plugins-28.pom (12 KB at 42.1 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/maven-parent/27/maven-parent-27.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/maven-parent/27/maven-parent-27.pom (40 KB at 94.0 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/apache/17/apache-17.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach

 

   需要等待一会儿,这个根据自己的网速快慢。

 

 

 

 

 

 

 

 

 

 

 

 

 

 

  这里,需要本地(即windows系统)里,提前安装好maven,需要来编译。若没安装的博友,请移步,见

Eclipse下Maven新建项目、自动打依赖jar包(包含普通项目和Web项目)

  

 

 

 

      最后得到是,

技术分享

 

Elasticsearch之中文分词器插件es-ik

标签:logs   compiler   color   lease   target   offset   目录   hadoop   分词   

原文地址:http://www.cnblogs.com/zlslch/p/6440373.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!