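I. Introduction to IK
IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since version 1.0 was released in December 2006, IK Analyzer has gone through four major versions. It started out as a Chinese segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with grammar analysis algorithms. Starting with version 3.0, IK became a general-purpose segmentation component for Java, independent of the Lucene project, while still providing a default optimized implementation for Lucene. The 2012 version added a simple segmentation disambiguation algorithm, marking IK's evolution from pure dictionary-based segmentation toward simulated semantic segmentation.
IK Analyzer 2012 features: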
II. Setting up the build environment
The IK analyzer downloaded from GitHub is a source package, so it has to be built with Maven.
1. Download Maven
# wget http://mirrors.hust.edu.cn/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
2. Extract
# tar zxf apache-maven-3.3.9-bin.tar.gz -C /usr/local/
3. Configure environment variables
# vi /etc/profile
export MAVEN_HOME=/usr/local/apache-maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin
# source /etc/profile
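After reloading the profile, it can be worth confirming that the Maven bin directory really ended up on PATH before starting the build. A minimal sketch in POSIX shell; `path_has_dir` is a hypothetical helper name, not part of Maven or of the original setup.

```shell
# Hypothetical helper: succeed if directory $1 appears in the
# colon-separated PATH-style string $2.
path_has_dir() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (assumes MAVEN_HOME was exported as in step 3):
#   path_has_dir "$MAVEN_HOME/bin" "$PATH" && mvn -v
```

If the check fails, the usual cause is that the profile was edited but not re-sourced in the current shell.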
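III. Installing the IK analysis plugin
1. Download
Download the IK release that matches your Elasticsearch version from GitHub: https://github.com/medcl/elasticsearch-analysis-ik. Alternatively, fetch the analyzer source with git clone https://github.com/medcl/elasticsearch-analysis-ik.
2. Extract and build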
# unzip elasticsearch-analysis-ik-master.zip
# cd elasticsearch-analysis-ik-master/
# mvn clean package
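3. Copy the built IK plugin into the Elasticsearch plugins directory ($elasticsearch stands for the Elasticsearch installation directory)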
# mkdir $elasticsearch/plugins/ik
# cp target/releases/elasticsearch-analysis-ik-1.9.3.zip $elasticsearch/plugins/ik/
# cd $elasticsearch/plugins/ik/
# unzip elasticsearch-analysis-ik-1.9.3.zip
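The version in the zip file name (1.9.3 here) depends on which revision of the source was built, so a hardcoded name may not match. A small sketch, in POSIX shell, that locates the release archive by pattern instead; `find_ik_zip` is a hypothetical helper, and it assumes the build writes exactly one zip to target/releases as in the commands above.

```shell
# Hypothetical helper: print the path of the single release zip under
# $1/target/releases, or fail if there is not exactly one match.
find_ik_zip() {
  ik_dir="$1/target/releases"
  ik_count=0
  ik_found=""
  for ik_f in "$ik_dir"/elasticsearch-analysis-ik-*.zip; do
    # When nothing matches, the unexpanded pattern is skipped here.
    if [ -e "$ik_f" ]; then
      ik_count=$((ik_count + 1))
      ik_found="$ik_f"
    fi
  done
  if [ "$ik_count" -ne 1 ]; then
    echo "expected exactly one release zip in $ik_dir, found $ik_count" >&2
    return 1
  fi
  printf '%s\n' "$ik_found"
}

# Usage, run from the source directory after "mvn clean package":
#   zip=$(find_ik_zip .) && cp "$zip" "$elasticsearch/plugins/ik/"
```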
4. Restart Elasticsearch so that the IK plugin takes effect
# /etc/init.d/elasticsearch restart
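Once the node is back up, one way to confirm that the plugin was picked up is to look at the _cat/plugins listing. A minimal sketch, assuming the plugin reports a name containing "analysis-ik" and that Elasticsearch listens on localhost:9200; `has_ik_plugin` is a hypothetical helper that only inspects text on stdin.

```shell
# Hypothetical helper: succeed if the plugin listing read from stdin
# mentions an analysis-ik plugin.
has_ik_plugin() {
  grep -q 'analysis-ik'
}

# Usage (host and port are assumptions):
#   curl -s 'http://localhost:9200/_cat/plugins?v' | has_ik_plugin && echo "ik plugin loaded"
```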
IV. Testing IK tokenization
1. Create an index named "index"
# curl -XPUT http://localhost:9200/index
2. Create a mapping for "index"
# curl -XPOST http://localhost:9200/index/fulltext/_mapping -d '
{
  "fulltext": {
    "_all": {
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word",
      "term_vector": "no",
      "store": "false"
    },
    "properties": {
      "content": {
        "type": "string",
        "store": "no",
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word",
        "include_in_all": "true",
        "boost": 8
      }
    }
  }
}'
3. Test
# curl 'http://10.10.10.26:9200/index/_analyze?analyzer=ik&pretty=true' -d '{"text":"中华人民共和国国歌"}'
The output looks like this:
{
  "tokens" : [
    { "token" : "中华人民共和国", "start_offset" : 0, "end_offset" : 7, "type" : "CN_WORD", "position" : 0 },
    { "token" : "中华人民", "start_offset" : 0, "end_offset" : 4, "type" : "CN_WORD", "position" : 1 },
    { "token" : "中华", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 2 },
    { "token" : "华人", "start_offset" : 1, "end_offset" : 3, "type" : "CN_WORD", "position" : 3 },
    { "token" : "人民共和国", "start_offset" : 2, "end_offset" : 7, "type" : "CN_WORD", "position" : 4 },
    { "token" : "人民", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 5 },
    { "token" : "共和国", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 6 },
    { "token" : "共和", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 7 },
    { "token" : "国", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 8 },
    { "token" : "国歌", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 9 }
  ]
}
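To eyeball results quickly, or to compare the output of different analyzers, the token strings can be pulled out of the JSON response. A rough sketch using grep and sed, assuming the response format shown above and a grep that supports -o (GNU and BSD grep do). `extract_tokens` is a hypothetical helper, and a real JSON parser would be more robust than this pattern match.

```shell
# Hypothetical helper: read an _analyze JSON response on stdin and
# print one token string per line.
extract_tokens() {
  grep -o '"token" *: *"[^"]*"' |
    sed -e 's/^"token" *: *"//' -e 's/"$//'
}

# Usage (host and port are assumptions; adjust to your node):
#   curl -s 'http://localhost:9200/index/_analyze?analyzer=ik&pretty=true' \
#     -d '{"text":"中华人民共和国国歌"}' | extract_tokens
```

The pattern stops at the first double quote inside a token, so it only suits plain tokens like the ones above.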
GitHub repository for elasticsearch-analysis-ik: https://github.com/medcl/elasticsearch-analysis-ik
Original article: http://www.cnblogs.com/Orgliny/p/5520292.html