06. Chinese Analyzer: IKAnalyzer

Date: 2017-02-28 13:26:22

Tags: attribute content parser add word attr bom mat class

Why use IKAnalyzer

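Lucene's built-in standard analyzer cannot segment Chinese text into words.
Lucene's built-in Chinese analyzers segment inaccurately.
IKAnalyzer supports configuring blocked keywords (stop words) and new vocabulary.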
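
To make the third point concrete, here is a minimal sketch of dictionary-based forward maximum matching using only the JDK. It is not IKAnalyzer's actual algorithm: the class `ToySegmenter`, the dictionary contents, and the sample sentence are all hypothetical, chosen only to show how adding a new word or a stop word changes the output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy segmenter: forward maximum matching over a word set.
// This is NOT IKAnalyzer's implementation, only an illustration.
class ToySegmenter {
    private final Set<String> dictionary;
    private final Set<String> stopWords;
    private final int maxWordLength;

    ToySegmenter(Set<String> dictionary, Set<String> stopWords) {
        this.dictionary = dictionary;
        this.stopWords = stopWords;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Try the longest candidate first, then shrink down to one character.
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            String token = text.substring(pos, end);
            // Stop words are dropped from the output.
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("中文", "分词"));
        Set<String> stop = new HashSet<>(Arrays.asList("的"));
        String text = "中文分词的区块链";
        System.out.println(new ToySegmenter(dict, stop).segment(text));
        // Simulate adding a new word to the extension dictionary.
        dict.add("区块链");
        System.out.println(new ToySegmenter(dict, stop).segment(text));
    }
}
```

With the initial dictionary the unknown word falls apart into single characters and the sketch prints `[中文, 分词, 区, 块, 链]`; after the new word is added it prints `[中文, 分词, 区块链]`. In both runs the stop word is dropped.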

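Usage examples

When building the index

(Omitted.)

When querying with QueryParser

(Omitted.)

Standalone use for tokenization

Custom dictionaries

Define an IKAnalyzer.cfg.xml file on the classpath, as follows: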
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">  
<properties>  
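    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionary here -->
    <entry key="ext_dict">dicdata/mydict.dic</entry>
    <!-- Users can configure their own extension stop-word dictionary here -->
    <entry key="ext_stopwords">dicdata/ext_stopword.dic</entry>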
</properties>
Edit the file dicdata/mydict.dic on the classpath; it holds the extension dictionary. Stop words go in dicdata/ext_stopword.dic.
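
The configuration file uses the standard Java XML properties format, so its entries can be inspected with `java.util.Properties.loadFromXML`. The sketch below only illustrates that format. The class `IkConfigPeek` is hypothetical, and splitting a value on `;` to allow several dictionary paths is an assumption about IKAnalyzer's convention that should be checked against the version in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper that reads an IKAnalyzer.cfg.xml-style file
// with the JDK's own XML properties loader.
class IkConfigPeek {
    // In-memory copy of the configuration, used here instead of a classpath file.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <comment>IK Analyzer extension configuration</comment>\n"
        + "  <entry key=\"ext_dict\">dicdata/mydict.dic</entry>\n"
        + "  <entry key=\"ext_stopwords\">dicdata/ext_stopword.dic</entry>\n"
        + "</properties>\n";

    // Returns the paths listed under the given key, or an empty list if the key is absent.
    static List<String> paths(InputStream xml, String key) {
        Properties props = new Properties();
        try {
            props.loadFromXML(xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> result = new ArrayList<>();
        String value = props.getProperty(key);
        if (value != null) {
            // Assumption: several paths may be separated by ';'.
            for (String path : value.split(";")) {
                String trimmed = path.trim();
                if (!trimmed.isEmpty()) {
                    result.add(trimmed);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] bytes = SAMPLE.getBytes(StandardCharsets.UTF_8);
        System.out.println(paths(new ByteArrayInputStream(bytes), "ext_dict"));
        System.out.println(paths(new ByteArrayInputStream(bytes), "ext_stopwords"));
    }
}
```

Run as written, this prints `[dicdata/mydict.dic]` and then `[dicdata/ext_stopword.dic]`. A check like this is a quick way to confirm that the keys are spelled as expected before debugging the dictionaries themselves.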
Note: mydict.dic and ext_stopword.dic must be encoded as UTF-8 without a BOM.
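
Java's UTF-8 decoder does not strip a leading byte order mark, so a BOM at the start of a dictionary file would typically end up attached to the first entry. The following is a minimal sketch, using only the JDK, for detecting and removing it. The class `BomCheck` and its method names are hypothetical and not part of IKAnalyzer.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical utility for checking dictionary files before deployment.
class BomCheck {
    // The three bytes that encode U+FEFF in UTF-8.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True if the content starts with the UTF-8 byte order mark.
    static boolean hasUtf8Bom(byte[] content) {
        return content.length >= UTF8_BOM.length
            && content[0] == UTF8_BOM[0]
            && content[1] == UTF8_BOM[1]
            && content[2] == UTF8_BOM[2];
    }

    // Returns the content without a leading BOM; unchanged if none is present.
    static byte[] stripUtf8Bom(byte[] content) {
        return hasUtf8Bom(content)
            ? Arrays.copyOfRange(content, UTF8_BOM.length, content.length)
            : content;
    }

    // Usage: java BomCheck dicdata/mydict.dic dicdata/ext_stopword.dic
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path path = Paths.get(arg);
            byte[] content = Files.readAllBytes(path);
            if (hasUtf8Bom(content)) {
                // Rewrite the file in place without the BOM.
                Files.write(path, stripUtf8Bom(content));
                System.out.println(arg + ": BOM removed");
            } else {
                System.out.println(arg + ": no BOM");
            }
        }
    }
}
```

Note that `main` rewrites the given files in place, so it is best run on a copy first.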

Viewing the tokenization result

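import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

// Create the analyzer
Analyzer analyzer = new IKAnalyzer();
// Obtain a TokenStream for the text
TokenStream tokenStream = analyzer.tokenStream("content", new StringReader("Lucene is a Java full-text search engine"));
// Register the attribute that exposes each token's offsets
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
// Register the attribute that exposes each token's term text
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
// Reset the TokenStream to its initial state; otherwise an exception is thrown
tokenStream.reset();
while (tokenStream.incrementToken()) {
    System.out.println("-----------------");
    // Start offset
    System.out.print("-->" + offsetAttribute.startOffset());
    // End offset
    System.out.print("-->" + offsetAttribute.endOffset());
    // Term text
    System.out.println("-->" + charTermAttribute.toString());
}
// Finish the stream and release its resources
tokenStream.end();
tokenStream.close();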

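Notes

Whichever analyzer is used to build the index, it is best to query with the same analyzer.
The name of IKAnalyzer's configuration file can be customized.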
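
The first note can be illustrated without Lucene. In this minimal sketch, a hypothetical `ToyIndex` stores the terms produced at index time; a lookup with terms from a different tokenization finds nothing. Both tokenizers are stand-ins invented for the example, not real analyzers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical in-memory inverted index, for illustration only.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Function<String, List<String>> indexTokenizer;

    ToyIndex(Function<String, List<String>> indexTokenizer) {
        this.indexTokenizer = indexTokenizer;
    }

    // Tokenizes the text with the index-time tokenizer and records each term.
    void add(int docId, String text) {
        for (String term : indexTokenizer.apply(text)) {
            postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
        }
    }

    // Returns the ids of documents that contain every query term.
    Set<Integer> search(List<String> queryTerms) {
        Set<Integer> hits = null;
        for (String term : queryTerms) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (hits == null) {
                hits = new HashSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        return hits == null ? new HashSet<>() : hits;
    }

    // Stand-in for an analyzer that emits one token per character.
    static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Stand-in for a word-level analyzer that keeps the input as one term.
    static List<String> wholeWord(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex(ToyIndex::perCharacter);
        index.add(1, "分词");
        // Same tokenization at query time: the document is found.
        System.out.println(index.search(perCharacter("分词")));
        // Different tokenization at query time: no term matches.
        System.out.println(index.search(wholeWord("分词")));
    }
}
```

The first lookup prints `[1]` and the second prints `[]`, because the index only contains single-character terms while the second query asks for a two-character term.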


Original article: http://www.cnblogs.com/wesly186/p/a5768e92e71dded6c334ae1250aa3659.html
