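Search over Chinese text cannot get around word segmentation, so here I learn how to do Chinese word segmentation with the IKAnalyzer tokenizer.
Reference documentation
Required JARs and files
1) IKAnalyzer2012FF_u1.jar: preferably this exact version, since the 2012FF builds target the Lucene 4.x API used by the code below [Baidu Netdisk download]
2) IKAnalyzer.cfg.xml [Baidu Netdisk download]
3) keyWord.dic and stopWord.dic dictionaries
Main classes
1) IKAnalyzer, created with new IKAnalyzer(boolean useSmart); @param useSmart: whether to use smart segmentation, which trades performance for segmentation quality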
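Problems encountered
1) Segmentation worked, but searches returned nothing
The successful segmentation was an illusion. When building the index I used StringField for the field and overlooked one thing: StringField is not tokenized, so the whole value is indexed as a single term. Switching to TextField fixed it.
2) The keyword '高铁' (high-speed rail) was added to keyWord.dic, but the text was still segmented into '高' and '铁'
IKAnalyzer's configuration and dictionary files must meet 3 conditions:
a) Dictionary file names must end with .dic
b) IKAnalyzer.cfg.xml must be placed in the src directory (that is, at the root of the classpath). The .dic files can live anywhere, as long as their paths are configured correctly in IKAnalyzer.cfg.xml
c) The .dic files must be encoded as UTF-8 without a BOM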
src
    keyWord.dic
    IKAnalyzer.cfg.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">keyWord.dic</entry>
    <!-- Users can configure their own extension stop-word dictionaries here
    <entry key="ext_stopwords">stopword.dic;</entry>
    -->
</properties>
I put both keyWord.dic and IKAnalyzer.cfg.xml directly under src.
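The three conditions above are easy to get wrong without noticing, so here is a rough sanity check in plain Java. It is only a sketch under some assumptions: the class and method names (DictConfigCheck, extDictNames, hasUtf8Bom) and the second dictionary name extra.dic are made up for illustration, and the semicolon-separated ext_dict format is inferred from the stopword.dic; entry in the config above. It relies on the fact that the config file uses the standard Java properties DTD, so java.util.Properties.loadFromXML can read it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class DictConfigCheck {

    /** Returns the dictionary names listed under ext_dict (assumed semicolon-separated). */
    public static List<String> extDictNames(String cfgXml) {
        Properties props = new Properties();
        try {
            // The config declares the standard Java properties DTD, so loadFromXML can parse it.
            props.loadFromXML(new ByteArrayInputStream(cfgXml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> names = new ArrayList<>();
        for (String part : props.getProperty("ext_dict", "").split(";")) {
            if (!part.trim().isEmpty()) {
                names.add(part.trim());
            }
        }
        return names;
    }

    /** True if the bytes begin with the UTF-8 byte order mark EF BB BF. */
    public static boolean hasUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
                + "<properties>\n"
                + "  <entry key=\"ext_dict\">keyWord.dic;extra.dic;</entry>\n" // extra.dic is hypothetical
                + "</properties>\n";
        for (String name : extDictNames(xml)) {
            System.out.println(name + " ends with .dic: " + name.endsWith(".dic"));
        }
        byte[] savedWithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println("BOM detected: " + hasUtf8Bom(savedWithBom));
    }
}
```

For a real project you would read IKAnalyzer.cfg.xml and the first three bytes of each .dic file from disk instead of using the inline sample values.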
Checking the segmentation result
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

IKAnalyzer analyzer = new IKAnalyzer(true);
TokenStream ts = analyzer.tokenStream("keyWord", "高铁");
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
try {
    ts.reset(); // Resets this stream to the beginning. (Required)
    while (ts.incrementToken()) {
        System.out.println("words: " + termAtt.toString());
    }
    ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
} finally {
    ts.close(); // Release resources associated with this stream.
}
Without the dictionary loaded
<!-- Users can configure their own extension dictionaries here <entry key="ext_dict">dic.dic</entry> -->
words: 高
words: 铁
----------------------------------------------------------------------
With the dictionary loaded
<entry key="ext_dict">dic.dic</entry>
加载扩展词典:dic.dic
words: 高铁
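To see why one dictionary entry changes the output above, here is a toy dictionary-based segmenter using forward maximum matching. This is not IKAnalyzer's actual algorithm, just a minimal sketch of the general idea that a word can only come out as one token if the dictionary knows it. The class name ToySegmenter is invented for illustration, and the string "\u9ad8\u94c1" is 高铁 written with Unicode escapes so the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ToySegmenter {

    /**
     * Forward maximum matching: at each position take the longest dictionary
     * word that starts there; fall back to a single character if none matches.
     */
    public static List<String> segment(String text, Set<String> dict) {
        int maxLen = 1;
        for (String w : dict) {
            maxLen = Math.max(maxLen, w.length());
        }
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String gaoTie = "\u9ad8\u94c1"; // the two-character word for high-speed rail
        Set<String> emptyDict = new HashSet<>();
        Set<String> withWord = new HashSet<>(Arrays.asList(gaoTie));
        System.out.println(segment(gaoTie, emptyDict).size()); // 2: one token per character
        System.out.println(segment(gaoTie, withWord).size());  // 1: the dictionary word
    }
}
```

With an empty dictionary every character becomes its own token, which matches the "without the dictionary" output; once the word is in the dictionary it comes out whole.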
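Indexing
import java.io.File;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

Directory indexDir = FSDirectory.open(new File("E:/data/index"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, new IKAnalyzer(true));
IndexWriter indexWriter = new IndexWriter(indexDir, config);
indexWriter.deleteAll(); // start from an empty index
Document doc = new Document();
// TextField is analyzed; StringField would index the whole value as a single term
doc.add(new TextField("name", "爸爸去哪儿", Field.Store.YES));
System.out.println(doc);
indexWriter.addDocument(doc);
indexWriter.close();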
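The StringField versus TextField problem described earlier decides whether the search in the next section finds this document, so it may help to see it in miniature. The following is a toy inverted index in plain Java, not Lucene code: ToyIndex and its methods are invented for illustration, and the three-token segmentation of the title is an assumption rather than verified IKAnalyzer output. The escapes "\u7238\u7238" and "\u7238\u7238\u53bb\u54ea\u513f" spell 爸爸 and 爸爸去哪儿.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ToyIndex {
    // term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Untokenized field (the StringField case): the whole value is one term. */
    public void addUntokenized(int docId, String value) {
        addTerm(docId, value);
    }

    /** Tokenized field (the TextField case): every token is its own term. */
    public void addTokenized(int docId, List<String> tokens) {
        for (String token : tokens) {
            addTerm(docId, token);
        }
    }

    private void addTerm(int docId, String term) {
        postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
    }

    /** Exact term lookup, which is what a single-term query boils down to. */
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term, new HashSet<>());
    }

    public static void main(String[] args) {
        String baba = "\u7238\u7238";               // the query term
        String title = baba + "\u53bb\u54ea\u513f"; // the indexed title

        ToyIndex untokenized = new ToyIndex();
        untokenized.addUntokenized(1, title);

        ToyIndex tokenized = new ToyIndex();
        // Assumed segmentation of the title into three tokens.
        tokenized.addTokenized(1, Arrays.asList(baba, "\u53bb", "\u54ea\u513f"));

        System.out.println(untokenized.search(baba)); // []
        System.out.println(tokenized.search(baba));   // [1]
    }
}
```

In the untokenized index the only term is the full title, so a query for a part of it matches nothing, which is the same symptom as problem 1) above.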
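Searching
import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

long start = System.currentTimeMillis();
IndexReader indexReader = DirectoryReader.open(FSDirectory.open(new File("E:/data/index")));
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Query the "name" field, the only field written by the indexing code
QueryParser parser = new QueryParser(Version.LUCENE_46, "name", new IKAnalyzer(true));
Query query = parser.parse("爸爸");
TopDocs results = indexSearcher.search(query, 10);
ScoreDoc[] hits = results.scoreDocs;
int totalHits = results.totalHits;
// Loop over hits.length: at most 10 hits are returned even when totalHits is larger
for (int i = 0; i < hits.length; i++) {
    Document doc = indexSearcher.doc(hits[i].doc);
    System.out.println("[" + doc.get("name") + "] ");
    System.out.println();
}
indexReader.close();
long end = System.currentTimeMillis();
System.out.println("Found " + totalHits + " records in " + (end - start) + " ms");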
Lucene in Practice: Chinese Word Segmentation with IKAnalyzer, via bubuko.com
Original article: http://www.cnblogs.com/erbin/p/3925943.html