Tags: des Lucene style blog http io ar os usage
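A salute to the great principle that real knowledge comes from practice!
I looked into full-text search before, but at the time I focused on usage, and specifically on zoie, a tool built on top of Lucene, so I never had time to study the actual implementation. Now that I have some free time, I am going through the official site to study Lucene, the root of this full-text search stack. My knowledge is limited, so much of this is shallow and may contain mistakes; please bear with me and point out any errors.
This post skips the usual introductions and citations and starts directly from the demo that ships with Lucene.
I am using Lucene 4.10.2. Download location: (link). Since I work on Windows, I downloaded the zip package directly; after extracting it, the directory structure looks roughly like this: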
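analysis--analyzers (tokenizers)
benchmark--benchmarking tools
classification--classification
codecs--codecs
core--the Lucene core JAR
demo--examples (includes a WAR package)
docs--explanatory documentation and API docs
expressions--expressions
facet--faceted (aggregated) search
grouping--result grouping
highlighter--highlighting
join--index-time and query-time joins
memory--in-memory index
misc--index tools and other miscellaneous code
queries--filters and queries
queryparser--query parsers and the parsing framework
replicator--index replication
sandbox--contributions from third parties and new ideas
spatial--spatial search
suggest--auto-suggest and spell checking
test-framework--test framework
This post studies the demo in the demo folder. Lucene's online documentation: (link). The instructions for using the demo are the first link in the Getting Started section of the documentation home page: (link). The steps are as follows: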
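1. Create a project in Eclipse; the name is up to you. This differs from Lucene's demo guide, which assumes a non-IDE environment.
2. The guide describes putting the dependency JARs on the classpath: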
Setting your CLASSPATH
First, you should download the latest Lucene distribution and then extract it to a working directory.
You need four JARs: the Lucene JAR, the queryparser JAR, the common analysis JAR, and the Lucene demo JAR. You should see the Lucene JAR file in the core/ directory you created when you extracted the archive -- it should be named something like lucene-core-{version}.jar. You should also see files called lucene-queryparser-{version}.jar, lucene-analyzers-common-{version}.jar and lucene-demo-{version}.jar under queryparser, analysis/common/ and demo/, respectively.
Put all four of these files in your Java CLASSPATH.
In Eclipse, this simply means adding the dependency JARs to the project's build path.
[Screenshot: the four JARs on the Eclipse build path]
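For readers who are not using Eclipse, the four JAR locations quoted from the guide can be turned into a classpath string. The following is only a minimal sketch: the class name DemoClasspath, its methods, and the extraction directory F:/lucene-4.10.2 are made up for illustration; only the relative JAR paths come from the guide text above.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: lists the four demo JARs relative to the extracted Lucene directory. */
class DemoClasspath {

    /** Relative paths follow the layout described in the quoted guide text. */
    static List<String> jarPaths(String version) {
        return Arrays.asList(
            "core/lucene-core-" + version + ".jar",
            "queryparser/lucene-queryparser-" + version + ".jar",
            "analysis/common/lucene-analyzers-common-" + version + ".jar",
            "demo/lucene-demo-" + version + ".jar");
    }

    /** Joins the four JARs into one classpath string using the given separator. */
    static String classpath(String luceneHome, String version, String separator) {
        StringBuilder sb = new StringBuilder();
        for (String jar : jarPaths(version)) {
            if (sb.length() > 0) {
                sb.append(separator);
            }
            sb.append(luceneHome).append('/').append(jar);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "F:/lucene-4.10.2" is an assumed extraction directory, not taken from the article.
        System.out.println(classpath("F:/lucene-4.10.2", "4.10.2", File.pathSeparator));
    }
}
```

The printed string can be passed to java -cp; File.pathSeparator is ';' on Windows and ':' on Unix-like systems.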
3. Index the files, that is, create the index files. The test class I created is DemoIndexWriter.java:
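package lucene;

import org.apache.lucene.demo.IndexFiles;

public class DemoIndexWriter {
    public static void main(String[] args) {
        // -docs must point to a folder that actually contains files to index
        String[] arg0 = new String[]{"-docs", "F:/workTestSpace/luceneDemo/src"};
        IndexFiles.main(arg0);
    }
}
This class calls the main method of IndexFiles from lucene-demo-4.10.2.jar. Note that the value of -docs must be a folder that contains files (it must not be an empty folder, otherwise there is nothing to index). After running it, the console output is as follows ->
[Screenshot: console output of the indexing run]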
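The output shows that my two test class files have been indexed and that the index files were placed in the index directory under the project directory. After refreshing the project, the index folder shows up, as in the figure ->
[Screenshot: the index folder in the project tree]
Because I had already created index files before and this run re-created them, the old _0* files were all deleted. What each of these files means will be explained in a later post.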
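Since an empty -docs folder leaves nothing to index, a small pre-check before calling IndexFiles.main can make that situation obvious. This is a sketch under assumptions: DocsDirCheck and hasIndexableFiles are hypothetical names, and this check is not part of the Lucene demo itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Hypothetical pre-check, not part of the Lucene demo. */
class DocsDirCheck {

    /** Returns true if dir is a directory that contains at least one regular file at any depth. */
    static boolean hasIndexableFiles(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.anyMatch(Files::isRegularFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults to the current directory when no argument is given.
        Path docs = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(docs + " has indexable files: " + hasIndexableFiles(docs));
    }
}
```

Calling hasIndexableFiles on the -docs path before IndexFiles.main lets the test class print a clear message instead of silently producing an index with no documents.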
4. Create the search test class DemoIndexReader ->
package lucene;

import org.apache.lucene.demo.SearchFiles;

public class DemoIndexReader {
    public static void main(String[] args) throws Exception {
        SearchFiles.main(args);
    }
}
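This calls the main method of SearchFiles from the demo package; no arguments need to be passed. Then type some queries into the console ->
[Screenshot: console session with the search prompts]
Searching for a random, meaningless string returns 0 results. Searching for 'String' returns 2 results, because both files contain String, and the results are paginated.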
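Why 'String' matches both files while a random string matches none can be pictured with a toy inverted index. This is a deliberately simplified sketch, not how Lucene is implemented: ToyIndex, its tokenize method (a crude stand-in for an analyzer), and the file names A.java and B.java are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: maps each token to the set of file paths containing it. */
class ToyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Crude stand-in for an analyzer: lowercase, then split on non-alphanumeric characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Records every token of the contents under the given path. */
    void add(String path, String contents) {
        for (String token : tokenize(contents)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(path);
        }
    }

    /** Single-term lookup; the query goes through the same tokenize step as the documents. */
    Set<String> search(String query) {
        List<String> tokens = tokenize(query);
        if (tokens.isEmpty()) {
            return Collections.emptySet();
        }
        return postings.getOrDefault(tokens.get(0), Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        // Hypothetical file names and contents standing in for the two test classes.
        index.add("A.java", "public static void main(String[] args)");
        index.add("B.java", "String[] arg0 = new String[]{}");
        System.out.println(index.search("String").size());   // prints 2
        System.out.println(index.search("qwzxjkvb").size()); // prints 0
    }
}
```

Both documents and the query pass through the same tokenize step, which is why a query for String finds the lowercased token string in both files.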
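That completes the run of Lucene's simple usage demo, and the search results can be seen directly. At the end of the overview page of the Lucene demo there is a source walkthrough of the two classes in the demo package, IndexFiles.java and SearchFiles.java (link). I will touch on them briefly here as well. Source of IndexFiles.java ->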
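usage is a variable holding the usage message;
the -index parameter points to the directory where the index is placed after it is created;
the -docs parameter specifies the directory to be indexed;
the -update parameter specifies whether new documents are added to the existing index files instead of creating the index from scratch.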
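How these three parameters could be read from the argument array can be sketched as below. This is an illustration modeled on the description above, not the demo's actual source; IndexArgs and its field names are hypothetical. The default of "index" matches the behavior seen earlier, where the index landed in the index directory when only -docs was passed.

```java
/** Hypothetical holder for the three command-line options described above. */
class IndexArgs {
    String indexPath = "index"; // where the index is written when -index is not given
    String docsPath = null;     // directory to index; required
    boolean create = true;      // -update switches this to false (keep the existing index)

    /** Scans the arguments left to right; options with a value consume the next element. */
    static IndexArgs parse(String[] args) {
        IndexArgs parsed = new IndexArgs();
        for (int i = 0; i < args.length; i++) {
            if ("-index".equals(args[i]) && i + 1 < args.length) {
                parsed.indexPath = args[++i];
            } else if ("-docs".equals(args[i]) && i + 1 < args.length) {
                parsed.docsPath = args[++i];
            } else if ("-update".equals(args[i])) {
                parsed.create = false;
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        IndexArgs parsed = parse(new String[] {"-docs", "F:/workTestSpace/luceneDemo/src"});
        System.out.println(parsed.indexPath + " " + parsed.docsPath + " " + parsed.create);
    }
}
```

The create flag is what later decides between the two open modes in the IndexWriter configuration.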
Code that configures the IndexWriter:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_10_0);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);
if (create) {
    // Create a new index in the directory, removing any
    // previously indexed documents:
    iwc.setOpenMode(OpenMode.CREATE);
} else {
    // Add new documents to an existing index:
    iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
}
// Optional: for better indexing performance, if you
// are indexing many documents, increase the RAM
// buffer. But if you do this, increase the max heap
// size to the JVM (eg add -Xmx512m or -Xmx1g):
//
// iwc.setRAMBufferSizeMB(256.0);
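IndexWriter writer = new IndexWriter(dir, iwc);
The code shows that the simplest indexing setup needs an analyzer (Analyzer) and an index writer (IndexWriter), and that the IndexWriter is configured through an IndexWriterConfig instance. Besides the analyzer and open mode set in this source, IndexWriterConfig can configure many other properties to suit different scenarios; I will analyze those later. Once the IndexWriter has been created, the actual index creation takes place in the indexDocs(writer, docDir); method ->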
Document doc = new Document();
// Add the path of the file as a field named "path". Use a
// field that is indexed (i.e. searchable), but don't tokenize
// the field into separate words and don't index term frequency
// or positional information:
Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
doc.add(pathField);
// Add the last modified date of the file a field named "modified".
// Use a LongField that is indexed (i.e. efficiently filterable with
// NumericRangeFilter). This indexes to milli-second resolution, which
// is often too fine. You could instead create a number based on
// year/month/day/hour/minutes/seconds, down the resolution you require.
// For example the long value 2011021714 would mean
// February 17, 2011, 2-3 PM.
doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
// Add the contents of the file to a field named "contents". Specify a Reader,
// so that the text of the file is tokenized and indexed, but not stored.
// Note that FileReader expects the file to be in UTF-8 encoding.
// If that's not the case searching for special characters will fail.
doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8))));
if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
    // New index, so we just add the document (no old document can be there):
    System.out.println("adding " + file);
    writer.addDocument(doc);
} else {
    // Existing index (an old copy of this document may have been indexed) so
    // we use updateDocument instead to replace the old one matching the exact
    // path, if present:
    System.out.println("updating " + file);
    writer.updateDocument(new Term("path", file.getPath()), doc);
}
Source of the SearchFiles.java class (link) ->
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
IndexSearcher searcher = new IndexSearcher(reader);
// :Post-Release-Update-Version.LUCENE_XY:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_10_0);
BufferedReader in = null;
if (queries != null) {
    in = new BufferedReader(new InputStreamReader(new FileInputStream(queries), StandardCharsets.UTF_8));
} else {
    in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
}
// :Post-Release-Update-Version.LUCENE_XY:
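QueryParser parser = new QueryParser(Version.LUCENE_4_10_0, field, analyzer);
The key components of the search process are the IndexReader, the IndexSearcher obtained from the reader, the analyzer (it must be the same analyzer that was used at indexing time, otherwise the query results cannot be guaranteed to be correct), and the QueryParser, which in turn produces the query object (Query).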
TopDocs results = searcher.search(query, 5 * hitsPerPage);
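ScoreDoc[] hits = results.scoreDocs;
Document doc = searcher.doc(hits[i].doc);
The returned Document can then be used to read the content of each hit; in this demo only the path field is stored, so that is what can be read back.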
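The search call above collects up to 5 * hitsPerPage hits, and each page shown in the console is then a window over the hits array. The window arithmetic can be sketched as follows; HitPaging and its methods are hypothetical names, and 10 hits per page is assumed here purely for the example.

```java
/** Hypothetical sketch of paging over an already collected hits array. */
class HitPaging {

    /** Number of hits requested up front, mirroring the 5 * hitsPerPage call above. */
    static int maxCollected(int hitsPerPage) {
        return 5 * hitsPerPage;
    }

    /** Returns {start, end} (end exclusive) of the hits shown on the given zero-based page. */
    static int[] pageWindow(int collectedHits, int hitsPerPage, int page) {
        int start = Math.min(collectedHits, page * hitsPerPage);
        int end = Math.min(collectedHits, start + hitsPerPage);
        return new int[] {start, end};
    }

    /** Number of pages needed to show all collected hits. */
    static int pageCount(int collectedHits, int hitsPerPage) {
        return (collectedHits + hitsPerPage - 1) / hitsPerPage;
    }

    public static void main(String[] args) {
        int hitsPerPage = 10; // assumed page size for this example
        int collected = 2;    // e.g. the two hits for 'String'
        int[] window = pageWindow(collected, hitsPerPage, 0);
        for (int i = window[0]; i < window[1]; i++) {
            // In the real demo this is where searcher.doc(hits[i].doc) would be called.
            System.out.println("showing hit " + i);
        }
        System.out.println("pages: " + pageCount(collected, hitsPerPage));
    }
}
```

With two hits and ten hits per page, everything fits on the first page, which matches the single page of results seen for 'String'.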
That is all for now; I am worn out.
To be continued.
Original article: http://blog.csdn.net/yichenlian/article/details/40951327