Lucene Study Notes (1): Working Through the Lucene Demo

Date: 2014-11-10 10:09:32

A salute to the great principle that true knowledge comes from practice!

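I looked into full-text search before, but at the time I focused on usage, and mostly on zoie, a tool built on top of Lucene, so I never had time to study the underlying implementation properly. Now that I have some spare time, I am reading through the official site to study Lucene itself, the root of all this full-text search. My knowledge is limited, so much of this will be shallow and may contain mistakes; please bear with me and point out any errors!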

This post skips the usual introductions and quotations about Lucene and starts directly from the demo that ships with Lucene.

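I am using Lucene 4.10.2. Since I work on Windows, I downloaded the zip package from the download page; after extracting it, the directory structure looks roughly like this: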

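analysis--analyzers (tokenizers)
benchmark--benchmarking tools
classification--classification
codecs--codecs
core--the Lucene core jar
demo--examples (including a war package)
docs--guides and API documentation
expressions--expressions
facet--faceted search (counting results by category)
group--result grouping
highlighter--highlighting
join--index-time and query-time joins
memory--in-memory index
misc--index tools and other miscellaneous code
queries--filters and queries
queryparser--query parsers and parsing framework
replicator--index replication
sandbox--contributions from various third parties and new ideas
spatial--spatial search
suggest--auto-suggest and spell checking
test-framework--test framework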

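This post looks at the demo in the demo folder. Lucene's documentation is available online, and the guide to using the demo is the first link in the Getting Started section of the documentation home page. The steps are as follows: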

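1. Create a project in Eclipse, with any name you like. This differs from the official demo guide, which assumes a non-IDE environment.

2. The guide says to put the dependency jars on the classpath: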

Setting your CLASSPATH

First, you should download the latest Lucene distribution and then extract it to a working directory.

You need four JARs: the Lucene JAR, the queryparser JAR, the common analysis JAR, and the Lucene demo JAR. You should see the Lucene JAR file in the core/ directory you created when you extracted the archive -- it should be named something like lucene-core-{version}.jar. You should also see files called lucene-queryparser-{version}.jar, lucene-analyzers-common-{version}.jar and lucene-demo-{version}.jar under queryparser, analysis/common/ and demo/, respectively.

Put all four of these files in your Java CLASSPATH.

In Eclipse this simply means adding the dependency jars to the project's build path.

3. Index the files, that is, create the index files. The test class I created is DemoIndexWriter.java:

package lucene;

import org.apache.lucene.demo.IndexFiles;

public class DemoIndexWriter {

	public static void main(String[] args) {
		String[] arg0 = new String[]{"-docs","F:/workTestSpace/luceneDemo/src"};
		IndexFiles.main(arg0);
	}

}

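This class calls the main method of IndexFiles in lucene-demo-4.10.2.jar. Note that the value of -docs must be a folder that contains files (an empty folder leaves nothing to index). Running it prints the indexing progress to the console.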

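The output shows that my two test class files were indexed and that the index files were placed in the index directory under the project directory. After refreshing the project, the index folder appears.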

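Since I had already created index files before and this run created them anew, the original _0* files were all deleted. What each of these files means will be explained in a later post.

4. Create the search test class DemoIndexReader: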

package lucene;

import org.apache.lucene.demo.SearchFiles;

public class DemoIndexReader {

	public static void main(String[] args) throws Exception {
		SearchFiles.main(args);
	}

}

This calls the main method of SearchFiles in the demo package; no arguments are needed. Queries are then typed into the console.

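Searching for a random, meaningless string returns 0 results; searching for 'String' returns 2 results, because both files contain String, and the results are paged.

That completes a run of the simple Lucene demo, and the search results are easy to see. At the end of the demo's overview page, Lucene gives a short source walkthrough of the two classes in the demo package, IndexFiles.java and SearchFiles.java. I will touch on them briefly here as well, starting with the source of IndexFiles.java: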

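usage is a variable holding the usage message;

the -index parameter points to the directory where the index is placed once created;

the -docs parameter specifies the directory to index;

the -update parameter specifies whether new docs are added to the existing index files.

The code that configures the IndexWriter: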
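The way these three options interact can be sketched without Lucene on the classpath. The class below is a hypothetical stand-alone model of the option handling, not the demo's actual class; the field names are my own, and the default index directory of "index" is taken from the run described above.

```java
// Hypothetical stand-alone sketch of the option handling; not a Lucene class.
class IndexOptionsSketch {
    final String indexPath; // where the index is written
    final String docsPath;  // directory whose files are indexed
    final boolean create;   // true: build a fresh index; false: update the existing one

    IndexOptionsSketch(String indexPath, String docsPath, boolean create) {
        this.indexPath = indexPath;
        this.docsPath = docsPath;
        this.create = create;
    }

    static IndexOptionsSketch parse(String[] args) {
        String indexPath = "index"; // default when -index is not given
        String docsPath = null;
        boolean create = true;
        for (int i = 0; i < args.length; i++) {
            if ("-index".equals(args[i]) && i + 1 < args.length) {
                indexPath = args[++i];
            } else if ("-docs".equals(args[i]) && i + 1 < args.length) {
                docsPath = args[++i];
            } else if ("-update".equals(args[i])) {
                create = false; // a flag: it takes no value
            }
        }
        return new IndexOptionsSketch(indexPath, docsPath, create);
    }

    public static void main(String[] args) {
        IndexOptionsSketch o = parse(new String[] {"-docs", "F:/workTestSpace/luceneDemo/src"});
        System.out.println(o.indexPath + " | " + o.docsPath + " | create=" + o.create);
    }
}
```

With only -docs given, as in my DemoIndexWriter, the index goes to the default index directory and is created from scratch, which matches the old _0* files being deleted on a second run.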

      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_10_0);
      IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);

      if (create) {
        // Create a new index in the directory, removing any
        // previously indexed documents:
        iwc.setOpenMode(OpenMode.CREATE);
      } else {
        // Add new documents to an existing index:
        iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
      }

      // Optional: for better indexing performance, if you
      // are indexing many documents, increase the RAM
      // buffer.  But if you do this, increase the max heap
      // size to the JVM (eg add -Xmx512m or -Xmx1g):
      //
      // iwc.setRAMBufferSizeMB(256.0);

      IndexWriter writer = new IndexWriter(dir, iwc);

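The code shows that the simplest indexing procedure needs an analyzer (Analyzer) and an index-writing object (IndexWriter), and that IndexWriter is configured through an IndexWriterConfig instance. Besides the analyzer and open mode set in the source, IndexWriterConfig can configure many other properties to suit different scenarios; I will analyze these later. Once the IndexWriter is created, the actual index creation takes place, which is implemented in the indexDocs(writer, docDir); method: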

          Document doc = new Document();

          // Add the path of the file as a field named "path".  Use a
          // field that is indexed (i.e. searchable), but don't tokenize 
          // the field into separate words and don't index term frequency
          // or positional information:
          Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
          doc.add(pathField);

          // Add the last modified date of the file a field named "modified".
          // Use a LongField that is indexed (i.e. efficiently filterable with
          // NumericRangeFilter).  This indexes to milli-second resolution, which
          // is often too fine.  You could instead create a number based on
          // year/month/day/hour/minutes/seconds, down the resolution you require.
          // For example the long value 2011021714 would mean
          // February 17, 2011, 2-3 PM.
          doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));

          // Add the contents of the file to a field named "contents".  Specify a Reader,
          // so that the text of the file is tokenized and indexed, but not stored.
          // Note that FileReader expects the file to be in UTF-8 encoding.
          // If that's not the case searching for special characters will fail.
          doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8))));

          if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
            // New index, so we just add the document (no old document can be there):
            System.out.println("adding " + file);
            writer.addDocument(doc);
          } else {
            // Existing index (an old copy of this document may have been indexed) so 
            // we use updateDocument instead to replace the old one matching the exact 
            // path, if present:
            System.out.println("updating " + file);
            writer.updateDocument(new Term("path", file.getPath()), doc);
          }
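The comment on the "modified" field above suggests storing a coarser number such as 2011021714 instead of raw milliseconds. A minimal sketch of that conversion follows, assuming Java 8's java.time is available (the demo itself does not do this); the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper, not part of the Lucene demo.
class ModifiedResolutionSketch {
    // Packs a millisecond timestamp into a yyyyMMddHH number for the given time zone.
    static long toHourResolution(long epochMillis, ZoneId zone) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(zone);
        return t.getYear() * 1000000L
                + t.getMonthValue() * 10000L
                + t.getDayOfMonth() * 100L
                + t.getHour();
    }

    public static void main(String[] args) {
        // February 17, 2011, 14:30 UTC, the example date from the demo's comment.
        long millis = ZonedDateTime.of(2011, 2, 17, 14, 30, 0, 0, ZoneOffset.UTC)
                .toInstant().toEpochMilli();
        System.out.println(toHourResolution(millis, ZoneOffset.UTC)); // prints 2011021714
    }
}
```

The resulting long could then be passed to the LongField in place of file.lastModified(), as long as every document and every range query uses the same encoding and time zone.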

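Document represents a document, and Field is a component of a document (with settings for whether it is stored, indexed, tokenized and so on). I previously used version 2.9.1, which had only a single Field class whose properties all had to be set by hand; version 4.10.2 wraps many specific Field types, so look through them and pick the one that fits, which is much more convenient. IndexWriter's addDocument method adds a new document, while updateDocument updates an indexed document (if it exists it is deleted and re-added; if not, it is simply added).

The source of the SearchFiles.java class: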
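To make the difference between the two calls concrete, here is a toy in-memory model. It is only an illustration of the add versus delete-then-add semantics described above; the class is hypothetical, and "documents" are plain string pairs rather than Lucene Document objects.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: not a Lucene class.
class AddVsUpdateSketch {
    // Each "document" is modeled as {path, contents}.
    private final List<String[]> docs = new ArrayList<>();

    // Always appends, even if a document with the same path already exists.
    void addDocument(String path, String contents) {
        docs.add(new String[] {path, contents});
    }

    // Removes every document with this path, then appends the new one.
    void updateDocument(String path, String contents) {
        docs.removeIf(d -> d[0].equals(path));
        addDocument(path, contents);
    }

    // Counts the documents stored under the given path.
    int count(String path) {
        int n = 0;
        for (String[] d : docs) {
            if (d[0].equals(path)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        AddVsUpdateSketch index = new AddVsUpdateSketch();
        index.addDocument("A.java", "v1");
        index.addDocument("A.java", "v2");
        System.out.println(index.count("A.java")); // prints 2
        index.updateDocument("A.java", "v3");
        System.out.println(index.count("A.java")); // prints 1
    }
}
```

This is why the demo switches to updateDocument, keyed on the "path" term, whenever the index is not opened in CREATE mode: re-indexing the same folder would otherwise leave duplicate entries per file.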

    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
    IndexSearcher searcher = new IndexSearcher(reader);
    // :Post-Release-Update-Version.LUCENE_XY:
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_10_0);

    BufferedReader in = null;
    if (queries != null) {
      in = new BufferedReader(new InputStreamReader(new FileInputStream(queries), StandardCharsets.UTF_8));
    } else {
      in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
    }
    // :Post-Release-Update-Version.LUCENE_XY:
    QueryParser parser = new QueryParser(Version.LUCENE_4_10_0, field, analyzer);

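The key components of the search process are the IndexReader; the IndexSearcher obtained from the reader; the analyzer (it must be the same analyzer used at indexing time to ensure correct query results); and the QueryParser, from which the query object (Query) is obtained.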
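Why the analyzer has to match can be shown with a toy inverted index. This is a sketch under simplifying assumptions, not Lucene code: the two "analyzers" are hypothetical methods that only split on non-word characters, and one of them also lowercases, much as StandardAnalyzer lowercases its tokens.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these methods are Lucene APIs.
class AnalyzerMismatchSketch {
    // Splits on non-word characters and lowercases each token.
    static List<String> lowercasingAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\W+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Splits the same way but keeps the original case.
    static List<String> caseKeepingAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\W+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Builds term -> document ids, always using the lowercasing analyzer.
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : lowercasingAnalyze(docs.get(id))) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Returns ids of documents containing any of the already-analyzed query terms.
    static Set<Integer> search(Map<String, Set<Integer>> index, List<String> queryTerms) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : queryTerms) {
            Set<Integer> ids = index.get(term);
            if (ids != null) hits.addAll(ids);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("public static void main(String[] args)");
        docs.add("String name");
        Map<String, Set<Integer>> index = buildIndex(docs);
        // Same analyzer on both sides: the query term "string" matches both documents.
        System.out.println(search(index, lowercasingAnalyze("String")).size());
        // Different analyzer at query time: "String" is not a term in the index.
        System.out.println(search(index, caseKeepingAnalyze("String")).size());
    }
}
```

The first search finds both documents and the second finds none, even though the query text is identical. In the demo, searching for 'String' works because IndexFiles and SearchFiles both use StandardAnalyzer.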

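    // Collect enough hits for the first five pages:
    TopDocs results = searcher.search(query, 5 * hitsPerPage);
    ScoreDoc[] hits = results.scoreDocs;

    // Inside the loop over the current page, where i is the index of the hit:
    Document doc = searcher.doc(hits[i].doc);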

The resulting Document can then be read to get the content that the query found.
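The demo requests 5 * hitsPerPage results up front and then shows them one page at a time. The slice of hits shown on each page can be computed as in the sketch below; the class and method names are hypothetical, and the arithmetic is my own simplification rather than the demo's exact code.

```java
// Hypothetical helper illustrating the paging arithmetic; not a Lucene class.
class PagingSketch {
    // Returns {start, end} (end exclusive) of the hits shown on a zero-based page.
    static int[] pageBounds(int totalHits, int hitsPerPage, int page) {
        int start = Math.min(totalHits, page * hitsPerPage);
        int end = Math.min(totalHits, start + hitsPerPage);
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int totalHits = 23;
        int hitsPerPage = 10;
        for (int page = 0; page < 3; page++) {
            int[] b = pageBounds(totalHits, hitsPerPage, page);
            System.out.println("page " + page + ": hits " + b[0] + " to " + (b[1] - 1));
        }
    }
}
```

The loop over a page then runs from start to end and loads each document with searcher.doc(hits[i].doc), as in the excerpt above.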

That's all for now. I'm tired!
To be continued.

Original post: http://blog.csdn.net/yichenlian/article/details/40951327
