First Steps with Lucene: Notes on Indexing Large Text Files, Garbled Chinese Text, and QueryParser Searches

Date: 2014-09-12 12:02:13

Tags: analyzer, Lucene full-text search, Lucene garbled Chinese text, Lucene large text, Lucene large-text indexing

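I needed Lucene for a small project over the past few days, so I spent some time learning it. There is still a lot I don't understand, so this is just a summary of the problems I ran into.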

I. Indexing large text files

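By "large text" I really just mean a txt file of around 200 MB, which perhaps shouldn't count as large. Still, indexing a file of that size did run out of memory with java.lang.OutOfMemoryError: Java heap space. I searched online for a long time and tried several suggestions, such as changing the JVM startup parameters, but none of them worked. My test machine has a quad-core i5 and 4 GB of RAM, with a little over 1 GB free during the test, which I would have expected to be enough for a 200 MB file, yet the heap still overflowed. Without yet understanding how Lucene works internally, I came up with three possible fixes: (1) split the text beforehand, preprocessing the large file into smaller ones; (2) index the large file in segments, for example flushing to disk after every 50 MB read; (3) index by line, for example reading 10,000 lines at a time. I think these all come down to the same idea, cutting the text into small pieces, and differ only in how they are implemented.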
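
When changing the JVM parameters seems to make no difference, one thing worth ruling out is that the setting never reached the process that runs the indexer (an IDE run configuration, for instance, has its own VM arguments). A minimal sketch, assuming a hypothetical class name HeapCheck, that prints the heap limit the running JVM actually has:

```java
// Minimal sketch: report the heap limit of the JVM this code runs in.
// HeapCheck is a hypothetical name used only for this illustration.
final class HeapCheck {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

If the printed value does not change after raising -Xmx, the option is probably being set somewhere other than where the program is launched.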

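I didn't test the first option; it seemed like too much trouble, since it needs a separate program to split the files. For the second option, my code created a Document object after reading a certain amount of data and then called setMaxBufferedDocs(n). In my test I stored one Document per 10 MB and called setMaxBufferedDocs(5). According to Lucene's official documentation, a flush to disk is triggered when the in-memory buffer reaches the configured size (I set 200 MB) or the number of buffered docs reaches the configured count (5 here). I figured memory would then hold at most 5*10 = 50 MB; at worst the frequent I/O would slow things down, but it should at least run. In practice, though, one text file (170 MB) still hadn't finished indexing after a very long time. By my reasoning 170 MB should need only 3 or 4 writes, but no result ever appeared. So I gave up on this approach and didn't dig into it any further.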

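Later I tested the third option, and it worked, with decent performance too. The idea is to create one doc for every 10,000 lines. My text is a bit unusual in that 20,000 lines come to only about 1 MB. I then called setMaxBufferedDocs(100), and this did work. Tune the exact numbers to your own environment. I won't paste the code; it just loops over the text line by line and creates a Document object after every 10,000 reads.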
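
The line-batching loop described above can be sketched with the standard library alone. This is not the code from the project: LineBatcher and BatchHandler are hypothetical names, and the handler is where a real indexer would build one Document per batch and pass it to the IndexWriter.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of "read N lines, then hand them off": only one batch is held in
// memory at a time, however large the file is.
final class LineBatcher {

    /** Receives each batch; a real indexer would build one Document here. */
    interface BatchHandler {
        void handle(List<String> lines) throws IOException;
    }

    /** Streams the reader in batches of batchSize lines; returns the batch count. */
    static int forEachBatch(BufferedReader reader, int batchSize,
            BatchHandler handler) throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batch = new ArrayList<String>(batchSize);
        int batches = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            batch.add(line);
            if (batch.size() == batchSize) {
                handler.handle(batch);
                batches++;
                batch = new ArrayList<String>(batchSize); // start a fresh batch
            }
        }
        if (!batch.isEmpty()) { // the last batch is usually shorter
            handler.handle(batch);
            batches++;
        }
        return batches;
    }
}
```

With a batch size of 10,000 this matches the setup described here; a batch size of 1 gives one unit per line.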

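Later, because of business requirements, I changed it to one Document per line with setMaxBufferedDocs(20000), and performance was still fine in practice. I need every line's information at search time, so I have to read by line. For example, one line of my data is "1052307934----huajun7089059----73.63.134.205----安徽省滁州市电信ADSL----2011年7月10日----14:28:24". If a search for "安徽省滁州市" hits this record, I need everything on that line. If I didn't use one Document per line, say 10 lines per Document, there would be too much noise in each hit. I don't know whether Lucene offers a way to extract just the part I need. Otherwise I would have to write that myself and pick the matching line out of the 10, which would defeat the purpose.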
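
Since each record uses "----" as a separator, one option is to split the line into columns before indexing, so that each column could go into its own field. A minimal sketch, assuming six columns per line as in the sample above; RecordSplitter is a hypothetical name, and the source does not say what each column means.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Sketch: split one "----"-delimited record into its columns.
final class RecordSplitter {

    // The separator is quoted so it is matched literally, not as a regex.
    private static final Pattern SEPARATOR = Pattern.compile(Pattern.quote("----"));

    // Assumption: every record has six columns, like the sample line.
    private static final int EXPECTED_COLUMNS = 6;

    static List<String> columns(String line) {
        // Limit -1 keeps trailing empty columns instead of dropping them.
        String[] parts = SEPARATOR.split(line, -1);
        if (parts.length != EXPECTED_COLUMNS) {
            throw new IllegalArgumentException(
                    "Expected " + EXPECTED_COLUMNS + " columns but found " + parts.length);
        }
        return Arrays.asList(parts);
    }
}
```

For the sample line, the fourth column (index 3) is the location text that the example search is meant to match.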

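II. Garbled Chinese text

Garbled Chinese text really does turn up everywhere. If the file encoding, system encoding, and runtime encoding all match, it should not occur. If you know the file's encoding, then generally something like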

reader = new BufferedReader(new InputStreamReader(
							new FileInputStream(file), "gbk"));

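will solve it. My case was more awkward, because some of the source files are GBK and some are UTF-8, so the encoding of each file has to be detected at run time. From what I found, detecting a file's encoding is not that easy, and many of the examples online are not very accurate or stable. Luckily I found one that works for my txt files. Here is the code, shared as-is (not my own work):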

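/**
	 * Guess a file's character encoding from its first two bytes (a partial BOM check).
	 * @param fileName path of the file to inspect
	 * @return UTF-8/Unicode/UTF-16BE/GBK
	 * @throws Exception if the file cannot be opened or read
	 */
	public static String codeStringPlus(String fileName) throws Exception {
		// try-with-resources closes the stream even when reading fails, and
		// avoids the NullPointerException the old finally block threw when
		// the file could not be opened.
		try (BufferedInputStream bin = new BufferedInputStream(
				new FileInputStream(fileName))) {
			// Only the first two bytes are compared; the full UTF-8 BOM is EF BB BF.
			int p = (bin.read() << 8) + bin.read();
			switch (p) {
			case 0xefbb:
				return "UTF-8";
			case 0xfffe:
				// FF FE is the little-endian UTF-16 BOM.
				return "Unicode";
			case 0xfeff:
				return "UTF-16BE";
			default:
				// No recognised BOM: assume GBK. UTF-8 files without a BOM
				// also end up here.
				return "GBK";
			}
		}
	}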

That solved my encoding problem. Thanks to whoever contributed this; it may still have flaws, but it is good enough for my case.
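
One flaw worth noting is that a UTF-8 file saved without a BOM falls through to GBK. A possible refinement, sketched here under the assumption that the only candidates are UTF-8 and GBK, is to test whether the bytes decode as strict UTF-8 and fall back to GBK only when they do not. EncodingSniffer is a hypothetical name, and this is a heuristic rather than a reliable detector: a GBK byte sequence can occasionally also be valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Heuristic sketch: "does it decode as strict UTF-8? If not, assume GBK."
final class EncodingSniffer {

    static String guess(byte[] data) {
        // A full UTF-8 BOM is the three bytes EF BB BF.
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // REPORT makes the decoder throw on invalid input instead of
        // silently substituting replacement characters.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return "UTF-8"; // pure ASCII also lands here, which is harmless
        } catch (CharacterCodingException e) {
            return "GBK";   // assumption: anything that is not UTF-8 is GBK
        }
    }
}
```

For large files only a leading sample needs to be checked, but a sample that ends in the middle of a multi-byte character will be reported as malformed, so the cut point needs care.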

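III. Some thoughts on searching

One question puzzled me for a long time. The scenario is as follows.

Take the text "Hello ，I am Chinese".

After indexing, queries for "hello" and "Chinese" both find it, so why does a query for "Chine" find nothing? Had I written something wrong?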

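Later I worked it out: my queries also go through an Analyzer. As I understand it, Lucene works like this. When the index is built, the text is first tokenized, so the text above is split into "Hello", "I", "am", and "Chinese". (The analyzer may well drop "I" and "am" as low-value words; assume here that they are kept.)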

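At query time I use QueryParser together with an Analyzer. So when I enter Chinese, the keyword is tokenized too, into "Chinese", and searching for that works. When I enter "Chine", the analyzer produces just "Chine", which is not in the index, so of course nothing is found. Making Chine match would take something extra, such as a different kind of query, which I hadn't done, hence my confusion.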
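
The difference between an exact term lookup and a prefix lookup can be shown with a toy model. This is not how Lucene stores its index: ToyTermIndex is a hypothetical class that lowercases and splits text into a sorted set of terms, which is enough to show why "Chine" misses and what a prefix-style query does differently. In Lucene itself, the classic query syntax accepts a trailing wildcard such as Chine*, and there is a PrefixQuery class for the same purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Toy model of a term dictionary: whole tokens only, kept in sorted order.
final class ToyTermIndex {

    private final TreeSet<String> terms = new TreeSet<String>();

    /** Crude stand-in for an analyzer: lowercase, split on non-alphanumerics. */
    void addText(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
    }

    /** Exact lookup: succeeds only if the whole term was indexed. */
    boolean hasTerm(String term) {
        return terms.contains(term.toLowerCase(Locale.ROOT));
    }

    /** Prefix lookup: walks the sorted terms starting at the prefix. */
    List<String> termsWithPrefix(String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<String>();
        for (String term : terms.tailSet(p)) {
            if (!term.startsWith(p)) {
                break; // sorted order: no later term can match
            }
            matches.add(term);
        }
        return matches;
    }
}
```

After indexing the sample sentence, an exact lookup of "Chine" fails because only the whole token "chinese" was stored, while the prefix lookup finds it.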

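IV. Searching other file formats

Lucene doesn't care about the source file's format. In other words, you have to convert documents of different formats to plain text yourself, writing a parser for each format, rather than feeding the files straight into the indexer.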
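
One way to organise the per-format parsers is a small registry keyed by file extension, so the indexing code only ever sees plain text. The sketch below is an assumption about structure, not something from the project: ExtractorRegistry and TextExtractor are hypothetical names, and real extractors for formats such as PDF or Word would need a parsing library (Apache Tika is one existing option).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch: pick a text extractor by file extension before indexing.
final class ExtractorRegistry {

    /** Turns the raw bytes of one file format into plain text. */
    interface TextExtractor {
        String extract(byte[] raw);
    }

    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    String toPlainText(String fileName, byte[] raw) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0
                ? ""
                : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(extension);
        if (extractor == null) {
            // Fail loudly rather than indexing bytes that are not text.
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.extract(raw);
    }
}
```

A txt extractor would simply decode the bytes with the detected charset; other formats plug in behind the same interface.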

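That is my current understanding of Lucene. I'm not sure all of it is right, so treat it as a reference only. I hope readers will offer comments and suggestions so we can learn together.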

V. The code

The environment is Lucene 4.10.0 + MyEclipse 2013 + JDK 1.7 on Windows, with 4 GB of RAM.

1. Building the index (all files here are txt)

/**
 * LuceneTest 
 * com.lucene.sheen.mine  
 */
package com.lucene.sheen.mine;

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

/**
 * @author Sheen 2014-9-10
 * 
 */
public class MyIndex {

	/**
	 * Index every file under resource\data into resource\index.
	 * @param args unused
	 * @throws Exception
	 */
	public static void main(String[] args) throws Exception {
		String docPath = "resource\\data";
		String indexPath = "resource\\index";

		File docFile = new File(docPath);
		if (!docFile.exists() || !docFile.canRead()) {
			System.out.println("The selected folder does not exist or is not readable. Path: "
					+ docFile.getAbsolutePath());
			System.exit(1);
		}
		Date start = new Date();

		Directory indexDir = FSDirectory.open(new File(indexPath));
		Analyzer analyzer = new MMSegAnalyzer();
		IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0,
				analyzer);
		// Flush to disk when the buffer reaches 200 MB or 20,000 documents,
		// whichever comes first.
		iwc.setRAMBufferSizeMB(200).setMaxBufferedDocs(20000);
		// CREATE overwrites any index already in the directory.
		iwc.setOpenMode(OpenMode.CREATE);
		IndexWriter writer = new IndexWriter(indexDir, iwc);

		MemoryMXBean memorymbean = ManagementFactory.getMemoryMXBean();
		MemoryUsage usage = memorymbean.getHeapMemoryUsage();
		System.out.println("INIT HEAP: " + usage.getInit());
		System.out.println("MAX HEAP: " + usage.getMax());
		System.out.println("USE HEAP: " + usage.getUsed());

		try {
			indexDoc(writer, docFile);
		} finally {
			// Close the writer even if indexing fails, so the index lock is released.
			writer.close();
		}
		Date end = new Date();
		seeVMStatus();
		System.out.println("All files indexed. Elapsed: "
				+ (double) (end.getTime() - start.getTime()) / (1000 * 60)
				+ "min");
	}

	static void indexDoc(IndexWriter writer, File file) throws Exception {
		if (file.canRead()) {
			if (file.isDirectory()) {
				File[] files = file.listFiles();
				// listFiles() returns null if the directory cannot be listed.
				if (files != null) {
					for (File thisFile : files) {
						indexDoc(writer, thisFile);
					}
				}
			} else {
				String code = codeString(file.getAbsolutePath());
				System.out.println("********** Indexing file: " + file.getAbsolutePath()
						+ " ********************");
				System.out.println("Character encoding: " + code);
				seeVMStatus();
				// try-with-resources closes the reader even if indexing fails,
				// and cannot hit a null reader the way the old finally block could.
				try (BufferedReader reader = new BufferedReader(new InputStreamReader(
						new FileInputStream(file), code))) {
					Field pathField = new StringField("path", file.getPath(),
							Field.Store.YES);
					String line = null;
					long fileSize = 0;
					while ((line = reader.readLine()) != null) {
						// Count bytes in the file's own encoding rather than the
						// platform default (line terminators are not included).
						fileSize += line.getBytes(code).length;
						// One Document per line, so a hit returns exactly that line.
						Document doc = new Document();
						doc.add(pathField);
						Field textField = new TextField("contents", line,
								Store.YES);
						doc.add(textField);
						writer.addDocument(doc);

					}
					System.out.println("TotalSize:" + fileSize / (1024 * 1024)
							+ "M");
					System.out.println("Indexing finished\n");

				} catch (Exception e) {
					// Log and continue with the next file.
					e.printStackTrace();
				}
			}

		}
	}

	/**
	 * Print the JVM's heap and non-heap memory usage.
	 */
	public static void seeVMStatus() {
		MemoryMXBean memorymbean = ManagementFactory.getMemoryMXBean();
		System.out.println("JVM Full Information:");
		System.out.println("Heap Memory Usage: "
				+ memorymbean.getHeapMemoryUsage());
		System.out.println("Non-Heap Memory Usage: "
				+ memorymbean.getNonHeapMemoryUsage());
	}

	
	/**
	 * Guess a file's character encoding from its first two bytes (a partial BOM check).
	 * @param fileName path of the file to inspect
	 * @return UTF-8/Unicode/UTF-16BE/GBK
	 * @throws Exception if the file cannot be opened or read
	 */
	public static String codeStringPlus(String fileName) throws Exception {
		// try-with-resources closes the stream even when reading fails.
		try (BufferedInputStream bin = new BufferedInputStream(
				new FileInputStream(fileName))) {
			// Only the first two bytes are compared; the full UTF-8 BOM is EF BB BF.
			int p = (bin.read() << 8) + bin.read();
			switch (p) {
			case 0xefbb:
				return "UTF-8";
			case 0xfffe:
				// FF FE is the little-endian UTF-16 BOM.
				return "Unicode";
			case 0xfeff:
				return "UTF-16BE";
			default:
				// No recognised BOM: assume GBK.
				return "GBK";
			}
		}
	}
	/**
	 * Decide whether a file is UTF-8 or GBK from its first two bytes.
	 * @param fileName path of the file to inspect
	 * @return UTF-8/GBK
	 * @throws Exception if the file cannot be opened or read
	 */
	public static String codeString(String fileName) throws Exception {
		// try-with-resources closes the stream even when reading fails.
		try (BufferedInputStream bin = new BufferedInputStream(
				new FileInputStream(fileName))) {
			int p = (bin.read() << 8) + bin.read();
			switch (p) {
			case 0xefbb:
				// First two bytes of the UTF-8 BOM (EF BB BF).
				return "UTF-8";
			default:
				// No BOM: assume GBK. UTF-8 files without a BOM also end up here.
				return "GBK";
			}
		}
	}


}

2. Searching

/**
 * LuceneTest 
 * com.lucene.sheen.mine  
 */
package com.lucene.sheen.mine;

import java.io.File;
import java.io.IOException;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

/**
 * @author Sheen  2014-9-10
 *
 */
public class MySearcher {

	/**
	 * Run one query against the index and print the top 20 hits.
	 * @param args unused
	 * @throws IOException 
	 * @throws ParseException 
	 */
	public static void main(String[] args) throws IOException, ParseException {
		String index = "resource\\index";
		String field = "contents";
		String queryString = "870270291";
		
		IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
		try {
			IndexSearcher searcher = new IndexSearcher(reader);
			
			// Use the same analyzer as at index time so query terms match indexed terms.
			Analyzer analyzer = new MMSegAnalyzer();
			
			QueryParser parser = new QueryParser(field, analyzer);
			Query query = parser.parse(queryString);
			System.out.println("Query: "+query.toString());
			Date start = new Date();
			TopDocs results = searcher.search(query, 20);
			ScoreDoc[] hits = results.scoreDocs;
			for(ScoreDoc sdoc : hits){
				Document doc = searcher.doc(sdoc.doc);
				System.out.println("Result:");
				System.out.println(sdoc.score);
				System.out.println(doc.get("path"));
				// The stored value is already a Java String; re-encoding it with the
				// platform charset and decoding as UTF-8 can garble it.
				System.out.println(doc.get("contents"));
			}
			Date end = new Date();
			System.out.println("Elapsed (ms): "+(end.getTime()-start.getTime()));
		} finally {
			reader.close();
		}
	}

}


Original article: http://blog.csdn.net/sheen1991/article/details/39226113
