Implementing a Custom Analyzer in Lucene (Synonym Search and Highlighting)

Date: 2015-01-28 11:14:12

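Today we implement a simple analyzer, intended for demonstration only. It does the following:

1. Splits text into tokens on spaces, hyphens, and periods;

2. Supports synonym search between hi and hello;

3. Highlights matches for the hi/hello synonyms.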
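The splitting rule in item 1 can be sketched without Lucene. This is only an illustration under stated assumptions: `DelimiterSplitSketch`, `Span`, and `split` are hypothetical names that do not appear in the post's code, and the sketch works on a `String` rather than a `Reader`. It records the start and end offset of every word, since those offsets are what highlighting relies on later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene or of the post's code.
class DelimiterSplitSketch {
    /** A word plus the [start, end) character span it covers in the input. */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    static boolean isDelimiter(char c) {
        return c == ' ' || c == '-' || c == '.';
    }

    static List<Span> split(String input) {
        List<Span> spans = new ArrayList<>();
        int begin = 0;
        for (int i = 0; i <= input.length(); i++) {
            // Treat the end of input like a delimiter so the last word is flushed.
            if (i == input.length() || isDelimiter(input.charAt(i))) {
                if (i > begin) { // skip empty spans between adjacent delimiters
                    spans.add(new Span(input.substring(begin, i), begin, i));
                }
                begin = i + 1;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        for (Span s : split("look hello-on a.b")) {
            System.out.println(s.text + " [" + s.start + "," + s.end + ")");
        }
    }
}
```

Consecutive delimiters produce no empty tokens, which mirrors the `end <= begin` guard in the tokenizer's `addWord` method.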

MyAnalyzer implementation:

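import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;

// Written against the Lucene 4.x API, where createComponents receives the Reader.
public class MyAnalyzer extends Analyzer {
	// 0 = also emit the hi/hello synonym; any other value = emit only the original words
	private final int analyzerType;
	
	public MyAnalyzer(int type) {
		super();
		analyzerType = type;
	}

	@Override
	protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
		// No token filters are chained; the tokenizer does all the work.
		MyTokenizer tokenizer = new MyTokenizer(fieldName, reader, analyzerType);
		return new TokenStreamComponents(tokenizer);
	}
}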

MyTokenizer implementation:

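import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;

public class MyTokenizer extends Tokenizer {
	// One emitted term plus the span of source text it stands for.
	public static class WordUnit{
		final String word;  // term text that is emitted
		final int start;    // start offset in the input
		final int length;   // length of the source text covered, not of the term

		WordUnit(String word, int start, int length){
			this.word = word;
			this.start = start;
			this.length = length;
		}
	}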
	
	private int analyzerType;
	private int endPosition;
	private Iterator<WordUnit> it;
	private ArrayList<WordUnit> words;
	
	private final CharTermAttribute termAtt;
	private final OffsetAttribute offsetAtt;
	
	public MyTokenizer(String fieldName, Reader in, int type) {
		super(in);
		
		it = null;
		endPosition = 0;
		analyzerType = type;
		offsetAtt = addAttribute(OffsetAttribute.class);
		termAtt = addAttribute(CharTermAttribute.class);
		addAttribute(PayloadAttribute.class);
	}	

	@Override
	public boolean incrementToken() throws IOException {
		clearAttributes();
		
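		if(it == null) {
			// First call after reset(): read and split the input once.
			// Only the result of a single read() (at most 1024 chars) is tokenized.
			char[] inputBuf = new char[1024];
			int bufSize = input.read(inputBuf);
			if(bufSize <= 0) return false;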

			int beginIndex = 0;
			int endIndex = 0;
			words = new ArrayList<WordUnit>();
			for(endIndex = 0; endIndex < bufSize; endIndex++) {
				if(inputBuf[endIndex] != '-' && inputBuf[endIndex] != ' ' && inputBuf[endIndex] != '.') continue;

				addWord(inputBuf, beginIndex, endIndex);
				beginIndex = endIndex + 1;
			}
			addWord(inputBuf, beginIndex, endIndex);//add the last
			
			if(words.isEmpty()) return false;
			it = words.iterator();
		}
		
		if(it.hasNext()){
			WordUnit word = it.next();
			termAtt.setEmpty().append(word.word);
			
			// Offsets describe the source text, so a synonym reports the span of the word it stands for.
			endPosition = word.start + word.length;
			offsetAtt.setOffset(word.start, endPosition);

			return true;
		}
			
		return false;
	}

	@Override
	public void reset() throws IOException {
		super.reset();
		
		it = null;
		endPosition = 0;
	}

	@Override
	public final void end() throws IOException {
		super.end(); // let the base class reset its end-of-stream state first
		int finalOffset = correctOffset(this.endPosition);
		offsetAtt.setOffset(finalOffset, finalOffset);
	}
	
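	// Stores the word in inputBuf[begin, end) and, for analyzerType 0, its hi/hello synonym.
	private void addWord(char[] inputBuf, int begin, int end){
		if(end <= begin) return; // empty span between adjacent delimiters

		String word = new String(inputBuf, begin, end - begin);
		words.add(new WordUnit(word, begin, end - begin));
	
		// A synonym keeps the span of the source word: 2 chars for "hi", 5 for "hello".
		if(analyzerType == 0 && word.equals("hi")) words.add(new WordUnit("hello", begin, 2));
		if(analyzerType == 0 && word.equals("hello")) words.add(new WordUnit("hi", begin, 5));
	}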
}

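Analyzer type when indexing: analyzerType=0 (synonyms are emitted);

Analyzer type when searching: analyzerType=1 (only the original words are emitted);

Analyzer type when highlighting: analyzerType=0.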
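Why this combination works can be shown with a toy inverted index in plain Java. This is a rough sketch, not Lucene code: `SynonymIndexSketch`, `analyze`, `index`, and `search` are hypothetical names, and the posting lists are simple sets of document ids. Documents are analyzed with type 0, so hi and hello are both stored for either word; the query is analyzed with type 1, so it is looked up as typed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy index for illustration; it only mimics the analyzerType switch.
class SynonymIndexSketch {
    /** Type 0 adds the hi/hello synonym after each match; other types emit words as-is. */
    static List<String> analyze(String text, int analyzerType) {
        List<String> terms = new ArrayList<>();
        for (String word : text.split("[ .-]+")) {
            if (word.isEmpty()) continue; // produced by a leading delimiter
            terms.add(word);
            if (analyzerType == 0 && word.equals("hi")) terms.add("hello");
            if (analyzerType == 0 && word.equals("hello")) terms.add("hi");
        }
        return terms;
    }

    /** Index time: expand synonyms (type 0) and map every term to its document ids. */
    static Map<String, TreeSet<Integer>> index(List<String> docs) {
        Map<String, TreeSet<Integer>> postings = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : analyze(docs.get(docId), 0)) {
                postings.computeIfAbsent(term, k -> new TreeSet<Integer>()).add(docId);
            }
        }
        return postings;
    }

    /** Search time: no expansion (type 1); the index already holds both forms. */
    static TreeSet<Integer> search(Map<String, TreeSet<Integer>> postings, String query) {
        TreeSet<Integer> hits = new TreeSet<Integer>();
        for (String term : analyze(query, 1)) {
            hits.addAll(postings.getOrDefault(term, new TreeSet<Integer>()));
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, TreeSet<Integer>> postings =
                index(Arrays.asList("look hello on", "I am hi China Chinese"));
        System.out.println(search(postings, "hello"));
    }
}
```

Expanding on one side only is enough for a match. Expanding at index time keeps queries simple, at the cost of storing extra terms.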

Searching for hello produces the following:

Score doc 0 hightlight to: look <em>hello</em> on
Score doc 1 hightlight to: I am <em>hi</em> China Chinese

As the output shows, the document containing hi is returned as well, and hi is highlighted in it.
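The reason hi is wrapped even though the query was hello is that each synonym token carries the offsets of the source word. The following plain-Java sketch imitates that idea. It is not the Lucene Highlighter, and `OffsetHighlightSketch`, `Tok`, `tokenize`, and `highlight` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of offset-based highlighting; not Lucene's Highlighter.
class OffsetHighlightSketch {
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Tokenizes like analyzerType 0: a synonym reuses the span of its source word. */
    static List<Tok> tokenize(String text) {
        List<Tok> toks = new ArrayList<>();
        int begin = 0;
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || " -.".indexOf(text.charAt(i)) >= 0;
            if (!boundary) continue;
            if (i > begin) {
                String word = text.substring(begin, i);
                toks.add(new Tok(word, begin, i));
                if (word.equals("hi")) toks.add(new Tok("hello", begin, i));
                if (word.equals("hello")) toks.add(new Tok("hi", begin, i));
            }
            begin = i + 1;
        }
        return toks;
    }

    /** Wraps the source text of every token whose term equals queryTerm in <em> tags. */
    static String highlight(String text, String queryTerm) {
        StringBuilder out = new StringBuilder();
        int copied = 0; // everything before this index is already in out
        for (Tok t : tokenize(text)) {
            if (!t.term.equals(queryTerm) || t.start < copied) continue;
            out.append(text, copied, t.start)
               .append("<em>").append(text, t.start, t.end).append("</em>");
            copied = t.end;
        }
        return out.append(text, copied, text.length()).toString();
    }

    public static void main(String[] args) {
        // prints: I am <em>hi</em> China Chinese
        System.out.println(highlight("I am hi China Chinese", "hello"));
    }
}
```

The matching term is the injected hello, but the text that gets wrapped is whatever sits at that token's offsets in the original string, which is hi.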

Tags: search, full-text search, lucene

Original article: http://blog.csdn.net/gdutliuyun827/article/details/43226527
