I spent this past week studying Lucene/Solr properly, so today I want to sum things up and write down the important bits, so that I won't be completely lost when I come back to them later.
This article is a brief introduction to the tokenization flow in Lucene's analysis process, together with some simple explanations of the principles behind it. If anything here is off, please point it out; I'd be very grateful!
(1) The main analyzers
WhitespaceAnalyzer, StopAnalyzer, SimpleAnalyzer, and KeywordAnalyzer all share the same parent class, Analyzer, which declares an abstract method called tokenStream:
package org.apache.lucene.analysis;
import java.io.Reader;
import java.io.IOException;
import java.io.Closeable;
import java.lang.reflect.Modifier;
import org.apache.lucene.util.CloseableThreadLocal;
import org.apache.lucene.store.AlreadyClosedException;
import org.apache.lucene.document.Fieldable;
/** An Analyzer builds TokenStreams, which analyze text. It thus represents a
* policy for extracting index terms from text.
* <p>
* Typical implementations first build a Tokenizer, which breaks the stream of
* characters from the Reader into raw Tokens. One or more TokenFilters may
* then be applied to the output of the Tokenizer.
* <p>The {@code Analyzer}-API in Lucene is based on the decorator pattern.
* Therefore all non-abstract subclasses must be final or their {@link #tokenStream}
* and {@link #reusableTokenStream} implementations must be final! This is checked
* when Java assertions are enabled.
*/
public abstract class Analyzer implements Closeable {
    // ... only the key parts of this class are excerpted here
/** Creates a TokenStream which tokenizes all the text in the provided
* Reader. Must be able to handle null field name for
* backward compatibility.
*/
public abstract TokenStream tokenStream(String fieldName, Reader reader);
}
Common Tokenizers (the original post listed them in a figure) include CharTokenizer and its subclasses LetterTokenizer, LowerCaseTokenizer, and WhitespaceTokenizer, as well as KeywordTokenizer and StandardTokenizer.
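To get a feel for how the analyzers built on these tokenizers differ, here is a minimal sketch that runs the same sentence through a few of them and prints the tokens. It assumes a Lucene 3.x classpath; adjust the Version constant to your release.

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        String text = "How are you? Thank YOU!";
        Analyzer[] analyzers = {
            new WhitespaceAnalyzer(Version.LUCENE_36),
            new SimpleAnalyzer(Version.LUCENE_36),
            new StopAnalyzer(Version.LUCENE_36)
        };
        for (Analyzer analyzer : analyzers) {
            TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            System.out.print(analyzer.getClass().getSimpleName() + ": ");
            while (stream.incrementToken()) { // pull tokens one by one
                System.out.print("[" + term + "] ");
            }
            System.out.println();
            stream.close();
        }
    }
}

WhitespaceAnalyzer only splits on whitespace and keeps case and punctuation, SimpleAnalyzer splits on non-letters and lowercases, and StopAnalyzer additionally drops stop words such as "are".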
Let's briefly walk through the flow of SimpleAnalyzer by looking at its source:
public final class SimpleAnalyzer extends ReusableAnalyzerBase {
private final Version matchVersion;
/**
* Creates a new {@link SimpleAnalyzer}
* @param matchVersion Lucene version to match See {@link <a href="#version">above</a>}
*/
public SimpleAnalyzer(Version matchVersion) {
this.matchVersion = matchVersion;
}
/**
* Creates a new {@link SimpleAnalyzer}
* @deprecated use {@link #SimpleAnalyzer(Version)} instead
*/
@Deprecated public SimpleAnalyzer() {
this(Version.LUCENE_30);
}
@Override
protected TokenStreamComponents createComponents(final String fieldName,
final Reader reader) {
return new TokenStreamComponents(new LowerCaseTokenizer(matchVersion, reader)); // overrides the parent's TokenStream creation and passes in a LowerCaseTokenizer; this is why English letters get lowercased
}
}
The LowerCaseTokenizer it creates is itself a TokenStream, and TokenStream is also an abstract class:
public abstract class TokenStream extends AttributeSource implements Closeable {
/**
* Consumers (i.e., {@link IndexWriter}) use this method to advance the stream to
* the next token. Implementing classes must implement this method and update
* the appropriate {@link AttributeImpl}s with the attributes of the next
* token.
* <P>
* The producer must make no assumptions about the attributes after the method
* has been returned: the caller may arbitrarily change it. If the producer
* needs to preserve the state for subsequent calls, it can use
* {@link #captureState} to create a copy of the current attribute state.
* <p>
* This method is called for every token of a document, so an efficient
* implementation is crucial for good performance. To avoid calls to
* {@link #addAttribute(Class)} and {@link #getAttribute(Class)},
* references to all {@link AttributeImpl}s that this stream uses should be
* retrieved during instantiation.
* <p>
* To ensure that filters and consumers know which attributes are available,
* the attributes must be added during instantiation. Filters and consumers
* are not required to check for availability of attributes in
* {@link #incrementToken()}.
*
* @return false for end of stream; true otherwise
*/
public abstract boolean incrementToken() throws IOException;
}
The Javadoc already explains it well: subclasses must implement this method to report whether another token is available, returning a boolean. Below is CharTokenizer's incrementToken method:
@Override
public final boolean incrementToken() throws IOException {
clearAttributes();
if(useOldAPI) // TODO remove this in LUCENE 4.0
return incrementTokenOld();
int length = 0;
int start = -1; // this variable is always initialized
char[] buffer = termAtt.buffer();
while (true) {
if (bufferIndex >= dataLen) {
offset += dataLen;
if(!charUtils.fill(ioBuffer, input)) { // read supplementary char aware with CharacterUtils
dataLen = 0; // so next offset += dataLen won't decrement offset
if (length > 0) {
break;
} else {
finalOffset = correctOffset(offset);
return false;
}
}
dataLen = ioBuffer.getLength();
bufferIndex = 0;
}
// use CharacterUtils here to support < 3.1 UTF-16 code unit behavior if the char based methods are gone
final int c = charUtils.codePointAt(ioBuffer.getBuffer(), bufferIndex);
bufferIndex += Character.charCount(c);
if (isTokenChar(c)) { // if it's a token char
if (length == 0) { // start of token
assert start == -1;
start = offset + bufferIndex - 1;
} else if (length >= buffer.length-1) { // check if a supplementary could run out of bounds
buffer = termAtt.resizeBuffer(2+length); // make sure a supplementary fits in the buffer
}
length += Character.toChars(normalize(c), buffer, length); // buffer it, normalized
if (length >= MAX_WORD_LEN) // buffer overflow! make sure to check for >= surrogate pair could break == test
break;
} else if (length > 0) // at non-Letter w/ chars
break; // return 'em
}
termAtt.setLength(length);
assert start != -1;
offsetAtt.setOffset(correctOffset(start), finalOffset = correctOffset(start+length));
return true;
}
For comparison, here is StopAnalyzer's createComponents, which chains a StopFilter onto the tokenizer:
@Override
protected TokenStreamComponents createComponents(String fieldName,
Reader reader) {
final Tokenizer source = new LowerCaseTokenizer(matchVersion, reader); // the text goes through a LowerCaseTokenizer and then a StopFilter
return new TokenStreamComponents(source, new StopFilter(matchVersion,
source, stopwords));
}
Only after passing through these stages does the output become a finished TokenStream. So which filters are common?
Common filters include StopFilter, LowerCaseFilter, and so on; the stream produced by the Tokenizer turns into the final TokenStream only after running through these filters.
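A filter is itself a TokenStream that decorates another one, which is exactly the decorator pattern the Analyzer Javadoc mentioned. To make that concrete, here is a hypothetical minimal filter of my own (not a Lucene class) that drops tokens shorter than a given length:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical example filter: drops tokens shorter than minLength.
public final class MinLengthFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final int minLength;

    public MinLengthFilter(TokenStream input, int minLength) {
        super(input); // the wrapped stream is stored in the protected field 'input'
        this.minLength = minLength;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Keep pulling tokens from the wrapped stream until one is long enough.
        while (input.incrementToken()) {
            if (termAtt.length() >= minLength) {
                return true;
            }
        }
        return false; // the wrapped stream is exhausted
    }
}

Chaining, say, new MinLengthFilter(new LowerCaseTokenizer(matchVersion, reader), 3) would then drop every one- or two-letter token.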
(2) How token information is stored
Three classes matter here: CharTermAttribute (stores the token text itself), OffsetAttribute (stores the token's start and end character offsets), and PositionIncrementAttribute (stores the position increment between neighboring tokens).
With these three attributes, the exact position of every token in a document can be pinned down. For example, the sentence "how are you thank you" is really recorded in Lucene as a sequence of terms with offsets and position increments (the original figure showed this layout, though its offsets were off: the end offset of "how" should be 3, and so on).
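Here is a rough sketch that prints all three attributes for that sentence (assuming Lucene 3.x; the field name "content" is arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;

public class AttributeDemo {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new SimpleAnalyzer(Version.LUCENE_36)
                .tokenStream("content", new StringReader("how are you thank you"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
        PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // prints e.g. "how [0,3) +1", "are [4,7) +1", "you [8,11) +1", ...
            System.out.println(term + " [" + offset.startOffset() + ","
                    + offset.endOffset() + ") +" + posIncr.getPositionIncrement());
        }
        ts.close();
    }
}

The end offset of "how" is indeed 3, matching the correction above.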
All of these attributes are managed by a class called AttributeSource, which stores this information. It has a static inner class called State, which holds the attribute state of the current stream. Later on we can capture the current state with the following method:
/**
* Captures the state of all Attributes. The return value can be passed to
* {@link #restoreState} to restore the state of this or another AttributeSource.
*/
public State captureState() {
final State state = this.getCurrentState();
return (state == null) ? null : (State) state.clone();
}
Once we can get at the position information of these tokens, we can do a lot of things: inject synonyms (add a token whose offsets match the original and whose position increment puts it at the same position), strip sensitive words, and so on. That wraps up this first summary.
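As a concrete sketch of the synonym idea, here is a toy filter of my own built on captureState/restoreState (the class name and the "hello"/"hi" pair are assumptions for illustration, not Lucene code). It captures the attribute state when it sees "hello", then replays that state with the term replaced and a position increment of 0, so the synonym lands on the same position:

import java.io.IOException;
import java.util.Stack;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

// Toy sketch: whenever "hello" comes through, also emit "hi" at the same position.
public final class SimpleSynonymFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt =
            addAttribute(PositionIncrementAttribute.class);
    private final Stack<AttributeSource.State> pending = new Stack<AttributeSource.State>();

    public SimpleSynonymFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            restoreState(pending.pop());        // replay the captured attributes
            termAtt.setEmpty().append("hi");    // swap in the synonym text
            posIncrAtt.setPositionIncrement(0); // share the original token's position
            return true;
        }
        if (!input.incrementToken()) {
            return false; // underlying stream is done
        }
        if ("hello".equals(termAtt.toString())) {
            pending.push(captureState()); // remember this token's state for the synonym
        }
        return true; // emit the original token first
    }
}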
Please credit http://blog.csdn.net/a837199685/article when reposting.
Original article: http://blog.csdn.net/a837199685/article/details/43449945