
Lucene's Search Process (Index Files)

Date: 2015-08-28 19:19:55


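Broadly speaking, searching means reading the term dictionary and the posting lists (inverted lists) out of the index, merging the posting lists according to the user's query to obtain the set of matching documents, and scoring those documents.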

[Figure: overview of the search process]

 

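The process consists of the following steps:

  1. Open the index: IndexReader reads the index files and opens streams pointing at them.
  2. The user enters a query string.
  3. The query string is converted into a tree of Query objects. (The tree can be inspected with Luke.)
  4. Build the Weight object tree, which computes term weights, i.e. the part of the scoring formula that depends on the query but not on the document (the part marked in red).
  5. Build the Scorer object tree, which is used to compute scores.
  6. While the Scorer tree is being built, its leaf nodes (TermScorer) read the term dictionary and the posting lists from the index.
  7. Build the SumScorer object tree. It is a reorganization of the Scorer tree that makes merging posting lists easier; its leaves are still TermScorer nodes holding the dictionary and posting lists. This step merges the posting lists to obtain the result document set and computes the part of the scoring formula marked in blue for each result document. The summation sign in the formula is not a plain addition: the scores of the sub-queries are summed according to how their posting lists are merged (AND, OR, NOT) to produce the parent query's score.
  8. Return the collected, scored result set to the user.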
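The merge in step 7 can be pictured with a toy example. The sketch below is not Lucene code: the class name PostingMergeSketch, the document ids, and the per-term scores are all made up for illustration, and each posting list is modeled as a map from document id to that sub-query's score. It only shows the idea that the "sum" depends on whether the sub-queries are combined with AND, OR, or NOT.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of merging posting lists; not a Lucene class.
class PostingMergeSketch {

    // AND: keep only documents present in both lists; parent score = sum of both child scores.
    static Map<Integer, Double> and(Map<Integer, Double> a, Map<Integer, Double> b) {
        Map<Integer, Double> out = new TreeMap<>();
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                out.put(e.getKey(), e.getValue() + other);
            }
        }
        return out;
    }

    // OR: keep documents present in either list; only the children that matched contribute.
    static Map<Integer, Double> or(Map<Integer, Double> a, Map<Integer, Double> b) {
        Map<Integer, Double> out = new TreeMap<>(a);
        for (Map.Entry<Integer, Double> e : b.entrySet()) {
            out.merge(e.getKey(), e.getValue(), Double::sum);
        }
        return out;
    }

    // NOT: drop every document that appears in the excluded list; scores are unchanged.
    static Map<Integer, Double> andNot(Map<Integer, Double> a, Map<Integer, Double> excluded) {
        Map<Integer, Double> out = new TreeMap<>(a);
        out.keySet().removeAll(excluded.keySet());
        return out;
    }

    public static void main(String[] args) {
        // Made-up posting lists: document id -> score of that term for the document.
        Map<Integer, Double> termA = new TreeMap<>();
        termA.put(1, 1.0);
        termA.put(2, 1.0);
        Map<Integer, Double> termB = new TreeMap<>();
        termB.put(1, 0.5);
        termB.put(2, 0.5);
        termB.put(4, 0.5);

        System.out.println("A AND B = " + and(termA, termB));
        System.out.println("A OR B  = " + or(termA, termB));
        System.out.println("B NOT A = " + andNot(termB, termA));
    }
}
```

Lucene's real scorers walk sorted posting lists lazily instead of materializing maps; the maps here only keep the example short.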

The Lucene search process in detail:

To trace how Lucene searches the index files, the following files were indexed beforehand:

file01.txt: apple apples cat dog

file02.txt: apple boy cat category

file03.txt: apply dog eat etc

file04.txt: apply cat foods
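To make "term dictionary" and "posting list" concrete for these four files, here is a minimal in-memory sketch. It is not how Lucene stores its index on disk; the class ToyInvertedIndex, the whitespace tokenization, and the use of 1 to 4 as document ids for file01 to file04 are assumptions made only for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy index; Lucene's on-disk format is very different.
class ToyInvertedIndex {
    // Term dictionary: term -> posting list (document ids in increasing order).
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Adds one document. Documents must be added in increasing id order
    // so that every posting list stays sorted.
    void add(int docId, String text) {
        for (String term : text.trim().split("\\s+")) {
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    // Returns the posting list for a term, or an empty list if the term is unknown.
    List<Integer> postingsFor(String term) {
        List<Integer> list = postings.get(term);
        return list == null ? Collections.<Integer>emptyList() : list;
    }

    // Builds the index for the four sample files listed above.
    static ToyInvertedIndex sample() {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "apple apples cat dog");    // file01.txt
        index.add(2, "apple boy cat category");  // file02.txt
        index.add(3, "apply dog eat etc");       // file03.txt
        index.add(4, "apply cat foods");         // file04.txt
        return index;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = sample();
        for (String term : new String[] {"apple", "apply", "cat", "dog"}) {
            System.out.println(term + " -> " + index.postingsFor(term));
        }
    }
}
```

With this layout, a query such as "apple AND cat" amounts to intersecting the two posting lists for "apple" and "cat".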

Open an IndexReader on the index directory

IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));

This ends up calling DirectoryReader.open(Directory, IndexDeletionPolicy, IndexCommit, boolean, int), whose main job is to create a SegmentInfos.FindSegmentsFile object and use it to find all the segments in the index.

Tracing the source:

IndexReader reader = IndexReader.open(indexpath);

|__ the open method

 

 public static IndexReader open(final Directory directory) throws CorruptIndexException, IOException {
      return open(directory, null, null, true, DEFAULT_TERMS_INDEX_DIVISOR);
    }

  

     |__ step into the open() in the return statement; arguments the caller did not supply are filled in with null or default values

     

private static IndexReader open(final Directory directory, final IndexDeletionPolicy deletionPolicy, final IndexCommit commit, final boolean readOnly, int termInfosIndexDivisor) throws CorruptIndexException, IOException {
    return DirectoryReader.open(directory, deletionPolicy, commit, readOnly, termInfosIndexDivisor);
}

  

At this point execution reaches DirectoryReader.open(directory, deletionPolicy, commit, readOnly, termInfosIndexDivisor).

So a call to IndexReader.open(...) ultimately runs DirectoryReader.open(), whose main job is to create a SegmentInfos.FindSegmentsFile object, use it to find all the segments in the index, and open them.

The source is shown below, from DirectoryReader.open() down to SegmentInfos.FindSegmentsFile. This is the method DirectoryReader.open() runs:

static IndexReader open(final Directory directory, final IndexDeletionPolicy deletionPolicy, final IndexCommit commit, final boolean readOnly,
                          final int termInfosIndexDivisor) throws CorruptIndexException, IOException {
    return (IndexReader) new SegmentInfos.FindSegmentsFile(directory) {
      @Override
      protected Object doBody(String segmentFileName) throws CorruptIndexException, IOException {
        SegmentInfos infos = new SegmentInfos();
        infos.read(directory, segmentFileName);
        if (readOnly)
          return new ReadOnlyDirectoryReader(directory, infos, deletionPolicy, termInfosIndexDivisor);
        else
          return new DirectoryReader(directory, infos, deletionPolicy, false, termInfosIndexDivisor);
      }
    }.run(commit);
  }

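SegmentInfos.FindSegmentsFile, instantiated above as new SegmentInfos.FindSegmentsFile(directory), is declared as public abstract static class FindSegmentsFile, a nested abstract class of SegmentInfos. It is a utility class for running code against the current segments file. According to the source comment, this is necessary with lock-less commits: between the time you locate the current segments file name and the time you actually open it, read its contents, or check whether it was modified, the file may already have been deleted because a writer finished a commit.

The abstract class has a run method: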

    public Object run(IndexCommit commit) throws CorruptIndexException, IOException {
      if (commit != null) {
        if (directory != commit.getDirectory())
          throw new IOException("the specified commit does not match the specified Directory");
        return doBody(commit.getSegmentsFileName());
      }

      String segmentFileName = null;
      long lastGen = -1;
      long gen = 0;
      int genLookaheadCount = 0;
      IOException exc = null;
      boolean retry = false;

      int method = 0;

      // Loop until we succeed in calling doBody() without
      // hitting an IOException.  An IOException most likely
      // means a commit was in process and has finished, in
      // the time it took us to load the now-old infos files
      // (and segments files).  It's also possible it's a
      // true error (corrupt index).  To distinguish these,
      // on each retry we must see "forward progress" on
      // which generation we are trying to load.  If we
      // don't, then the original error is real and we throw
      // it.
      
      // We have three methods for determining the current
      // generation.  We try the first two in parallel, and
      // fall back to the third when necessary.

      while(true) {

        if (0 == method) {

          // Method 1: list the directory and use the highest
          // segments_N file.  This method works well as long
          // as there is no stale caching on the directory
          // contents (NOTE: NFS clients often have such stale
          // caching):
          String[] files = null;

          long genA = -1;

          files = directory.listAll();
          
          if (files != null)
            genA = getCurrentSegmentGeneration(files);

          message("directory listing genA=" + genA);

          // Method 2: open segments.gen and read its
          // contents.  Then we take the larger of the two
          // gen's.  This way, if either approach is hitting
          // a stale cache (NFS) we have a better chance of
          // getting the right generation.
          long genB = -1;
          for(int i=0;i<defaultGenFileRetryCount;i++) {
            IndexInput genInput = null;
            try {
              genInput = directory.openInput(IndexFileNames.SEGMENTS_GEN);
            } catch (FileNotFoundException e) {
              message("segments.gen open: FileNotFoundException " + e);
              break;
            } catch (IOException e) {
              message("segments.gen open: IOException " + e);
            }
  
            if (genInput != null) {
              try {
                int version = genInput.readInt();
                if (version == FORMAT_LOCKLESS) {
                  long gen0 = genInput.readLong();
                  long gen1 = genInput.readLong();
                  message("fallback check: " + gen0 + "; " + gen1);
                  if (gen0 == gen1) {
                    // The file is consistent.
                    genB = gen0;
                    break;
                  }
                }
              } catch (IOException err2) {
                // will retry
              } finally {
                genInput.close();
              }
            }
            try {
              Thread.sleep(defaultGenFileRetryPauseMsec);
            } catch (InterruptedException ie) {
              throw new ThreadInterruptedException(ie);
            }
          }

          message(IndexFileNames.SEGMENTS_GEN + " check: genB=" + genB);

          // Pick the larger of the two gen's:
          if (genA > genB)
            gen = genA;
          else
            gen = genB;
          
          if (gen == -1) {
            // Neither approach found a generation
            throw new FileNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files));
          }
        }

        // Third method (fallback if first & second methods
        // are not reliable): since both directory cache and
        // file contents cache seem to be stale, just
        // advance the generation.
        if (1 == method || (0 == method && lastGen == gen && retry)) {

          method = 1;

          if (genLookaheadCount < defaultGenLookaheadCount) {
            gen++;
            genLookaheadCount++;
            message("look ahead increment gen to " + gen);
          }
        }

        if (lastGen == gen) {

          // This means we're about to try the same
          // segments_N last tried.  This is allowed,
          // exactly once, because writer could have been in
          // the process of writing segments_N last time.

          if (retry) {
            // OK, we've tried the same segments_N file
            // twice in a row, so this must be a real
            // error.  We throw the original exception we
            // got.
            throw exc;
          } else {
            retry = true;
          }

        } else if (0 == method) {
          // Segment file has advanced since our last loop, so
          // reset retry:
          retry = false;
        }

        lastGen = gen;

        segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS,
                                                                "",
                                                                gen);

        try {
          Object v = doBody(segmentFileName);
          if (exc != null) {
            message("success on " + segmentFileName);
          }
          return v;
        } catch (IOException err) {

          // Save the original root cause:
          if (exc == null) {
            exc = err;
          }

          message("primary Exception on '" + segmentFileName + "': " + err + "'; will retry: retry=" + retry + "; gen = " + gen);

          if (!retry && gen > 1) {

            // This is our first time trying this segments
            // file (because retry is false), and, there is
            // possibly a segments_(N-1) (because gen > 1).
            // So, check if the segments_(N-1) exists and
            // try it if so:
            String prevSegmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS,
                                                                               "",
                                                                               gen-1);

            final boolean prevExists;
            prevExists = directory.fileExists(prevSegmentFileName);

            if (prevExists) {
              message("fallback to prior segment file '" + prevSegmentFileName + "'");
              try {
                Object v = doBody(prevSegmentFileName);
                if (exc != null) {
                  message("success on fallback " + prevSegmentFileName);
                }
                return v;
              } catch (IOException err2) {
                message("secondary Exception on '" + prevSegmentFileName + "': " + err2 + "'; will retry");
              }
            }
          }
        }
      }
    }

  

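  run() first checks whether a commit was passed in. If so, it verifies that the commit's directory (commit.getDirectory()) is the same Directory the finder was created with, throws an IOException if it is not, and otherwise calls doBody(commit.getSegmentsFileName()) directly. If commit is null, it enters the loop to locate the newest segments_N file itself, retrying when an IOException suggests that a commit was in progress.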
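The shape of run(), an abstract doBody() driven by a retrying caller, can be reduced to a few lines. The sketch below is a deliberately simplified stand-in, not Lucene's implementation: FindSegmentsSketch, tryOpen, and the decimal "segments_<gen>" naming are hypothetical, and only the "fall back to segments_(N-1) once" branch of the real loop is modeled. The directory re-listing, the segments.gen check, and the look-ahead are left out.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, much-simplified stand-in for SegmentInfos.FindSegmentsFile.
abstract class FindSegmentsSketch<T> {

    // Subclasses do the real work against one concrete segments file.
    protected abstract T doBody(String segmentsFileName) throws IOException;

    // Simplified naming for this sketch: the generation is appended in decimal.
    static String fileName(long gen) {
        return "segments_" + gen;
    }

    // Tries segments_<gen>; if that fails and gen > 1, tries segments_<gen-1> once.
    // If both attempts fail, the original exception is rethrown.
    T run(long gen) throws IOException {
        try {
            return doBody(fileName(gen));
        } catch (IOException original) {
            if (gen > 1) {
                try {
                    return doBody(fileName(gen - 1));
                } catch (IOException ignored) {
                    // Fall through and report the original failure.
                }
            }
            throw original;
        }
    }

    // Demo helper: "opening" a file succeeds only if its name is in readableFiles.
    // Returns the name of the file that was opened, or null if none could be.
    static String tryOpen(final Set<String> readableFiles, long gen) {
        FindSegmentsSketch<String> finder = new FindSegmentsSketch<String>() {
            @Override
            protected String doBody(String name) throws IOException {
                if (!readableFiles.contains(name)) {
                    throw new IOException("cannot read " + name);
                }
                return name;
            }
        };
        try {
            return finder.run(gen);
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Set<String> files = new HashSet<>(Arrays.asList("segments_2"));
        System.out.println(tryOpen(files, 3)); // falls back to segments_2
        System.out.println(tryOpen(files, 5)); // neither segments_5 nor segments_4 exists: null
    }
}
```

Keeping the retry policy in run() and the file-specific work in doBody() is what lets DirectoryReader.open() supply only a small anonymous subclass.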

A note on IndexCommit:

public abstract class IndexCommit {

  public abstract String getSegmentsFileName();

  public abstract Collection<String> getFileNames() throws IOException;

  public abstract Directory getDirectory();

  public void delete() {
    throw new UnsupportedOperationException("This IndexCommit does not support this method.");
  }

  public boolean isDeleted() {
    throw new UnsupportedOperationException("This IndexCommit does not support this method.");
  }


  public boolean isOptimized() {
    throw new UnsupportedOperationException("This IndexCommit does not support this method.");
  }


  @Override
  public boolean equals(Object other) {
    if (other instanceof IndexCommit) {
      IndexCommit otherCommit = (IndexCommit) other;
      return otherCommit.getDirectory().equals(getDirectory()) && otherCommit.getVersion() == getVersion();
    } else
      return false;
  }

  @Override
  public int hashCode() {
    return getDirectory().hashCode() + getSegmentsFileName().hashCode();
  }

  public long getVersion() {
    throw new UnsupportedOperationException("This IndexCommit does not support this method.");
  }

  public long getGeneration() {
    throw new UnsupportedOperationException("This IndexCommit does not support this method.");
  }

  public long getTimestamp() throws IOException {
    return getDirectory().fileModified(getSegmentsFileName());
  }

  public Map<String,String> getUserData() throws IOException {
    throw new UnsupportedOperationException("This IndexCommit does not support this method.");
  }
}

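IndexCommit is essentially a value object: getSegmentsFileName() and getDirectory() return the segments file name and the directory of a commit point.

Back in DirectoryReader.open(): it creates an anonymous subclass, new SegmentInfos.FindSegmentsFile(directory) { doBody(...) { ... } }, which implements doBody, and then calls run(commit). run() passes the segments file name it settles on to doBody().

The main steps for finding the segment information are:

Find the newest segments_N

  • segments_N holds the metadata for the whole index, so picking the correct segments_N is critical.
  • The index may be deployed in a distributed system and be visible from several machines, so the choice has to be made safely.
  • First, list the segments_N files in the directory and take the largest N; call it genA.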

 

String[] files = directory.listAll();

long genA = getCurrentSegmentGeneration(files);

 

 

long getCurrentSegmentGeneration(String[] files) {
    long max = -1;
    for (int i = 0; i < files.length; i++) {
        String file = files[i];
        if (file.startsWith(IndexFileNames.SEGMENTS) // "segments_N"
            && !file.equals(IndexFileNames.SEGMENTS_GEN)) { // "segments.gen"
            long gen = generationFromSegmentsFileName(file);
            if (gen > max) {
                max = gen;
            }
        }
    }
    return max;
}

 

  Second, open segments.gen and read genB from it. The larger of genA and genB becomes gen, and that gen is used to build the name of the segments_N file to open.
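The selection of gen and the resulting file name can be sketched without Lucene on the classpath. The class SegmentsGenerationSketch and its pick method are hypothetical. The base-36 encoding of the generation (Character.MAX_RADIX) and the treatment of the bare name "segments" as generation 0 reflect my understanding of IndexFileNames and SegmentInfos in Lucene of this era, so treat them as assumptions rather than a copy of the library code.

```java
// Hypothetical re-creation of the "pick the newest segments_N" step.
class SegmentsGenerationSketch {
    static final String SEGMENTS = "segments";
    static final String SEGMENTS_GEN = "segments.gen";

    // Parses N out of "segments_N"; the bare name "segments" counts as generation 0.
    // Assumption: N is written in base 36, as in Lucene's own helper.
    static long generationFromName(String name) {
        if (name.equals(SEGMENTS)) {
            return 0;
        }
        return Long.parseLong(name.substring(SEGMENTS.length() + 1), Character.MAX_RADIX);
    }

    // genA: the largest generation among the segments_N files in a directory listing.
    static long currentGeneration(String[] files) {
        long max = -1;
        for (String file : files) {
            if (file.startsWith(SEGMENTS) && !file.equals(SEGMENTS_GEN)) {
                max = Math.max(max, generationFromName(file));
            }
        }
        return max;
    }

    // Builds "segments_N" for a generation, or null if no generation was found.
    static String fileNameFromGeneration(long gen) {
        if (gen == -1) {
            return null;
        }
        if (gen == 0) {
            return SEGMENTS;
        }
        return SEGMENTS + "_" + Long.toString(gen, Character.MAX_RADIX);
    }

    // Takes the larger of genA (from the listing) and genB (as read from segments.gen).
    static String pick(String[] files, long genB) {
        long genA = currentGeneration(files);
        return fileNameFromGeneration(Math.max(genA, genB));
    }

    public static void main(String[] args) {
        String[] listing = {"_0.cfs", "segments_2", "segments_a", "segments.gen"};
        System.out.println(currentGeneration(listing)); // "a" in base 36 is 10
        System.out.println(pick(listing, 11));          // genB wins: segments_b
        System.out.println(pick(listing, -1));          // genA wins: segments_a
    }
}
```

Taking the maximum of the two sources means a stale directory listing or a stale segments.gen alone cannot make the reader open an older commit than the freshest one it has evidence for.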

 

 

 

 

 

 

 

 

 

 


Original article: http://www.cnblogs.com/mggwct/p/4767177.html
