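Each index corresponds to one directory, and all of its index files are stored in that directory. Solr keeps its index files in the core/data/index directory under Solr Home, and each core corresponds to one index.
segments_N lists all the valid segments of the index together with their deletion information. An index can contain several segments_N files, but the valid one is usually the one with the largest N. Several segments_N files can coexist mainly because the older ones cannot be deleted yet, because an IndexWriter is in the middle of a commit, or because an IndexDeletionPolicy is in effect. The code that handles segments_N lives mainly in SegmentInfos.java.
How the segments_N file to read is chosen: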
    String[] files = null;
    long genA = -1;
    // genA: the largest generation found by listing the directory
    files = directory.listAll();
    if (files != null) {
      genA = getLastCommitGeneration(files);
    }

    ...
    public static long getLastCommitGeneration(String[] files) {
      if (files == null) {
        return -1;
      }
      long max = -1;
      for (String file : files) {
        // Consider every segments_N file, but skip segments.gen
        if (file.startsWith(IndexFileNames.SEGMENTS) && !file.equals(IndexFileNames.SEGMENTS_GEN)) {
          long gen = generationFromSegmentsFileName(file);
          if (gen > max) {
            max = gen;
          }
        }
      }
      return max;
    }
    long genB = -1;
    // genB: the generation recorded in segments.gen
    ChecksumIndexInput genInput = null;
    try {
      genInput = directory.openChecksumInput(IndexFileNames.SEGMENTS_GEN, IOContext.READONCE);
    } catch (IOException e) {
    ...
    int version = genInput.readInt();
    long gen0 = genInput.readLong();
    long gen1 = genInput.readLong();
    // The generation is stored twice; use it only when both copies agree
    if (gen0 == gen1) {
      genB = gen0;
    }
The larger of genA and genB obtained above is taken as the current N, and only then is the segments_N file opened.
    gen = Math.max(genA, genB);
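The selection logic above can be tried out without Lucene on the classpath. The following is a minimal sketch under a few assumptions: the class and method names are made up for illustration, the file names in main are sample values, and the generation suffix is parsed in base 36 (Character.MAX_RADIX), which is how Lucene encodes N in segments_N.

```java
final class SegmentsGenerationSketch {
    static final String SEGMENTS = "segments";
    static final String SEGMENTS_GEN = "segments.gen";

    /** Extracts N from "segments_N"; the suffix is a base-36 number. */
    static long generationFromFileName(String fileName) {
        if (fileName.equals(SEGMENTS)) {
            return 0; // legacy name without a suffix
        }
        if (fileName.startsWith(SEGMENTS + "_")) {
            String suffix = fileName.substring(SEGMENTS.length() + 1);
            return Long.parseLong(suffix, Character.MAX_RADIX);
        }
        throw new IllegalArgumentException("not a segments file: " + fileName);
    }

    /** genA: the largest generation among the listed file names, or -1. */
    static long lastCommitGeneration(String[] files) {
        long max = -1;
        if (files == null) {
            return max;
        }
        for (String file : files) {
            if (file.startsWith(SEGMENTS) && !file.equals(SEGMENTS_GEN)) {
                max = Math.max(max, generationFromFileName(file));
            }
        }
        return max;
    }

    /** Combines genA with genB, where genB is trusted only if both copies in segments.gen agree. */
    static long chooseGeneration(String[] files, long gen0, long gen1) {
        long genA = lastCommitGeneration(files);
        long genB = (gen0 == gen1) ? gen0 : -1;
        return Math.max(genA, genB);
    }

    public static void main(String[] args) {
        String[] files = {"_0.cfs", "segments_2", "segments_a", "segments.gen"};
        // "segments_a" is generation 10 in base 36, which beats the 9 stored in segments.gen
        System.out.println(chooseGeneration(files, 9, 9));
    }
}
```

Taking the maximum of the two sources means that a stale directory listing or a stale segments.gen alone does not hide a newer commit.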
Structure of the segments_N file:
Header, Version, NameCounter, SegCount, <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>SegCount, CommitUserData, Footer
Here <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles> describes a single segment and SegCount is the number of segments, so
<SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>SegCount means SegCount such entries written one after another.
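Header is a CodecHeader made up of three parts: Magic, CodecName and Version.
Magic is a start marker, normally 1071082519 (0x3fd76c17).
CodecName is the identifier of the file.
Version is the format version of the index file. When an IndexReader of one version reads an index generated by another version, a value outside the range the reader supports causes an error.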
    public static int checkHeader(DataInput in, String codec, int minVersion, int maxVersion)
        throws IOException {

      // Safety to guard against reading a bogus string:
      final int actualHeader = in.readInt(); // read Magic
      if (actualHeader != CODEC_MAGIC) {
        throw new CorruptIndexException("codec header mismatch: actual header=" + actualHeader + " vs expected header=" + CODEC_MAGIC + " (resource: " + in + ")");
      }
      return checkHeaderNoMagic(in, codec, minVersion, maxVersion); // read CodecName and Version and validate them
    }
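To make the three header parts concrete, here is a small stand-alone sketch that writes and checks a header of the same shape using only java.io. It is an illustration, not Lucene's implementation: the class name and the exception messages are made up, and strings are written with writeUTF as a simplification, whereas Lucene uses its own string encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class CodecHeaderSketch {
    static final int CODEC_MAGIC = 0x3fd76c17; // 1071082519

    /** Writes Magic, CodecName and Version in that order. */
    static byte[] writeHeader(String codec, int version) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(CODEC_MAGIC);
        out.writeUTF(codec);   // simplification: not Lucene's string encoding
        out.writeInt(version);
        out.flush();
        return bytes.toByteArray();
    }

    /** Validates all three parts and returns the version that was read. */
    static int checkHeader(byte[] data, String codec, int minVersion, int maxVersion)
            throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int magic = in.readInt();
        if (magic != CODEC_MAGIC) {
            throw new IOException("codec header mismatch: actual header=" + magic);
        }
        String actualCodec = in.readUTF();
        if (!actualCodec.equals(codec)) {
            throw new IOException("codec mismatch: actual=" + actualCodec + " expected=" + codec);
        }
        int version = in.readInt();
        if (version < minVersion || version > maxVersion) {
            throw new IOException("unsupported format version: " + version);
        }
        return version;
    }

    public static void main(String[] args) throws IOException {
        byte[] header = writeHeader("segments", 3);
        System.out.println(checkHeader(header, "segments", 0, 3));
    }
}
```

Checking the magic number first is cheap and rejects files that are not index files at all before any string is decoded.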
    public boolean isCurrent() throws IOException {
      ensureOpen();
      if (writer == null || writer.isClosed()) {
        // Fully read the segments file: this ensures that it's
        // completely written so that if
        // IndexWriter.prepareCommit has been called (but not
        // yet commit), then the reader will still see itself as
        // current:
        SegmentInfos sis = new SegmentInfos();
        sis.read(directory);

        // we loaded SegmentInfos from the directory
        return sis.getVersion() == segmentInfos.getVersion();
      } else {
        return writer.nrtIsCurrent(segmentInfos);
      }
    }
The read() function can be followed side by side with the segments_N format:
    public final void read(Directory directory, String segmentFileName) throws IOException {
      boolean success = false;

      // Clear any previous segments:
      this.clear();
      // Get the current generation, i.e. the N of segments_N
      generation = generationFromSegmentsFileName(segmentFileName);

      lastGeneration = generation;
      // Open the file through a checksumming input
      ChecksumIndexInput input = directory.openChecksumInput(segmentFileName, IOContext.READ);
      try {
        // Read the Magic of the Header, normally 1071082519; the Header consists of Magic, CodecName and Version
        final int format = input.readInt();
        final int actualFormat;
        if (format == CodecUtil.CODEC_MAGIC) {
          // 4.0+: check the CodecName and format version of the Header
          actualFormat = CodecUtil.checkHeaderNoMagic(input, "segments", VERSION_40, VERSION_48);
          version = input.readLong(); // Version field that follows the Header
          counter = input.readInt(); // NameCounter, used to name the next new segment
          int numSegments = input.readInt(); // SegCount
          if (numSegments < 0) {
            throw new CorruptIndexException("invalid segment count: " + numSegments + " (resource: " + input + ")");
          }
          // Iterate over the SegCount segment entries
          for(int seg=0;seg<numSegments;seg++) {
            String segName = input.readString(); // SegName
            Codec codec = Codec.forName(input.readString()); // SegCodec
            //System.out.println("SIS.read seg=" + seg + " codec=" + codec);
            SegmentInfo info = codec.segmentInfoFormat().getSegmentInfoReader().read(directory, segName, IOContext.READ);
            info.setCodec(codec);
            long delGen = input.readLong(); // DelGen
            int delCount = input.readInt(); // DeletionCount
            if (delCount < 0 || delCount > info.getDocCount()) {
              throw new CorruptIndexException("invalid deletion count: " + delCount + " vs docCount=" + info.getDocCount() + " (resource: " + input + ")");
            }
            long fieldInfosGen = -1;
            if (actualFormat >= VERSION_46) {
              fieldInfosGen = input.readLong(); // FieldInfosGen
            }
            SegmentCommitInfo siPerCommit = new SegmentCommitInfo(info, delCount, delGen, fieldInfosGen);
            if (actualFormat >= VERSION_46) {
              // UpdatesFiles: first read the number of entries; if it is 0 there are no update files,
              // otherwise read all numGensUpdatesFiles entries and store them in the SegmentCommitInfo.
              int numGensUpdatesFiles = input.readInt();
              final Map<Long,Set<String>> genUpdatesFiles;
              if (numGensUpdatesFiles == 0) {
                genUpdatesFiles = Collections.emptyMap();
              } else {
                genUpdatesFiles = new HashMap<>(numGensUpdatesFiles);
                for (int i = 0; i < numGensUpdatesFiles; i++) {
                  genUpdatesFiles.put(input.readLong(), input.readStringSet());
                }
              }
              siPerCommit.setGenUpdatesFiles(genUpdatesFiles);
            }
            add(siPerCommit);
          }
          userData = input.readStringStringMap(); // CommitUserData
        } else {
          actualFormat = -1;
          Lucene3xSegmentInfoReader.readLegacyInfos(this, directory, input, format);
          Codec codec = Codec.forName("Lucene3x");
          for (SegmentCommitInfo info : this) {
            info.info.setCodec(codec);
          }
        }
        // Footer
        if (actualFormat >= VERSION_48) {
          CodecUtil.checkFooter(input);
        } else {
          final long checksumNow = input.getChecksum();
          final long checksumThen = input.readLong();
          if (checksumNow != checksumThen) {
            throw new CorruptIndexException("checksum mismatch in segments file (resource: " + input + ")");
          }
          CodecUtil.checkEOF(input);
        }

        success = true;
      } finally {
        if (!success) {
          // Clear any segment infos we had loaded so we
          // have a clean slate on retry:
          this.clear();
          IOUtils.closeWhileHandlingException(input);
        } else {
          input.close();
        }
      }
    }
Comparing read() with write() shows that write is essentially the inverse of read.
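The symmetry between the two directions is easy to see in a reduced round trip. The sketch below writes and reads back only the body fields (Version, NameCounter, SegCount and the per-segment entries) with java.io streams. It leaves out the Header, CommitUserData and Footer, uses writeUTF instead of Lucene's string encoding, and all class names and sample values are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SegmentsRoundTripSketch {
    /** One per-segment entry, mirroring the fields in the format listing. */
    static final class Seg {
        final String name;
        final String codec;
        final long delGen;
        final int delCount;
        final long fieldInfosGen;
        final Map<Long, Set<String>> updatesFiles;

        Seg(String name, String codec, long delGen, int delCount, long fieldInfosGen,
            Map<Long, Set<String>> updatesFiles) {
            this.name = name;
            this.codec = codec;
            this.delGen = delGen;
            this.delCount = delCount;
            this.fieldInfosGen = fieldInfosGen;
            this.updatesFiles = updatesFiles;
        }
    }

    static byte[] write(long version, int counter, List<Seg> segs) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeLong(version);       // Version
        out.writeInt(counter);        // NameCounter
        out.writeInt(segs.size());    // SegCount
        for (Seg s : segs) {
            out.writeUTF(s.name);
            out.writeUTF(s.codec);
            out.writeLong(s.delGen);
            out.writeInt(s.delCount);
            out.writeLong(s.fieldInfosGen);
            out.writeInt(s.updatesFiles.size());
            for (Map.Entry<Long, Set<String>> e : s.updatesFiles.entrySet()) {
                out.writeLong(e.getKey());
                out.writeInt(e.getValue().size());
                for (String file : e.getValue()) {
                    out.writeUTF(file);
                }
            }
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads the fields back in exactly the order write() emitted them. */
    static List<Seg> read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        in.readLong(); // Version, not needed by this sketch
        in.readInt();  // NameCounter, not needed by this sketch
        int numSegments = in.readInt();
        if (numSegments < 0) {
            throw new IOException("invalid segment count: " + numSegments);
        }
        List<Seg> segs = new ArrayList<>();
        for (int i = 0; i < numSegments; i++) {
            String name = in.readUTF();
            String codec = in.readUTF();
            long delGen = in.readLong();
            int delCount = in.readInt();
            long fieldInfosGen = in.readLong();
            int numGens = in.readInt();
            Map<Long, Set<String>> updates = new HashMap<>();
            for (int g = 0; g < numGens; g++) {
                long gen = in.readLong();
                int numFiles = in.readInt();
                Set<String> files = new HashSet<>();
                for (int f = 0; f < numFiles; f++) {
                    files.add(in.readUTF());
                }
                updates.put(gen, files);
            }
            segs.add(new Seg(name, codec, delGen, delCount, fieldInfosGen, updates));
        }
        return segs;
    }

    public static void main(String[] args) throws IOException {
        Map<Long, Set<String>> updates = new HashMap<>();
        updates.put(1L, Collections.singleton("_0_1.fnm")); // sample file name
        Seg seg = new Seg("_0", "Lucene46", 2L, 3, 1L, updates);
        List<Seg> back = read(write(7L, 1, Collections.singletonList(seg)));
        System.out.println(back.get(0).name + " " + back.get(0).updatesFiles);
    }
}
```

Because the format carries no field tags, the reader depends entirely on consuming fields in the writer's order, which is why the real read() and write() mirror each other line by line.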
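The first thing the read path has to guarantee is that the segment information it reads is the latest. read() keeps looping and trying to read the latest SegmentInfos; if an IOException occurs, a commit is probably in progress, so the segment information obtained at that moment is not the latest. Lucene provides three ways of trying to obtain the latest segment information:
1. First, find the largest gen (generation) as described earlier. After two attempts, if the largest gen is greater than lastGen, the segment information has been updated; otherwise either nothing has changed or this method is not applicable, so the second method is tried.
2. If the first method fails, simply do gen++, that is, go straight to parsing the segments_N file of the next gen.
3. If parsing fails, fall back with gen-- and try to parse the segments_N file of that gen, on the assumption that the segment information has not actually been updated.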
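The three strategies above can be put together in a small loop. This is only a sketch of the control flow, not the actual Lucene code: the names findReadableGeneration, latestGen, canRead and MAX_LOOKAHEAD are hypothetical, the lookahead limit of 10 is an assumed value, and reading a file is reduced to a predicate that says whether a given generation can be parsed.

```java
import java.util.function.LongPredicate;
import java.util.function.LongSupplier;

final class FindSegmentsSketch {
    static final int MAX_LOOKAHEAD = 10; // assumed limit for the gen++ strategy

    /**
     * latestGen: returns max(genA, genB) as currently visible, or -1 if nothing is found.
     * canRead:   returns true if segments_N for that generation parses without error.
     */
    static long findReadableGeneration(LongSupplier latestGen, LongPredicate canRead) {
        long lastGen = -1;
        long gen = -1;
        int retryCount = 0;
        int lookahead = 0;
        boolean useFirstMethod = true;
        while (true) {
            if (useFirstMethod) {
                // Method 1: ask the directory listing and segments.gen for the largest gen
                gen = latestGen.getAsLong();
                if (gen == -1) {
                    throw new IllegalStateException("no segments file found");
                }
            }
            if (useFirstMethod && gen == lastGen && retryCount >= 2) {
                // The same gen keeps coming back and keeps failing: give up on method 1
                useFirstMethod = false;
            }
            if (!useFirstMethod) {
                // Method 2: assume the visible gen is stale and probe the next one
                if (lookahead >= MAX_LOOKAHEAD) {
                    throw new IllegalStateException("gave up after " + lookahead + " lookahead attempts");
                }
                gen++;
                lookahead++;
            } else if (gen == lastGen) {
                retryCount++;
            } else {
                retryCount = 0;
            }
            lastGen = gen;
            if (canRead.test(gen)) {
                return gen;
            }
            // Method 3: on the second failure for the same gen, try the previous commit
            if (gen > 1 && useFirstMethod && retryCount == 1 && canRead.test(gen - 1)) {
                return gen - 1;
            }
        }
    }

    public static void main(String[] args) {
        // segments_5 is visible but still being written; segments_4 is intact
        System.out.println(findReadableGeneration(() -> 5L, g -> g == 4L));
    }
}
```

In the example in main, the loop fails twice on generation 5 and then falls back to generation 4, which corresponds to the case where a commit was in progress and the previous commit is still the valid one.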
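The write path is the reverse of read(); the main points of interest here are SegmentCommitInfo and genUpdatesFiles.
First, the call chain of write: a commit is split into two parts, prepareCommit and finishCommit. prepareCommit calls write, which writes the information about the newly created segments into segments_N; finishCommit then performs the actual commit and rolls back if it fails. After a successful commit, the gen is written into segments.gen.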
    final void prepareCommit(Directory dir) throws IOException {
      if (pendingSegnOutput != null) {
        throw new IllegalStateException("prepareCommit was already called");
      }
      write(dir);
    }
    for (SegmentCommitInfo siPerCommit : this) {
      SegmentInfo si = siPerCommit.info;
      segnOutput.writeString(si.name);                      // SegName
      segnOutput.writeString(si.getCodec().getName());      // SegCodec
      segnOutput.writeLong(siPerCommit.getDelGen());        // DelGen
      int delCount = siPerCommit.getDelCount();
      if (delCount < 0 || delCount > si.getDocCount()) {
        throw new IllegalStateException("cannot write segment: invalid docCount segment=" + si.name + " docCount=" + si.getDocCount() + " delCount=" + delCount);
      }
      segnOutput.writeInt(delCount);                        // DeletionCount
      segnOutput.writeLong(siPerCommit.getFieldInfosGen()); // FieldInfosGen
      // UpdatesFiles: the number of entries, then each generation with its set of files
      final Map<Long,Set<String>> genUpdatesFiles = siPerCommit.getUpdatesFiles();
      segnOutput.writeInt(genUpdatesFiles.size());
      for (Entry<Long,Set<String>> e : genUpdatesFiles.entrySet()) {
        segnOutput.writeLong(e.getKey());
        segnOutput.writeStringSet(e.getValue());
      }
      ...
    }
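The prepareCommit/finishCommit split described earlier can be illustrated with plain files. The sketch below is a simplified model under stated assumptions, not Lucene's implementation: the class name is made up, the body is a single string, finishing a commit is modeled as appending a checksum and closing the file, and GEN_FILE_VERSION is a placeholder rather than Lucene's real constant. It does follow the source in two respects: segments.gen is written only after a successful commit, and it holds a version followed by the generation stored twice.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

final class TwoPhaseCommitSketch {
    static final int GEN_FILE_VERSION = 1; // placeholder, not Lucene's real constant

    private final Path dir;
    private long generation = 0;            // last successfully committed generation
    private Path pendingFile;               // segments_N created by prepareCommit
    private CheckedOutputStream pendingSum; // tracks the checksum of the pending bytes
    private DataOutputStream pendingOut;    // non-null between prepare and finish/rollback

    TwoPhaseCommitSketch(Path dir) {
        this.dir = dir;
    }

    long generation() {
        return generation;
    }

    /** Phase 1: write the new segments_N but leave it unfinished. */
    void prepareCommit(String body) throws IOException {
        if (pendingOut != null) {
            throw new IllegalStateException("prepareCommit was already called");
        }
        String name = "segments_" + Long.toString(generation + 1, Character.MAX_RADIX);
        pendingFile = dir.resolve(name);
        pendingSum = new CheckedOutputStream(Files.newOutputStream(pendingFile), new CRC32());
        pendingOut = new DataOutputStream(pendingSum);
        pendingOut.writeUTF(body);
        pendingOut.flush();
    }

    /** Phase 2: complete the file, then record the generation in segments.gen. */
    void finishCommit() throws IOException {
        if (pendingOut == null) {
            throw new IllegalStateException("prepareCommit was not called");
        }
        try {
            long checksum = pendingSum.getChecksum().getValue();
            pendingOut.writeLong(checksum);
            pendingOut.close();
        } catch (IOException e) {
            rollbackCommit(); // undo the half-written commit
            throw e;
        }
        pendingOut = null;
        generation++;
        Path genFile = dir.resolve("segments.gen");
        try (DataOutputStream gen = new DataOutputStream(Files.newOutputStream(genFile))) {
            gen.writeInt(GEN_FILE_VERSION);
            gen.writeLong(generation); // stored twice so a reader can detect a torn write
            gen.writeLong(generation);
        }
    }

    /** Abandons a prepared commit and removes its segments_N file. */
    void rollbackCommit() throws IOException {
        if (pendingOut == null) {
            return;
        }
        try {
            pendingOut.close();
        } finally {
            pendingOut = null;
            Files.deleteIfExists(pendingFile);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("segments-sketch");
        TwoPhaseCommitSketch commits = new TwoPhaseCommitSketch(dir);
        commits.prepareCommit("first commit");
        commits.finishCommit();
        commits.prepareCommit("abandoned commit");
        commits.rollbackCommit();
        System.out.println(commits.generation());
    }
}
```

Splitting the work this way keeps the expensive writing in the first phase, so the second phase is short and a failure at any point leaves the previous generation as the valid commit.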
To be continued.
Solr 4.8.0 Source Code Analysis (9): Lucene Index Files (2)
Original article: http://www.cnblogs.com/rcfeng/p/3976135.html