Solr in Action Notes (1): Key Solr Concepts

Date: 2014-10-31 01:14:42


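Preface: I got a lot out of reading Solr in Action. There is no Chinese edition, so I had to read the English original, which was a bit tiring. It is the first time I have read a whole book in English, so I am treating it as English practice and taking notes along the way.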

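1. Solr Architecture

[Figure: Solr architecture diagram]

Judging from this diagram, Solr's components are comprehensive and clearly laid out. Once you start reading the Solr source code, though, you find it is far more tiring to follow than the diagram suggests.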

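2. Solr Query Capabilities

[Figure: examples of Solr query components]

The two figures above illustrate several of Solr's query features and search components, such as faceting, More Like This, highlighting, spatial search, and the spellcheck component.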

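3. Solr Optimization Tips

(1) Solr supports multiple cores. You can split cores by time, for example a history core for older data and a current core for recent data, or split them by usage category.

(2) One way to improve concurrent query performance is to increase the number of shards and replicas. SolrCloud dispatches a query by walking the shards listed in clusterstate.json, so with more shards and replicas, concurrent queries are spread across more nodes.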

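4. Lucene's Inverted Index Structure

The table below helps build intuition for the inverted index. Given a term, we can quickly locate the documents that contain it, and then from each document quickly retrieve the contents of all its stored fields.

[Table: inverted index mapping terms to documents]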
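
To make the term-to-document lookup concrete, here is a minimal sketch of an inverted index in Java. It is not Lucene's actual data structure: the class name InvertedIndexSketch, its methods, and the sample documents are all made up for illustration, and tokenization is just lowercasing plus whitespace splitting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch for illustration only; Lucene's real index is far more compact.
class InvertedIndexSketch {
    // term -> ascending list of IDs of the documents that contain the term
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    // doc ID -> original text, playing the role of a stored field
    private final Map<Integer, String> stored = new TreeMap<>();

    // Assumption: documents are added in increasing ID order, so each list stays sorted.
    void add(int docId, String text) {
        stored.put(docId, text);
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each document once per term, even if the term repeats in the text.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    // Step 1: term -> documents containing it.
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }

    // Step 2: document -> stored content.
    String storedText(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "new home for sale");
        index.add(2, "home repair tips");
        index.add(3, "new car");
        for (int docId : index.lookup("new")) {
            System.out.println(docId + ": " + index.storedText(docId));
        }
    }
}
```

The two maps mirror the two steps in the paragraph above: first the term leads to document IDs, then each ID leads to the stored content.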

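5. Query Types

5.1 BooleanQuery

BooleanQuery is one of the most basic query types. It combines one or more terms with AND, OR, and NOT, and comes in the forms described below. Assume the following inverted index:

[Table: example inverted index]

Running a term query for new and for home returns their respective document sets, shown below. The following subsections combine the two terms using the different boolean forms.

[Figure: documents matching new and documents matching home]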

5.1.1 REQUIRED TERMS

+new +house
new AND house

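The two queries above are logically equivalent, but they differ in form: + is a unary operator applied to a single term, while AND is a binary operator that joins two terms.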

5.1.2 OPTIONAL TERMS

new house
new OR house

The first query is equivalent to the second because the default operator is OR.

5.1.3 NEGATED TERMS

new house -rental
new house NOT rental
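
Logically, the three forms above boil down to set operations on the document sets of each term. The following Java sketch shows that idea only; it is not how Lucene evaluates the query, and the class name BooleanSetsSketch and the document IDs are invented for the example.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: boolean operators as set operations on document IDs.
class BooleanSetsSketch {
    // new AND house: documents present in both sets.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // new OR house: documents present in either set.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // ... NOT rental: documents in the first set that are absent from the second.
    static Set<Integer> not(Set<Integer> a, Set<Integer> excluded) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(excluded);
        return result;
    }

    public static void main(String[] args) {
        // Made-up document sets for the three terms.
        Set<Integer> newDocs = Set.of(1, 5, 8);
        Set<Integer> houseDocs = Set.of(5, 7, 8);
        Set<Integer> rentalDocs = Set.of(8);

        System.out.println(and(newDocs, houseDocs));                     // [5, 8]
        System.out.println(or(newDocs, houseDocs));                      // [1, 5, 7, 8]
        System.out.println(not(or(newDocs, houseDocs), rentalDocs));     // [1, 5, 7]
    }
}
```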


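A note on the performance of boolean queries over multiple terms. Because of how the inverted index works, a multi-term boolean query first has to fetch the document list of each term and then combine them, so in the naive case it walks as many document lists as there are terms. Lucene does optimize this. Taking AND as an example, see the following code: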

private int doNext(int doc) throws IOException {
  for(;;) {
    // doc may already be NO_MORE_DOCS here, but we don't check explicitly
    // since all scorers should advance to NO_MORE_DOCS, match, then
    // return that value.
    advanceHead: for(;;) {
      for (int i = 1; i < docsAndFreqs.length; i++) {
        // invariant: docsAndFreqs[i].doc <= doc at this point.

        // docsAndFreqs[i].doc may already be equal to doc if we "broke advanceHead"
        // on the previous iteration and the advance on the lead scorer exactly matched.
        if (docsAndFreqs[i].doc < doc) {
          docsAndFreqs[i].doc = docsAndFreqs[i].scorer.advance(doc);

          if (docsAndFreqs[i].doc > doc) {
            // DocsEnum beyond the current doc - break and advance lead to the new highest doc.
            doc = docsAndFreqs[i].doc;
            break advanceHead;
          }
        }
      }
      // success - all DocsEnums are on the same doc
      return doc;
    }
    // advance head for next iteration
    doc = lead.doc = lead.scorer.advance(doc);
  }
}

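First, take the first doc ID that matches the first clause and call it A.

Second, iterate over the other clauses. Advance the second clause to its first doc ID that is greater than or equal to A and call it B. If B is greater than A, no document matches both clauses at A, so the first clause advances to a doc ID greater than or equal to B and a new round begins.

Third, if B equals A, the document matches both clauses, so advance the third clause in the same way, call its doc ID C, and compare A with C exactly as in the second step.

Finally, once every clause has been checked, A is returned if it satisfies all of them. Otherwise the maximum value (NO_MORE_DOCS) is returned, meaning there is no match.

As this process shows, multi-term queries are still relatively expensive.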
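
The steps above can be sketched on plain sorted arrays. This is a simplified illustration of the same leapfrog idea, not the Lucene code: the class LeapfrogSketch, the nested Postings iterator, and the sample doc IDs are all hypothetical, and only NO_MORE_DOCS borrows its name from the excerpt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of leapfrog intersection over sorted doc ID lists.
class LeapfrogSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Minimal forward-only iterator over one sorted list of doc IDs.
    static final class Postings {
        private final int[] docs;
        private int pos = -1;
        int doc = -1;

        Postings(int... docs) {
            this.docs = docs;
        }

        // Move forward to the first doc ID >= target, or NO_MORE_DOCS when exhausted.
        int advance(int target) {
            pos++; // always move past the current doc
            while (pos < docs.length && docs[pos] < target) {
                pos++;
            }
            doc = pos < docs.length ? docs[pos] : NO_MORE_DOCS;
            return doc;
        }
    }

    // Collect every doc ID that appears in the lead list and in all other lists.
    static List<Integer> intersect(Postings lead, Postings... others) {
        List<Integer> hits = new ArrayList<>();
        int candidate = lead.advance(0);            // step 1: first doc of the first clause (A)
        while (candidate != NO_MORE_DOCS) {
            boolean allMatch = true;
            for (Postings other : others) {          // steps 2 and 3: check the other clauses
                if (other.doc < candidate) {
                    other.advance(candidate);
                }
                if (other.doc > candidate) {
                    // This clause skipped past A: jump the lead to the higher doc ID and retry.
                    candidate = lead.advance(other.doc);
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {                          // final step: A satisfies every clause
                hits.add(candidate);
                candidate = lead.advance(candidate + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> hits = intersect(
                new Postings(1, 5, 8, 12),
                new Postings(2, 5, 7, 8, 12),
                new Postings(5, 8, 9));
        System.out.println(hits); // [5, 8]
    }
}
```

Unlike doNext, which returns one matching doc per call, this sketch collects all matches in a loop to keep the example short. The skipping behavior is the same: a clause that lands beyond the candidate lets the lead jump ahead instead of visiting every doc ID.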

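Then there is the precision problem of multi-term boolean queries:

Suppose we query New AND House. Every document in the result contains both New and House. But if what we actually want is New House as an adjacent pair, BooleanQuery may also return documents containing House New or new XXXXX house, which do not meet the requirement. This is a weakness of boolean queries.

Finally, let's look at how Solr builds the scorers for required, optional, and negated terms. In the code these correspond to the required, optional, and prohibited lists respectively: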

public Scorer scorer(AtomicReaderContext context, Bits acceptDocs)
    throws IOException {
  List<Scorer> required = new ArrayList<>();
  List<Scorer> prohibited = new ArrayList<>();
  List<Scorer> optional = new ArrayList<>();
  Iterator<BooleanClause> cIter = clauses.iterator();
  for (Weight w  : weights) {
    BooleanClause c =  cIter.next();
    Scorer subScorer = w.scorer(context, acceptDocs);
    if (subScorer == null) {
      if (c.isRequired()) {
        return null;
      }
    } else if (c.isRequired()) {
      required.add(subScorer);
    } else if (c.isProhibited()) {
      prohibited.add(subScorer);
    } else {
      optional.add(subScorer);
    }
  }

  if (required.size() == 0 && optional.size() == 0) {
    // no required and optional clauses.
    return null;
  } else if (optional.size() < minNrShouldMatch) {
    // either >1 req scorer, or there are 0 req scorers and at least 1
    // optional scorer. Therefore if there are not enough optional scorers
    // no documents will be matched by the query
    return null;
  }

  // simple conjunction
  if (optional.size() == 0 && prohibited.size() == 0) {
    float coord = disableCoord ? 1.0f : coord(required.size(), maxCoord);
    return new ConjunctionScorer(this, required.toArray(new Scorer[required.size()]), coord);
  }

  // simple disjunction
  if (required.size() == 0 && prohibited.size() == 0 && minNrShouldMatch <= 1 && optional.size() > 1) {
    float coord[] = new float[optional.size()+1];
    for (int i = 0; i < coord.length; i++) {
      coord[i] = disableCoord ? 1.0f : coord(i, maxCoord);
    }
    return new DisjunctionSumScorer(this, optional.toArray(new Scorer[optional.size()]), coord);
  }

  // Return a BooleanScorer2
  return new BooleanScorer2(this, disableCoord, minNrShouldMatch, required, prohibited, optional, maxCoord);
}
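
To connect the query syntax from 5.1.1 to 5.1.3 with the three lists in this method, here is a rough Java sketch that sorts the terms of a query string into the same three buckets. It is not Solr's query parser: the class ClauseBucketsSketch and its methods are hypothetical, and it only understands the + and - prefixes on whitespace-separated terms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: bucket query terms by their + / - prefix.
class ClauseBucketsSketch {
    final List<String> required = new ArrayList<>();
    final List<String> optional = new ArrayList<>();
    final List<String> prohibited = new ArrayList<>();

    static ClauseBucketsSketch parse(String query) {
        ClauseBucketsSketch buckets = new ClauseBucketsSketch();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            if (token.startsWith("+") && token.length() > 1) {
                buckets.required.add(token.substring(1));    // +term: must match
            } else if (token.startsWith("-") && token.length() > 1) {
                buckets.prohibited.add(token.substring(1));  // -term: must not match
            } else {
                buckets.optional.add(token);                 // bare term: may match
            }
        }
        return buckets;
    }

    // Mirrors the first check in the excerpt: with no required and no optional
    // clauses there is nothing to match, so prohibited-only queries return nothing.
    boolean canMatch() {
        return !(required.isEmpty() && optional.isEmpty());
    }

    public static void main(String[] args) {
        ClauseBucketsSketch buckets = parse("+new house -rental");
        System.out.println("required=" + buckets.required
                + " optional=" + buckets.optional
                + " prohibited=" + buckets.prohibited);
        System.out.println(parse("-rental").canMatch()); // false
    }
}
```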

5.2 Phrase Query

"new home" OR "new house"
"3 bedrooms" AND "walk in closet" AND "granite countertops"

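The BooleanQuery section pointed out two weaknesses: imprecise results and query cost. When the goal is to find adjacent terms, phrase queries address both, mainly by using term positions. The table below shows an inverted index that also records term positions, that is, where each term occurs within each document. This increases the size of the index files, but it makes phrase queries much easier to evaluate.

[Table: inverted index with term positions]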

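Pulling out the entries for the two terms again, we can see that in documents 5 and 8, new and home are adjacent. This improves both query speed and result quality. Ranked by result quality: PhraseQuery > BooleanQuery > FuzzyQuery.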
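
A small Java sketch can show how term positions make the adjacency check possible. It is an illustration under simple assumptions, not Lucene's PhraseQuery implementation: the class PhraseMatchSketch, its methods, and the sample documents are invented, tokens are split on whitespace, and only two-term phrases are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a positional inverted index and a two-term phrase check.
class PhraseMatchSketch {
    // term -> (doc ID -> positions of the term inside that document)
    private final Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();

    void add(int docId, String text) {
        String[] terms = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < terms.length; pos++) {
            if (terms[pos].isEmpty()) continue;
            index.computeIfAbsent(terms[pos], t -> new TreeMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    // Documents in which `second` occurs exactly one position after `first`.
    // Both arguments are expected in lowercase, matching how terms are indexed.
    List<Integer> phrase(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        Map<Integer, List<Integer>> firstDocs = index.getOrDefault(first, Map.of());
        Map<Integer, List<Integer>> secondDocs = index.getOrDefault(second, Map.of());
        for (Map.Entry<Integer, List<Integer>> entry : firstDocs.entrySet()) {
            List<Integer> secondPositions = secondDocs.get(entry.getKey());
            if (secondPositions == null) continue; // document lacks the second term
            for (int p : entry.getValue()) {
                if (secondPositions.contains(p + 1)) {
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PhraseMatchSketch index = new PhraseMatchSketch();
        index.add(5, "buy a new home today");
        index.add(8, "new home listings");
        index.add(9, "home sweet new place");  // both terms, wrong order
        index.add(10, "new large home");       // both terms, not adjacent
        System.out.println(index.phrase("new", "home")); // [5, 8]
    }
}
```

Documents 9 and 10 would both match new AND home, but the position check rejects them, which is the precision gain described above.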


5.3 FuzzyQuery

To be continued tomorrow.


Original article: http://www.cnblogs.com/rcfeng/p/4064065.html
