Solr Similarity Algorithms

Published: 2018-02-02 18:39:23

Introduction

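Before Solr 6, the default similarity (scoring) calculation is based on the vector space model (VSM). From Solr 6 onward, the default is Okapi BM25, an extension of the binary independence model, which belongs to the family of probabilistic models.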

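Retrieval models are usually grouped into:

Boolean (binary) model
Vector space model (VSM)
- TF-IDF
- keyword-based retrieval
Probabilistic model
- Okapi BM25
Machine learning model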

The similarity tag

The <similarity> element declares the model used to compute similarity, and users can customize it.
Example:

      <similarity class="solr.DFRSimilarityFactory">
          <str name="basicModel">P</str>
          <str name="afterEffect">L</str>
          <str name="normalization">H2</str>
          <float name="c">7</float>
      </similarity>

The element can also be declared inside a specific field type, so that the similarity applies only to fields of that type.
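
As a sketch of the per-field-type form, the fragment below nests the same DFR settings inside a field type. The field type name `text_dfr` and its analyzer are hypothetical, chosen only for illustration. As far as I know, per-field-type similarity is only honored when the global similarity is `solr.SchemaSimilarityFactory`, so that is declared explicitly here; check the schema documentation for your Solr version.

```xml
<!-- Global similarity that defers to per-field-type settings where present -->
<similarity class="solr.SchemaSimilarityFactory"/>

<!-- Hypothetical field type; name and analyzer are illustrative only -->
<fieldType name="text_dfr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <!-- Applies only to fields declared with type="text_dfr" -->
  <similarity class="solr.DFRSimilarityFactory">
    <str name="basicModel">P</str>
    <str name="afterEffect">L</str>
    <str name="normalization">H2</str>
    <float name="c">7</float>
  </similarity>
</fieldType>
```

Fields of any other type keep the default similarity.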

VSM

In Solr, VSM scoring is provided by Lucene's TF-IDF based classic similarity (ClassicSimilarity).
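
For reference, the practical scoring function that the Lucene Javadoc gives for TFIDFSimilarity (the base of ClassicSimilarity) in releases before Lucene 7 can, as I understand it, be written as follows. It is not taken from the original post, so verify it against the Javadoc for your version:

$$
\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \right)
$$

Here tf(t in d) is the square root of the term's frequency in the document, and idf(t) = 1 + log(numDocs / (docFreq + 1)). coord(q, d) rewards documents that match more of the query terms, queryNorm(q) makes scores comparable across queries, boost(t) is the query-time boost of the term, and norm(t, d) carries the index-time length normalization.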

Okapi BM25

https://events.static.linuxfound.org/sites/events/files/slides/bm25.pdf

$$
\mathrm{Score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \in d) \cdot (k + 1)}{\mathrm{tf}(t \in d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
$$

Where:

- $t$ = term; $d$ = document; $q$ = query; $i$ = index
- $\mathrm{tf}(t \in d) = \text{numTermOccurrencesInDocument}$ ?
- $\mathrm{idf}(t) = 1 + \log\left(\frac{\text{numDocs}}{\text{docFreq} + 1}\right)$
- $|d| = \sum_{t \in d} 1$
- $\mathrm{avgdl} = \frac{\sum_{d \in i} |d|}{\sum_{d \in i} 1}$
- $k$ = free parameter, usually ~1.2 to 2.0. Increases the term frequency saturation point.
- $b$ = free parameter, usually ~0.75. Increases the impact of document normalization.
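
In Solr, the two free parameters can be set on `solr.BM25SimilarityFactory`, where k goes by the name `k1`. A minimal sketch, with values chosen from the usual ranges quoted above rather than tuned for any real collection:

```xml
<!-- BM25 with explicit free parameters; the values are illustrative -->
<similarity class="solr.BM25SimilarityFactory">
  <!-- k1: controls how quickly repeated occurrences of a term saturate -->
  <float name="k1">1.2</float>
  <!-- b: 0 disables length normalization, 1 applies it fully -->
  <float name="b">0.75</float>
</similarity>
```

To see the saturation effect, take a document of average length (|d| = avgdl), so that the length term equals 1 and the term-frequency factor reduces to tf · (k + 1) / (tf + k). With k = 1.2 this gives 1.0 for tf = 1, 1.375 for tf = 2, and about 1.96 for tf = 10, approaching but never reaching the ceiling k + 1 = 2.2.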

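Learning to Rank (LTR)

Solr also supports LTR.
This area assumes some machine learning background. Without it, you will need to look things up as you read the documentation. For someone like me, it is something to skip for now (-_-).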
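
As a starting point, enabling the LTR contrib module in solrconfig.xml looks roughly like the fragment below. This is a sketch based on my reading of the Solr 6.6 reference guide, not a tested configuration: the cache sizes are illustrative, and the class names should be checked against the guide linked in this section. The LTR jar must also be on the classpath.

```xml
<!-- Query parser that re-ranks results using an uploaded LTR model -->
<queryParser name="ltr" class="org.apache.solr.ltr.search.LTRQParserPlugin"/>

<!-- Cache for feature vectors; sizes here are illustrative -->
<cache name="QUERY_DOC_FV"
       class="solr.search.LRUCache"
       size="4096"
       initialSize="2048"
       autowarmCount="4096"
       regenerator="solr.search.NoOpRegenerator"/>

<!-- Document transformer that returns the extracted feature values -->
<transformer name="features"
             class="org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory">
  <str name="fvCacheName">QUERY_DOC_FV</str>
</transformer>
```

Features and models are then uploaded separately, and a query asks for re-ranking with a parameter of the form `rq={!ltr model=myModel reRankDocs=100}`, where `myModel` is a hypothetical model name.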
For details, see:
https://lucene.apache.org/solr/guide/6_6/learning-to-rank.html
https://www.microsoft.com/en-us/research/project/mslr/
https://events.static.linuxfound.org/sites/events/files/slides/bm25.pdf
http://opensourceconnections.com/blog/2014/12/08/title-search-when-relevancy-is-only-skin-deep/
https://lucene.apache.org/solr/guide/6_6/relevance.html

Original article: https://www.cnblogs.com/lotushy/p/8406143.html
