sphinx continued (4): how coreseek works

Posted: 2016-05-13 23:13:35

Original article: http://blog.itpub.net/29806344/viewspace-1399621/

Before analyzing how sphinx works, let me first clear up why the name coreseek keeps coming up.

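The reason is that sphinx does not support Chinese indexing and search out of the box. coreseek is a full-text search server built on top of sphinx. It provides libmmseg, a Chinese word-segmentation package designed for sphinx that contains the mmseg segmenter, and it is currently the most widely used way to do Chinese search with sphinx.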
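Before sphinx, running a full-text search over the words of a huge number of articles in a MySQL database usually meant a statement such as: SELECT *** WHERE *** LIKE '%word%'; A LIKE query whose pattern starts with the % wildcard cannot use MySQL's own index, so it needs a full table scan, and that is painfully slow!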
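To make the full-scan point concrete, here is a minimal sketch that asks a query planner how it would run a leading-wildcard LIKE versus a prefix LIKE. It rests on loud assumptions: SQLite (through Python's built-in sqlite3 module) stands in for MySQL, and the table `articles` and index `idx_title` are made up for illustration. The planner output format is SQLite's, not MySQL's, so treat it only as an analogy.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; `articles` and `idx_title` are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT COLLATE NOCASE)"
)
conn.execute("CREATE INDEX idx_title ON articles (title)")
conn.executemany(
    "INSERT INTO articles (title) VALUES (?)",
    [("test one",), ("test two",), ("another doc",), ("doc number four",)],
)


def plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return [row[-1] for row in rows]


# Pattern starts with a wildcard: the index cannot narrow the search.
leading = plan("SELECT id FROM articles WHERE title LIKE '%test%'")
# Pattern starts with a literal prefix: the index can be used as a range.
prefix = plan("SELECT id FROM articles WHERE title LIKE 'test%'")

print("LIKE '%test%':", leading)
print("LIKE 'test%' :", prefix)
```

In this sketch the first plan is reported as a scan over every entry, while the second is reported as an index search. The same query against a real MySQL table can be checked with MySQL's own EXPLAIN statement.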

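With sphinx, the full-text indexing is handed over to sphinx. sphinx returns the IDs of the rows that contain the word, and those IDs are then used to fetch exactly those rows straight from the database. The whole process is shown in the figure below: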

[Figure: sphinx query flow]
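The two-step flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's code: SQLite stands in for MySQL, the `documents` table and the `fetch_by_ids` helper are hypothetical names, and step 1 is hard-coded to the IDs that the search tool reports for the word test later in this post instead of calling sphinx.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the `documents` table is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        (1, "test one", "this is my test document number one. "
                        "also checking search within phrases."),
        (2, "test two", "this is my test document number two"),
        (3, "another doc", "this is another group"),
        (4, "doc number four", "this is to test groups"),
    ],
)


def fetch_by_ids(ids):
    """Step 2: load full rows by primary key, keeping the ranking order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    found = conn.execute(
        f"SELECT id, title FROM documents WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in found}
    # IN (...) does not guarantee order, so re-sort by the engine's ranking.
    return [by_id[i] for i in ids if i in by_id]


# Step 1 (hard-coded here): the IDs the search engine returned for "test".
matched_ids = [1, 2, 4]
rows = fetch_by_ids(matched_ids)
print(rows)
```

The lookup in step 2 goes through the primary key, which is why it stays cheap even when the table is large.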

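The sphinx index files do not store the complete data, only arrays made up of IDs and tokens. The index files cannot be inspected directly, but we can verify this with the search tool: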

First build the index:

/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf

Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]

Beijing Choice Software Technologies Inc (http://www.coreseek.com)

Then use search to look up the word test:

/usr/local/coreseek/bin/search test -c /usr/local/coreseek/etc/sphinx.conf

Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]

Beijing Choice Software Technologies Inc (http://www.coreseek.com)

using config file '/usr/local/coreseek/etc/sphinx.conf'...

index 'test1': query 'test ': returned 3 matches of 3 total in 0.050 sec

displaying matches:

1. document=1, weight=2421, group_id=1, date_added=Thu Jan 8 21:43:32 2015

id=1

group_id=1

group_id2=5

date_added=2015-01-08 21:43:32

title=test one

content=this is my test document number one. also checking search within phrases.

2. document=2, weight=2421, group_id=1, date_added=Thu Jan 8 21:43:32 2015

id=2

group_id=1

group_id2=6

date_added=2015-01-08 21:43:32

title=test two

content=this is my test document number two

3. document=4, weight=1442, group_id=2, date_added=Thu Jan 8 21:43:32 2015

id=4

group_id=2

group_id2=8

date_added=2015-01-08 21:43:32

title=doc number four

content=this is to test groups

words:

1. 'test': 3 documents, 5 hits

Then use search to look up the word this:

/usr/local/coreseek/bin/search this -c /usr/local/coreseek/etc/sphinx.conf

Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]

Beijing Choice Software Technologies Inc (http://www.coreseek.com)

using config file '/usr/local/coreseek/etc/sphinx.conf'...

index 'this1': query 'this ': returned 4 matches of 4 total in 0.000 sec

displaying matches:

1. document=1, weight=1304, group_id=1, date_added=Thu Jan 8 21:43:32 2015

id=1

group_id=1

group_id2=5

date_added=2015-01-08 21:43:32

title=test one

content=this is my test document number one. also checking search within phrases.

2. document=2, weight=1304, group_id=1, date_added=Thu Jan 8 21:43:32 2015

id=2

group_id=1

group_id2=6

date_added=2015-01-08 21:43:32

title=test two

content=this is my test document number two

3. document=3, weight=1304, group_id=2, date_added=Thu Jan 8 21:43:32 2015

id=3

group_id=2

group_id2=7

date_added=2015-01-08 21:43:32

title=another doc

content=this is another group

4. document=4, weight=1304, group_id=2, date_added=Thu Jan 8 21:43:32 2015

id=4

group_id=2

group_id2=8

date_added=2015-01-08 21:43:32

title=doc number four

content=this is to test groups

words:

1. 'this': 4 documents, 4 hits

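From this we can see that searching for a keyword mainly returns an array of the table's row IDs together with their hit statistics (a weight per match, plus document and hit counts per word).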
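The word statistics in the output (test: 3 documents, 5 hits; this: 4 documents, 4 hits) can be reproduced with a toy inverted index over the four sample rows. This is only a sketch of the idea: the real sphinx index format is different, the tokenizer here is a simple regular expression, and `build_index` and `word_stats` are hypothetical helper names.

```python
import re
from collections import defaultdict

# The four sample rows (title, content) as they appear in the search output.
DOCS = {
    1: ("test one", "this is my test document number one. "
                    "also checking search within phrases."),
    2: ("test two", "this is my test document number two"),
    3: ("another doc", "this is another group"),
    4: ("doc number four", "this is to test groups"),
}


def build_index(docs):
    """Map each word to {doc_id: number of hits in title + content}."""
    index = defaultdict(dict)
    for doc_id, fields in docs.items():
        for field in fields:
            for word in re.findall(r"\w+", field.lower()):
                index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index


def word_stats(index, word):
    """Return (matching IDs, document count, total hit count) for a word."""
    postings = index.get(word, {})
    return sorted(postings), len(postings), sum(postings.values())


index = build_index(DOCS)
for word in ("test", "this"):
    ids, n_docs, n_hits = word_stats(index, word)
    print(f"'{word}': ids={ids}, {n_docs} documents, {n_hits} hits")
```

Note that the index holds only words and IDs, not the rows themselves, which matches the point that the full data still has to be fetched from the database by ID.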

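Note: a critical problem may already have occurred to you. Once the sphinx full-text index has been built, rows newly added to MySQL cannot be found by sphinx until indexer is run again! Even with the --rotate option, reindexing takes a long time when there is a lot of data. So how do we solve this? Tomorrow I will cover main and delta (incremental) indexes, and using cron to get new data added to the delta index automatically.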
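The staleness problem can be shown with a small simulation. This is an assumption-laden sketch, not sphinx itself: a Python dict plays the role of the MySQL table, `build_index` is a hypothetical stand-in for running indexer, and row 5 is an invented new row.

```python
import re


def build_index(rows):
    """Snapshot: map each word to the set of row IDs that contain it."""
    index = {}
    for row_id, text in rows.items():
        for word in set(re.findall(r"\w+", text.lower())):
            index.setdefault(word, set()).add(row_id)
    return index


rows = {1: "test one", 2: "test two"}   # stand-in for the MySQL table
index = build_index(rows)               # like running indexer once

rows[5] = "a brand new test row"        # row inserted after indexing
stale = sorted(index.get("test", set()))

index = build_index(rows)               # a full rebuild picks up the new row
fresh = sorted(index.get("test", set()))

print("before rebuild:", stale)
print("after rebuild: ", fresh)
```

The rebuild here walks every row again, which is exactly the cost that grows with the size of the data.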



Source: http://www.cnblogs.com/bjfy/p/5491424.html
