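The indexer has to fetch all document data through the main query. A simple implementation is to read the whole table into memory, but this can lock the entire table and block other operations (for example, INSERTs on MyISAM tables), and it also wastes a lot of memory on storing the query result. To avoid this, CoreSeek/Sphinx supports a technique called ranged queries. CoreSeek/Sphinx first fetches the minimum and maximum document IDs from the database, then splits the interval of natural numbers defined by those two values into several parts, and fetches and indexes the data one part at a time. An example follows: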
Example 3.1. Ranged query usage example
# in sphinx.conf
sql_query_range = SELECT MIN(id),MAX(id) FROM documents
sql_range_step = 1000
sql_query = SELECT * FROM documents WHERE id>=$start AND id<=$end
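If the minimum and maximum values of the id column in this table (documents) are 1 and 2345, sql_query is executed 3 times:
1. with $start replaced by 1 and $end replaced by 1000;
2. with $start replaced by 1001 and $end replaced by 2000;
3. with $start replaced by 2001 and $end replaced by 2345.
Obviously, for a table of only about 2,000 rows a ranged query makes little difference compared with reading the whole table at once, but once the table grows to tens of millions of rows (especially a MyISAM table), ranged queries help.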
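The splitting described above can be sketched in a few lines of Python. This is only a model of the behaviour the text describes, not the indexer's actual code; the function name split_ranges is made up for illustration.

```python
from typing import Iterator, Tuple


def split_ranges(min_id: int, max_id: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) pairs that cover min_id..max_id in chunks of `step`."""
    if step <= 0:
        raise ValueError("step must be positive")
    start = min_id
    while start <= max_id:
        # The last chunk is clipped to max_id instead of running past it.
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


if __name__ == "__main__":
    # Same numbers as Example 3.1: ids 1..2345, sql_range_step = 1000.
    for start, end in split_ranges(1, 2345, 1000):
        print(f"SELECT * FROM documents WHERE id>={start} AND id<={end}")
```

Run with the values from the example, it prints three statements, one per range, which matches the three executions listed above.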
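Earlier, while building a domain MX resolution system, I collected the www page titles of several million domains, so the test below searches that www site title data.
Edit the coreseek configuration file used for the test, csft.range.conf: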
source src
{
    type = mysql
    # some straightforward parameters for SQL source types
    sql_host = localhost
    sql_user = root
    sql_pass = xxxxxxxxxxxxx
    sql_db = whomx
    sql_port = 3306 # optional, default is 3306
    sql_query_pre = SET NAMES utf8
    sql_query_pre = SET SESSION query_cache_type=OFF
    sql_query = \
        SELECT i.id,title \
        FROM mx_domain_wwwinfo i \
        WHERE id>=$start AND id<=$end
    sql_query_range = SELECT MIN(id),MAX(id) FROM mx_domain_wwwinfo
}
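The configuration above does not set sql_range_step, so the indexer falls back to its default step, which the Sphinx documentation gives as 1024. The sketch below estimates what that means for this table by expanding the sql_query template once per range. The id bounds are not shown in this post, so 1 and 1500837 (the document count reported by the indexer) are assumptions used purely for illustration, and render_range_queries is a hypothetical helper, not part of CoreSeek/Sphinx.

```python
from typing import List

# The sql_query from csft.range.conf, joined onto one line.
SQL_QUERY = (
    "SELECT i.id,title FROM mx_domain_wwwinfo i "
    "WHERE id>=$start AND id<=$end"
)


def render_range_queries(template: str, min_id: int, max_id: int, step: int) -> List[str]:
    """Expand $start/$end in `template` once per inclusive id range."""
    if step <= 0:
        raise ValueError("step must be positive")
    queries = []
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        queries.append(
            template.replace("$start", str(start)).replace("$end", str(end))
        )
        start = end + 1
    return queries


if __name__ == "__main__":
    # Assumed bounds: MIN(id)=1, MAX(id)=1500837; step 1024 is the documented default.
    queries = render_range_queries(SQL_QUERY, 1, 1_500_837, 1024)
    print(len(queries), "range queries")
    print(queries[0])
    print(queries[-1])
```

Under those assumed bounds the main query is issued in many small chunks rather than as one full-table read; setting sql_range_step explicitly in the source block would trade fewer round trips against larger result sets per query.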
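In the index section, only the Chinese character encoding and the location of the Chinese dictionary need to be configured.
The indexer and searchd sections need no changes.
Next, run a test: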
root@timeless-HP-Pavilion-g4-Notebook-PC:/usr/local/coreseek/etc# /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.range.conf --all --rotate
Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)
using config file '/usr/local/coreseek/etc/csft.range.conf'...
WARNING: failed to open pid_file '/usr/local/coreseek/var/log/searchd.pid'.
indexing index 'src'...
WARNING: Attribute count is 0: switching to none docinfo
collected 1500837 docs, 92.2 MB
sorted 17.5 Mhits, 100.0% done
total 1500837 docs, 92186221 bytes
total 34.680 sec, 2658122 bytes/sec, 43275.54 docs/sec
total 16 reads, 0.023 sec, 3631.5 kb/call avg, 1.4 msec/call avg
total 143 writes, 0.105 sec, 912.1 kb/call avg, 0.7 msec/call avg
root@timeless-HP-Pavilion-g4-Notebook-PC:/usr/local/coreseek/bin# ./search -c /usr/local/coreseek/etc/csft.range.conf 济南
Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)
using config file '/usr/local/coreseek/etc/csft.range.conf'...
index 'src': query '济南 ': returned 1000 matches of 11040 total in 0.005 sec
displaying matches:
1. document=53592, weight=1664 id=53592 domain_id=75937 title=?????????????,??????,??????,?????????????????????????????????????,??????,??????,????,????,??????,??????,??????,??????,????????,????,??????,??????,??????,??????,????????,????????,?????????,????????? addtime=1419001556
2. document=156494, weight=1663 id=156494 domain_id=320070 title=??--??????,?????,????????,????????,??????,??????,?????,?????,???,?????,?????,?????,?????,????,????,??????,??????,??????,?????,?????,?????,???,???????, addtime=1419041933
3. document=53624, weight=1661 id=53624 domain_id=74960 title=???????-???.??.???.???????/?????/?????/?????????/?????/?????/???? ????? ????? ????? ????? ????? ????? ??POS??? ????? ????? ????? ??POS? addtime=1419001559
4. document=908267, weight=1661 id=908267 domain_id=3482035 title=???????-???.??.???.???????/?????/?????/?????????/?????/?????/???? ????? ????? ????? ????? ????? ????? ??POS??? ????? ????? ????? ??POS? addtime=1421983846
5. document=1074259, weight=1659 id=1074259 domain_id=2805964 title=?????? - ???? | ????? | ????? | ?????? | ?????? | ?????? | ?????? | ?????? | ?????? | ?????? | ?????? | ?????? | ?????? | ?????? addtime=1421998317
6. document=628662, weight=1658 id=628662 domain_id=1603934 title=????????|????????|?????????|?????????|??????|?????|????????????????|????????|?????????|?????????|??????|?????|???????? addtime=1420628500
7. document=82498, weight=1656 id=82498 domain_id=75205 title=??????????????????????????????????????????????????????????????????????????????????????????????????????? addtime=1419030813
8. document=373234, weight=1656 id=373234 domain_id=75953 title=????|??????|????????|??????|??????|??????|??????|??????|??????|??????|??????-???????? addtime=1419481313
9. document=97657, weight=1655 id=97657 domain_id=75152 title=???????????????????????????????????????????????????????????????????? addtime=1419032238
10. document=108426, weight=1655 id=108426 domain_id=76651 title=??????|??????|??SKF??|??NSK??|??FAG??|??NTN??|??KOYO??|??TIMKEN??|??FAG??|????|????????|????????|??????|-?? addtime=1419033228
11. document=184337, weight=1655 id=184337 domain_id=75654 title=???????|??????????|???????|??????????|??????|???????|?????????|????????|?????????|???????|?????????? addtime=1419043496
12. document=246303, weight=1655 id=246303 domain_id=262037 title=???? ?????? ?????? ?????? ????? ?????? ???? ???? ?????? ???? ?????? addtime=1419046975
13. document=261372, weight=1655 id=261372 domain_id=544595 title=??????|????|?????|?????????|???????????|??????|??????|????|?????|?????|?????|????? addtime=1419215630
14. document=1163692, weight=1655 id=1163692 domain_id=2514244 title=??????????????????????????????????????????????????????????????????????????????_?????????????? addtime=1422005290
15. document=1163740, weight=1655 id=1163740 domain_id=2514240 title=?????????????????????????????????????????????????????????????????????????_?????????????? addtime=1422005293
16. document=1163762, weight=1655 id=1163762 domain_id=2514239 title=????????????????????????????????????????????????????????????????????????????_?????????????? addtime=1422005295
17. document=10694, weight=1653 id=10694 domain_id=454049 title=??????|??????|??????|??????|??????|?????|??????|????????|???????????|????????????|???400-070-3005 addtime=1418996572
18. document=15876, weight=1653 id=15876 domain_id=66098 title=????????? ???????????? ??????? ?????????? ??????? ??????? ????????? ??????? ??????? ???????_????????? addtime=1418997101
19. document=23385, weight=1653 id=23385 domain_id=421622 title=????0531-82825553|??????|??????????????|????????|???????|??????|????T1|????T3|????T6|????U8 addtime=1418997836
20. document=34628, weight=1653 id=34628 domain_id=320077 title=????|?????|?????|?????|?????|?????|?????|?????|????|?????????? addtime=1418998927
words:
1. '济南': 11040 documents, 22214 hits
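The result can be seen in the last line: 1. '济南': 11040 documents, 22214 hits. The question marks in the titles above are only a display encoding problem.
One more question remains: how do you combine incremental indexing with ranged queries? The next article, based on a document found on Baidu Wenku titled
《千万级Discuz!数据全文检索方案(Sphinx)》 ("A Full-Text Search Solution for Ten-Million-Row Discuz! Data (Sphinx)"), puts the two together to implement search with coreseek.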
Original article: http://www.cnblogs.com/timelesszhuang/p/4771106.html