
Installing and Using Sphinx on Windows (with Chinese Full-Text Search Support)


A while back a colleague mentioned that one of our company's internal search features is built on Sphinx, but I never found the time to dig into it. Today I finally have a moment, so let's give it a spin. Sphinx is usually deployed on Linux; this time I'll install and test it on Windows.

Preface:

I. About Sphinx

Sphinx is a full-text search engine released under GPLv2; for commercial licensing (for example, embedding it in other applications), contact the author (sphinxsearch.com).
Generally speaking, Sphinx is a standalone search engine meant to provide fast, low-footprint, highly relevant full-text search to other applications. It integrates easily with SQL databases and scripting languages.
It has built-in support for MySQL and PostgreSQL data sources, and can also read XML data in a specific format from standard input. By modifying the source code, users can add new data sources (for example, native support for other kinds of DBMS).
Search APIs are available for PHP, Python, Perl, Ruby, and Java, and Sphinx can also be used as a MySQL storage engine. The API is simple and can be ported to a new language within a few hours.

This article aims to provide a painless way to install and configure Sphinx on Windows with Chinese full-text search support; the configuration part applies to Linux as well.


Sphinx features:

Fast indexing (peak performance of up to 10 MB/sec on modern CPUs);
High-performance search (average query time under 0.1 s on 2-4 GB of text);
Scales to very large datasets (known to handle over 100 GB of text, or 100 million documents on a single-CPU system);
Excellent relevance, using a combined ranking method based on phrase proximity and BM25 statistics;
Distributed search support;
Document excerpt (snippet) generation;
Can serve searches as a MySQL storage engine;
Multiple query modes, including boolean, phrase, and word-similarity matching;
Up to 32 full-text fields per document;
Additional per-document attributes (for example, group IDs or timestamps);
Stopword filtering;
Support for single-byte encodings and UTF-8;
Native MySQL support (both MyISAM and InnoDB);
Native PostgreSQL support.

Chinese manual download:
sphinx_doc_zhcn_0.9

II. Installing Sphinx on Windows
1. Download the latest Windows build from http://sphinxsearch.com/downloads/release/ (I used sphinx-2.0.6-release-win64-id64-full) and extract it to E:\webserver\sphinx.

2. Under E:\webserver\sphinx, create a data directory for the index files and a log directory for the log files, then copy E:\webserver\sphinx\sphinx.conf.in to E:\webserver\sphinx\sphinx.conf (note the file name change).
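In a command prompt, the equivalent steps are (assuming the extraction directory used above):

E:\webserver\sphinx>mkdir data
E:\webserver\sphinx>mkdir log
E:\webserver\sphinx>copy sphinx.conf.in sphinx.conf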

3. Edit E:\webserver\sphinx\sphinx.conf; the settings that need changing are listed here:

type           = mysql # data source type; mysql here
sql_host       = localhost # database server
sql_user       = root # database user
sql_pass       =  # database password (empty here)
sql_db         = test # database name
sql_port       = 3306 # database port
sql_query_pre      = SET NAMES utf8 # uncomment this line if your database is utf8-encoded
index test1
{
# directory and file-name prefix for the index files
  path      = E:/webserver/sphinx/data
# encoding
  charset_type     = utf-8
  # charset table for utf-8
  charset_table     = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
  # simple CJK segmentation; only 0 and 1 are supported; set to 1 to search Chinese
  ngram_len       = 1
# characters treated as CJK for n-gram indexing; uncomment to search Chinese
  ngram_chars      = U+3000..U+2FA1F
}
# settings to change for the search daemon
searchd
{
  # log file
  log        = E:/webserver/sphinx/log/searchd.log
  # PID file, searchd process ID file name
  pid_file      = E:/webserver/sphinx/log/searchd.pid
  # on Windows you MUST keep this commented out for searchd to start
  # seamless_rotate     = 1
}

If you are not using distributed indexes, comment out the following section:

# index dist1
# {
 # 'distributed' index type MUST be specified
 # type    = distributed
 
 # local index to be searched
 # there can be many local indexes configured
 # local    = test1
 # local    = test1stemmed
 
 # remote agent
 # multiple remote agents may be specified
 # syntax is 'hostname:port:index1,[index2[,...]]'
 # agent    = localhost:3313:remote1
 # agent    = localhost:3314:remote2,remote3
 
 # remote agent connection timeout, milliseconds
 # optional, default is 1000 ms, ie. 1 sec
 # agent_connect_timeout = 1000
 
 # remote agent query timeout, milliseconds
 # optional, default is 3000 ms, ie. 3 sec
 # agent_query_timeout  = 3000
# }

4. Import the test data
E:\webserver\MySQL Server 5.5\bin>mysql -uroot test < E:/webserver/sphinx/example.sql

5. Build the index

E:\webserver\sphinx\bin>indexer.exe test1 (note: test1 refers to the index test1 { ... } block defined in sphinx.conf)
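If indexer.exe complains that it cannot locate the configuration file, pass it explicitly; --config, --all, and --rotate are standard indexer switches (the paths here assume the install directory used above):

E:\webserver\sphinx\bin>indexer.exe --config E:\webserver\sphinx\sphinx.conf test1
E:\webserver\sphinx\bin>indexer.exe --config E:\webserver\sphinx\sphinx.conf --all --rotate

The second form rebuilds every index defined in the config and, with --rotate, hot-swaps the new files into a running searchd instead of failing on locked index files.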

6. Test a search for 'this'

7. Test a Chinese search for '我啊'
Seemingly nothing is found. That's because the Windows command line uses GBK encoding, so the UTF-8 query naturally cannot match. Let's try it from a program instead: create a file named foo.php under E:\webserver\sphinx\api, making sure it is saved in UTF-8.

<?php
// load the Sphinx client library
require 'sphinxapi.php';
$s = new SphinxClient();
$s->SetServer('localhost', 9312);
$result = $s->Query('中文');
var_dump($result);
?>
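The snippet above just dumps the raw result array. As a slightly fuller sketch against the same stock sphinxapi.php client (the index name test1 and the group_id attribute come from the configuration above; the filter values are only illustrative), you would normally pick a match mode, limit the result set, and check for errors:

<?php
// load the Sphinx client library (ships in the api/ directory of the package)
require 'sphinxapi.php';

$s = new SphinxClient();
$s->SetServer('localhost', 9312);       // searchd address from sphinx.conf
$s->SetMatchMode(SPH_MATCH_EXTENDED2);  // enable the extended query syntax
$s->SetFilter('group_id', array(1, 2)); // keep only groups 1 and 2 (sql_attr_uint above)
$s->SetLimits(0, 20);                   // return the first 20 matches

$result = $s->Query('中文', 'test1');   // search the test1 index only
if ($result === false) {
    echo 'Query failed: ' . $s->GetLastError() . "\n";
} elseif (!empty($result['matches'])) {
    foreach ($result['matches'] as $id => $match) {
        echo "doc=$id, weight={$match['weight']}\n";
    }
}
?>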

Start the Sphinx searchd service:
E:\webserver\sphinx\bin>searchd.exe

Run the PHP query:
Visit http://www.test.com/sphinx/api/foo.php (a virtual host configured on my machine).

At this point, the Sphinx server-side setup on Windows is complete.

Here is the full conf used for the Sphinx tests above:

#
# Sphinx configuration file sample
#
# WARNING! While this sample file mentions all available options,
# it contains (very) short helper descriptions only. Please refer to
# doc/sphinx.html for details.
#
 
#############################################################################
## data source definition
#############################################################################
 
source src1
{
    # data source type. mandatory, no default value
    # known types are mysql, pgsql, mssql, xmlpipe, xmlpipe2, odbc
    type            = mysql
 
    #####################################################################
    ## SQL settings (for ‘mysql‘ and ‘pgsql‘ types)
    #####################################################################
 
    # some straightforward parameters for SQL source types
    sql_host        = 127.0.0.1
    sql_user        = root
    sql_pass        = shadow
    sql_db          = test
    sql_port        = 3306  # optional, default is 3306
 
    # UNIX socket name
    # optional, default is empty (reuse client library defaults)
    # usually ‘/var/lib/mysql/mysql.sock‘ on Linux
    # usually ‘/tmp/mysql.sock‘ on FreeBSD
    #
    # sql_sock      = /tmp/mysql.sock
 
 
    # MySQL specific client connection flags
    # optional, default is 0
    #
    # mysql_connect_flags   = 32 # enable compression
 
    # MySQL specific SSL certificate settings
    # optional, defaults are empty
    #
    # mysql_ssl_cert        = /etc/ssl/client-cert.pem
    # mysql_ssl_key     = /etc/ssl/client-key.pem
    # mysql_ssl_ca      = /etc/ssl/cacert.pem
 
    # MS SQL specific Windows authentication mode flag
    # MUST be in sync with charset_type index-level setting
    # optional, default is 0
    #
    # mssql_winauth     = 1 # use currently logged on user credentials
 
 
    # MS SQL specific Unicode indexing flag
    # optional, default is 0 (request SBCS data)
    #
    # mssql_unicode     = 1 # request Unicode data from server
 
 
    # ODBC specific DSN (data source name)
    # mandatory for odbc source type, no default value
    #
    # odbc_dsn      = DBQ=C:\data;DefaultDir=C:\data;Driver={Microsoft Text Driver (*.txt; *.csv)};
    # sql_query     = SELECT id, data FROM documents.csv
 
 
    # ODBC and MS SQL specific, per-column buffer sizes
    # optional, default is auto-detect
    #
    # sql_column_buffers    = content=12M, comments=1M
 
 
    # pre-query, executed before the main fetch query
    # multi-value, optional, default is empty list of queries
    #
    # if your data is utf-8, uncomment this (done on the line below)
    # sql_query_pre     = SET NAMES utf8
    sql_query_pre       = SET NAMES utf8
    # sql_query_pre     = SET SESSION query_cache_type=OFF
 
 
    # main document fetch query
    # mandatory, integer document ID field MUST be the first selected column
    sql_query       = \
        SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
        FROM documents
 
 
    # joined/payload field fetch query
    # joined fields let you avoid (slow) JOIN and GROUP_CONCAT
    # payload fields let you attach custom per-keyword values (eg. for ranking)
    #
    # syntax is FIELD-NAME ‘from‘  ( ‘query‘ | ‘payload-query‘ ); QUERY
    # joined field QUERY should return 2 columns (docid, text)
    # payload field QUERY should return 3 columns (docid, keyword, weight)
    #
    # REQUIRES that query results are in ascending document ID order!
    # multi-value, optional, default is empty list of queries
    #
    # sql_joined_field  = tags from query; SELECT docid, CONCAT(‘tag‘,tagid) FROM tags ORDER BY docid ASC
    # sql_joined_field  = wtags from payload-query; SELECT docid, tag, tagweight FROM tags ORDER BY docid ASC
 
 
    # file based field declaration
    #
    # content of this field is treated as a file name
    # and the file gets loaded and indexed in place of a field
    #
    # max file size is limited by max_file_field_buffer indexer setting
    # file IO errors are non-fatal and get reported as warnings
    #
    # sql_file_field        = content_file_path
 
 
    # range query setup, query that must return min and max ID values
    # optional, default is empty
    #
    # sql_query will need to reference $start and $end boundaries
    # if using ranged query:
    #
    # sql_query     = \
    #   SELECT doc.id, doc.id AS group, doc.title, doc.data \
    #   FROM documents doc \
    #   WHERE id>=$start AND id<=$end
    #
    # sql_query_range       = SELECT MIN(id),MAX(id) FROM documents
 
 
    # range query step
    # optional, default is 1024
    #
    # sql_range_step        = 1000
 
 
    # unsigned integer attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # optional bit size can be specified, default is 32
    #
    # sql_attr_uint     = author_id
    # sql_attr_uint     = forum_id:9 # 9 bits for forum_id
    sql_attr_uint       = group_id
 
    # boolean attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # equivalent to sql_attr_uint with 1-bit size
    #
    # sql_attr_bool     = is_deleted
 
 
    # bigint attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # declares a signed (unlike uint!) 64-bit attribute
    #
    # sql_attr_bigint       = my_bigint_id
 
 
    # UNIX timestamp attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # similar to integer, but can also be used in date functions
    #
    # sql_attr_timestamp    = posted_ts
    # sql_attr_timestamp    = last_edited_ts
    sql_attr_timestamp  = date_added
 
    # string ordinal attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # sorts strings (bytewise), and stores their indexes in the sorted list
    # sorting by this attr is equivalent to sorting by the original strings
    #
    # sql_attr_str2ordinal  = author_name
 
 
    # floating point attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # values are stored in single precision, 32-bit IEEE 754 format
    #
    # sql_attr_float        = lat_radians
    # sql_attr_float        = long_radians
 
 
    # multi-valued attribute (MVA) attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # MVA values are variable length lists of unsigned 32-bit integers
    #
    # syntax is ATTR-TYPE ATTR-NAME ‘from‘ SOURCE-TYPE [;QUERY] [;RANGE-QUERY]
    # ATTR-TYPE is ‘uint‘ or ‘timestamp‘
    # SOURCE-TYPE is ‘field‘, ‘query‘, or ‘ranged-query‘
    # QUERY is SQL query used to fetch all ( docid, attrvalue ) pairs
    # RANGE-QUERY is SQL query used to fetch min and max ID values, similar to ‘sql_query_range‘
    #
    # sql_attr_multi        = uint tag from query; SELECT docid, tagid FROM tags
    # sql_attr_multi        = uint tag from ranged-query; \
    #   SELECT docid, tagid FROM tags WHERE id>=$start AND id<=$end; \
    #   SELECT MIN(docid), MAX(docid) FROM tags
 
 
    # string attribute declaration
    # multi-value (an arbitrary number of these is allowed), optional
    # lets you store and retrieve strings
    #
    # sql_attr_string       = stitle
 
 
    # wordcount attribute declaration
    # multi-value (an arbitrary number of these is allowed), optional
    # lets you count the words at indexing time
    #
    # sql_attr_str2wordcount    = stitle
 
 
    # combined field plus attribute declaration (from a single column)
    # stores column as an attribute, but also indexes it as a full-text field
    #
    # sql_field_string  = author
    # sql_field_str2wordcount   = title
 
     
    # post-query, executed on sql_query completion
    # optional, default is empty
    #
    # sql_query_post        =
 
     
    # post-index-query, executed on successful indexing completion
    # optional, default is empty
    # $maxid expands to max document ID actually fetched from DB
    #
    # sql_query_post_index  = REPLACE INTO counters ( id, val ) \
    #   VALUES ( ‘max_indexed_id‘, $maxid )
 
 
    # ranged query throttling, in milliseconds
    # optional, default is 0 which means no delay
    # enforces given delay before each query step
    sql_ranged_throttle = 0
 
    # document info query, ONLY for CLI search (ie. testing and debugging)
    # optional, default is empty
    # must contain $id macro and must fetch the document by that id
    sql_query_info      = SELECT * FROM documents WHERE id=$id
 
    # kill-list query, fetches the document IDs for kill-list
    # k-list will suppress matches from preceding indexes in the same query
    # optional, default is empty
    #
    # sql_query_killlist    = SELECT id FROM documents WHERE edited>=@last_reindex
 
 
    # columns to unpack on indexer side when indexing
    # multi-value, optional, default is empty list
    #
    # unpack_zlib       = zlib_column
    # unpack_mysqlcompress  = compressed_column
    # unpack_mysqlcompress  = compressed_column_2
 
 
    # maximum unpacked length allowed in MySQL COMPRESS() unpacker
    # optional, default is 16M
    #
    # unpack_mysqlcompress_maxsize  = 16M
 
 
    #####################################################################
    ## xmlpipe2 settings
    #####################################################################
 
    # type          = xmlpipe
 
    # shell command to invoke xmlpipe stream producer
    # mandatory
    #
    # xmlpipe_command       = cat @CONFDIR@/test.xml
 
    # xmlpipe2 field declaration
    # multi-value, optional, default is empty
    #
    # xmlpipe_field     = subject
    # xmlpipe_field     = content
 
 
    # xmlpipe2 attribute declaration
    # multi-value, optional, default is empty
    # all xmlpipe_attr_XXX options are fully similar to sql_attr_XXX
    #
    # xmlpipe_attr_timestamp    = published
    # xmlpipe_attr_uint = author_id
 
 
    # perform UTF-8 validation, and filter out incorrect codes
    # avoids XML parser choking on non-UTF-8 documents
    # optional, default is 0
    #
    # xmlpipe_fixup_utf8    = 1
}
 
 
# inherited source example
#
# all the parameters are copied from the parent source,
# and may then be overridden in this source definition
source src1throttled : src1
{
    sql_ranged_throttle = 100
}
 
#############################################################################
## index definition
#############################################################################
 
# local index example
#
# this is an index which is stored locally in the filesystem
#
# all indexing-time options (such as morphology and charsets)
# are configured per local index
index test1
{
    # index type
    # optional, default is ‘plain‘
    # known values are ‘plain‘, ‘distributed‘, and ‘rt‘ (see samples below)
    # type          = plain
 
    # document source(s) to index
    # multi-value, mandatory
    # document IDs must be globally unique across all sources
    source          = src1
 
    # index files path and file name, without extension
    # mandatory, path must be writable, extensions will be auto-appended
    #path           = @CONFDIR@/data/test1
    # directory for the index files
    path = E:/webserver/sphinx/data
    # document attribute values (docinfo) storage mode
    # optional, default is ‘extern‘
    # known values are ‘none‘, ‘extern‘ and ‘inline‘
    docinfo         = extern
 
    # memory locking for cached data (.spa and .spi), to prevent swapping
    # optional, default is 0 (do not mlock)
    # requires searchd to be run from root
    mlock           = 0
 
    # a list of morphology preprocessors to apply
    # optional, default is empty
    #
    # builtin preprocessors are ‘none‘, ‘stem_en‘, ‘stem_ru‘, ‘stem_enru‘,
    # ‘soundex‘, and ‘metaphone‘; additional preprocessors available from
    # libstemmer are ‘libstemmer_XXX‘, where XXX is algorithm code
    # (see libstemmer_c/libstemmer/modules.txt)
    #
    # morphology        = stem_en, stem_ru, soundex
    # morphology        = libstemmer_german
    # morphology        = libstemmer_sv
    morphology      = none
 
    # minimum word length at which to enable stemming
    # optional, default is 1 (stem everything)
    #
    # min_stemming_len  = 1
 
 
    # stopword files list (space separated)
    # optional, default is empty
    # contents are plain text, charset_table and stemming are both applied
    #
    # stopwords     = @CONFDIR@/data/stopwords.txt
 
 
    # wordforms file, in "mapfrom > mapto" plain text format
    # optional, default is empty
    #
    # wordforms     = @CONFDIR@/data/wordforms.txt
 
 
    # tokenizing exceptions file
    # optional, default is empty
    #
    # plain text, case sensitive, space insensitive in map-from part
    # one "Map Several Words => ToASingleOne" entry per line
    #
    # exceptions        = @CONFDIR@/data/exceptions.txt
 
 
    # minimum indexed word length
    # default is 1 (index everything)
    min_word_len        = 1
 
    # charset encoding type
    # optional, default is ‘sbcs‘
    # known types are 'sbcs' (Single Byte CharSet) and 'utf-8'
    # encoding
    #charset_type       = sbcs
    charset_type     = utf-8
     
    # charset definition and case folding rules "table"
    # optional, default value depends on charset_type
    #
    # defaults are configured to include English and Russian characters only
    # you need to change the table to include additional ones
    # this behavior MAY change in future versions
    #
    # 'sbcs' default value is
    # charset_table     = 0..9, A..Z->a..z, _, a..z, U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF
    #
    # 'utf-8' default value is
    # charset_table     = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
    # use the utf-8 table here, to match charset_type above
    charset_table       = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
 
 
    # ignored characters list
    # optional, default value is empty
    #
    # ignore_chars      = U+00AD
 
 
    # minimum word prefix length to index
    # optional, default is 0 (do not index prefixes)
    #
    # min_prefix_len        = 0
 
 
    # minimum word infix length to index
    # optional, default is 0 (do not index infixes)
    #
    # min_infix_len     = 0
 
 
    # list of fields to limit prefix/infix indexing to
    # optional, default value is empty (index all fields in prefix/infix mode)
    #
    # prefix_fields     = filename
    # infix_fields      = url, domain
 
 
    # enable star-syntax (wildcards) when searching prefix/infix indexes
    # search-time only, does not affect indexing, can be 0 or 1
    # optional, default is 0 (do not use wildcard syntax)
    #
    # enable_star       = 1
 
 
    # expand keywords with exact forms and/or stars when searching fit indexes
    # search-time only, does not affect indexing, can be 0 or 1
    # optional, default is 0 (do not expand keywords)
    #
    # expand_keywords       = 1
 
     
    # n-gram length to index, for CJK indexing
    # only supports 0 and 1 for now, other lengths to be implemented
    # optional, default is 0 (disable n-grams)
    # simple CJK segmentation; only 0 and 1 are supported; set to 1 to search Chinese
    # ngram_len     = 1
    ngram_len       = 1
 
    # n-gram characters list, for CJK indexing
    # optional, default is empty
    #
    # characters treated as CJK for n-gram indexing; needed for Chinese search
    # ngram_chars       = U+3000..U+2FA1F
    ngram_chars     = U+3000..U+2FA1F
 
    # phrase boundary characters list
    # optional, default is empty
    #
    # phrase_boundary       = ., ?, !, U+2026 # horizontal ellipsis
 
 
    # phrase boundary word position increment
    # optional, default is 0
    #
    # phrase_boundary_step  = 100
 
 
    # blended characters list
    # blended chars are indexed both as separators and valid characters
    # for instance, AT&T will result in 3 tokens ("at", "t", and "at&t")
    # optional, default is empty
    #
    # blend_chars       = +, &, U+23
 
 
    # blended token indexing mode
    # a comma separated list of blended token indexing variants
    # known variants are trim_none, trim_head, trim_tail, trim_both, skip_pure
    # optional, default is trim_none
    #
    # blend_mode        = trim_tail, skip_pure
 
 
    # whether to strip HTML tags from incoming documents
    # known values are 0 (do not strip) and 1 (do strip)
    # optional, default is 0
    html_strip      = 0
 
    # what HTML attributes to index if stripping HTML
    # optional, default is empty (do not index anything)
    #
    # html_index_attrs  = img=alt,title; a=title;
 
 
    # what HTML elements contents to strip
    # optional, default is empty (do not strip element contents)
    #
    # html_remove_elements  = style, script
 
 
    # whether to preopen index data files on startup
    # optional, default is 0 (do not preopen), searchd-only
    #
    # preopen           = 1
 
 
    # whether to keep dictionary (.spi) on disk, or cache it in RAM
    # optional, default is 0 (cache in RAM), searchd-only
    #
    # ondisk_dict       = 1
 
 
    # whether to enable in-place inversion (2x less disk, 90-95% speed)
    # optional, default is 0 (use separate temporary files), indexer-only
    #
    # inplace_enable        = 1
 
 
    # in-place fine-tuning options
    # optional, defaults are listed below
    #
    # inplace_hit_gap       = 0 # preallocated hitlist gap size
    # inplace_docinfo_gap   = 0 # preallocated docinfo gap size
    # inplace_reloc_factor  = 0.1 # relocation buffer size within arena
    # inplace_write_factor  = 0.1 # write buffer size within arena
 
 
    # whether to index original keywords along with stemmed versions
    # enables "=exactform" operator to work
    # optional, default is 0
    #
    # index_exact_words = 1
 
 
    # position increment on overshort (less than min_word_len) words
    # optional, allowed values are 0 and 1, default is 1
    #
    # overshort_step        = 1
 
 
    # position increment on stopword
    # optional, allowed values are 0 and 1, default is 1
    #
    # stopword_step     = 1
 
 
    # hitless words list
    # positions for these keywords will not be stored in the index
    # optional, allowed values are ‘all‘, or a list file name
    #
    # hitless_words     = all
    # hitless_words     = hitless.txt
 
 
    # detect and index sentence and paragraph boundaries
    # required for the SENTENCE and PARAGRAPH operators to work
    # optional, allowed values are 0 and 1, default is 0
    #
    # index_sp          = 1
 
 
    # index zones, delimited by HTML/XML tags
    # a comma separated list of tags and wildcards
    # required for the ZONE operator to work
    # optional, default is empty string (do not index zones)
    #
    # index_zones       = title, h*, th
}
 
 
# inherited index example
#
# all the parameters are copied from the parent index,
# and may then be overridden in this index definition
#index test1stemmed : test1
#{
#   path            = @CONFDIR@/data/test1stemmed
#   morphology      = stem_en
#}
 
 
# distributed index example
#
# this is a virtual index which can NOT be directly indexed,
# and only contains references to other local and/or remote indexes
#index dist1
#{
#   # ‘distributed‘ index type MUST be specified
#   type            = distributed
#
#   # local index to be searched
#   # there can be many local indexes configured
#   local           = test1
#   local           = test1stemmed
#
#   # remote agent
#   # multiple remote agents may be specified
#   # syntax for TCP connections is ‘hostname:port:index1,[index2[,...]]‘
#   # syntax for local UNIX connections is ‘/path/to/socket:index1,[index2[,...]]‘
#   agent           = localhost:9313:remote1
#   agent           = localhost:9314:remote2,remote3
#   # agent         = /var/run/searchd.sock:remote4
#
#   # blackhole remote agent, for debugging/testing
#   # network errors and search results will be ignored
#   #
#   # agent_blackhole       = testbox:9312:testindex1,testindex2
#
#
#   # remote agent connection timeout, milliseconds
#   # optional, default is 1000 ms, ie. 1 sec
#   agent_connect_timeout   = 1000
#
#   # remote agent query timeout, milliseconds
#   # optional, default is 3000 ms, ie. 3 sec
#   agent_query_timeout = 3000
#}
 
 
# realtime index example
#
# you can run INSERT, REPLACE, and DELETE on this index on the fly
# using MySQL protocol (see ‘listen‘ directive below)
index rt
{
    # ‘rt‘ index type must be specified to use RT index
    type            = rt
 
    # index files path and file name, without extension
    # mandatory, path must be writable, extensions will be auto-appended
    #path           = @CONFDIR@/data/rt
    path            = E:/webserver/sphinx/data/rt
 
 
    # RAM chunk size limit
    # RT index will keep at most this much data in RAM, then flush to disk
    # optional, default is 32M
    #
    # rt_mem_limit      = 512M
 
    # full-text field declaration
    # multi-value, mandatory
    rt_field        = title
    rt_field        = content
 
    # unsigned integer attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # declares an unsigned 32-bit attribute
    rt_attr_uint        = gid
 
    # RT indexes currently support the following attribute types:
    # uint, bigint, float, timestamp, string
    #
    # rt_attr_bigint        = guid
    # rt_attr_float     = gpa
    # rt_attr_timestamp = ts_added
    # rt_attr_string        = author
}
 
#############################################################################
## indexer settings
#############################################################################
 
indexer
{
    # memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
    # optional, default is 32M, max is 2047M, recommended is 256M to 1024M
    mem_limit       = 32M
 
    # maximum IO calls per second (for I/O throttling)
    # optional, default is 0 (unlimited)
    #
    # max_iops      = 40
 
 
    # maximum IO call size, bytes (for I/O throttling)
    # optional, default is 0 (unlimited)
    #
    # max_iosize        = 1048576
 
 
    # maximum xmlpipe2 field length, bytes
    # optional, default is 2M
    #
    # max_xmlpipe2_field    = 4M
 
 
    # write buffer size, bytes
    # several (currently up to 4) buffers will be allocated
    # write buffers are allocated in addition to mem_limit
    # optional, default is 1M
    #
    # write_buffer      = 1M
 
 
    # maximum file field adaptive buffer size
    # optional, default is 8M, minimum is 1M
    #
    # max_file_field_buffer = 32M
}
 
# settings to change for the search daemon
#############################################################################
## searchd settings
#############################################################################
 
searchd
{
    # [hostname:]port[:protocol], or /unix/socket/path to listen on
    # known protocols are ‘sphinx‘ (SphinxAPI) and ‘mysql41‘ (SphinxQL)
    #
    # multi-value, multiple listen points are allowed
    # optional, defaults are 9312:sphinx and 9306:mysql41, as below
    #
    # listen            = 127.0.0.1
    # listen            = 192.168.0.1:9312
    # listen            = 9312
    # listen            = /var/run/searchd.sock
    listen          = 9312
    listen          = 9306:mysql41
 
    # log file, searchd run info is logged here
    # optional, default is ‘searchd.log‘
    # log file
    #log            = @CONFDIR@/log/searchd.log
 
    log        = E:/webserver/sphinx/log/searchd.log
    # query log file, all search queries are logged here
    # optional, default is empty (do not log queries)
    #query_log      = @CONFDIR@/log/query.log
    query_log       = E:/webserver/sphinx/log/query.log
    # client read timeout, seconds
    # optional, default is 5
    read_timeout        = 5
 
    # request timeout, seconds
    # optional, default is 5 minutes
    client_timeout      = 300
 
    # maximum amount of children to fork (concurrent searches to run)
    # optional, default is 0 (unlimited)
    max_children        = 30
 
    # PID file, searchd process ID file name
    # mandatory
    #pid_file       = @CONFDIR@/log/searchd.pid
    pid_file    =  E:/webserver/sphinx/log/searchd.pid
 
 
    # max amount of matches the daemon ever keeps in RAM, per-index
    # WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL
    # default is 1000 (just like Google)
    max_matches     = 1000
 
    # seamless rotate, prevents rotate stalls if precaching huge datasets
    # optional, default is 1
    # on Windows this MUST stay commented out or searchd will not start
    #seamless_rotate        = 1
 
    # whether to forcibly preopen all indexes on startup
    # optional, default is 1 (preopen everything)
    preopen_indexes     = 1
 
    # whether to unlink .old index copies on successful rotation.
    # optional, default is 1 (do unlink)
    unlink_old      = 1
 
    # attribute updates periodic flush timeout, seconds
    # updates will be automatically dumped to disk this frequently
    # optional, default is 0 (disable periodic flush)
    #
    # attr_flush_period = 900
 
 
    # instance-wide ondisk_dict defaults (per-index value take precedence)
    # optional, default is 0 (precache all dictionaries in RAM)
    #
    # ondisk_dict_default   = 1
 
 
    # MVA updates pool size
    # shared between all instances of searchd, disables attr flushes!
    # optional, default size is 1M
    mva_updates_pool    = 1M
 
    # max allowed network packet size
    # limits both query packets from clients, and responses from agents
    # optional, default size is 8M
    max_packet_size     = 8M
 
    # crash log path
    # searchd will (try to) log crashed query to ‘crash_log_path.PID‘ file
    # optional, default is empty (do not create crash logs)
    #
    # crash_log_path        = @CONFDIR@/log/crash
 
 
    # max allowed per-query filter count
    # optional, default is 256
    max_filters     = 256
 
    # max allowed per-filter values count
    # optional, default is 4096
    max_filter_values   = 4096
 
 
    # socket listen queue length
    # optional, default is 5
    #
    # listen_backlog        = 5
 
 
    # per-keyword read buffer size
    # optional, default is 256K
    #
    # read_buffer       = 256K
 
 
    # unhinted read size (currently used when reading hits)
    # optional, default is 32K
    #
    # read_unhinted     = 32K
 
 
    # max allowed per-batch query count (aka multi-query count)
    # optional, default is 32
    max_batch_queries   = 32
 
 
    # max common subtree document cache size, per-query
    # optional, default is 0 (disable subtree optimization)
    #
    # subtree_docs_cache    = 4M
 
 
    # max common subtree hit cache size, per-query
    # optional, default is 0 (disable subtree optimization)
    #
    # subtree_hits_cache    = 8M
 
 
    # multi-processing mode (MPM)
    # known values are none, fork, prefork, and threads
    # optional, default is fork
    #
    workers         = threads # for RT to work
 
 
    # max threads to create for searching local parts of a distributed index
    # optional, default is 0, which means disable multi-threaded searching
    # should work with all MPMs (ie. does NOT require workers=threads)
    #
    # dist_threads      = 4
 
 
    # binlog files path; use empty string to disable binlog
    # optional, default is build-time configured data directory
    #
    # binlog_path       = # disable logging
    # binlog_path       = @CONFDIR@/data # binlog.001 etc will be created there
 
 
    # binlog flush/sync mode
    # 0 means flush and sync every second
    # 1 means flush and sync every transaction
    # 2 means flush every transaction, sync every second
    # optional, default is 2
    #
    # binlog_flush      = 2
 
 
    # binlog per-file size limit
    # optional, default is 128M, 0 means no limit
    #
    # binlog_max_log_size   = 256M
 
 
    # per-thread stack size, only affects workers=threads mode
    # optional, default is 64K
    #
    # thread_stack          = 128K
 
 
    # per-keyword expansion limit (for dict=keywords prefix searches)
    # optional, default is 0 (no limit)
    #
    # expansion_limit       = 1000
 
 
    # RT RAM chunks flush period
    # optional, default is 0 (no periodic flush)
    #
    # rt_flush_period       = 900
 
 
    # query log file format
    # optional, known values are plain and sphinxql, default is plain
    #
    # query_log_format      = sphinxql
 
 
    # version string returned to MySQL network protocol clients
    # optional, default is empty (use Sphinx version)
    #
    # mysql_version_string  = 5.0.37
 
 
    # trusted plugin directory
    # optional, default is empty (disable UDFs)
    #
    # plugin_dir            = /usr/local/sphinx/lib
 
 
    # default server-wide collation
    # optional, default is libc_ci
    #
    # collation_server      = utf8_general_ci
 
 
    # server-wide locale for libc based collations
    # optional, default is C
    #
    # collation_libc_locale = ru_RU.UTF-8
 
 
    # threaded server watchdog (only used in workers=threads mode)
    # optional, values are 0 and 1, default is 1 (watchdog on)
    #
    # watchdog              = 1
 
     
    # SphinxQL compatibility mode (legacy columns and their names)
    # optional, default is 1 (old-style)
    #
    # compat_sphinxql_magics    = 1
}
 
# --eof--

If you want real Chinese word segmentation for search, read on.

Using CORESEEK for word segmentation:

1. Download from http://www.coreseek.cn/products/ft_down/ (Coreseek 3.2.13 win32 is enough; I'm on 64-bit Windows 8, and the 32-bit build runs fine via backward compatibility).

2. Install the software packages the system depends on.

The core components require the following packages:
– ActivePython 2.5 (http://www.activestate.com/Products/activepython/)
– MySQL_Python 1.2.2 (http://sourceforge.net/project/showfiles.php?group_id=22307) (verified: optional)

With those two components installed the system will run, but the configuration file has to be edited by hand.

The configuration GUI additionally requires:
– gtk-dev 2.12.9 (http://sourceforge.net/project/showfiles.php?group_id=98754) (verified: optional)
– pycairo 1.4.12 (http://ftp.acc.umu.se/pub/GNOME/binaries/win32/pycairo/1.4/)
– pygobject 2.14.1 (http://ftp.acc.umu.se/pub/GNOME/binaries/win32/pygobject/2.14/)
– pygtk 2.12.1 (http://ftp.acc.umu.se/pub/GNOME/binaries/win32/pygtk/2.12)

If you downloaded the full package, all of the files above can be found in the preq subdirectory.
Install all of the packages listed (note: Python and GTK must be installed first).
Note: it must be ActivePython; the official Python build lacks the Win32 extensions the system needs and will not work.
Note: you must restart your computer after completing this step.

3. Extract csft to a directory of your choice.

4. The csft configuration file is largely the same as Sphinx's (see sphinx+mysql (1) and (2) for configuration details).

5. Create the dictionary file:

\bin\mmseg -u \data\unigram.txt # the lexicon is dynamic; just point at the right directory

Rename the generated file to uni.lib.
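Concretely, assuming csft was extracted to C:\csft (that path is an assumption, as is the unigram.txt.uni output name mmseg typically writes next to the source lexicon), the sequence would look like:

C:\csft\bin>mmseg -u C:\csft\data\unigram.txt
C:\csft\data>ren unigram.txt.uni uni.lib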

6. Import the sample.sql database.

7. Build the index: indexer.exe --all (see sphinx + mysql (1) for details).

----------------------------------------------------------------------
The two alternative branches are described below:
----------------------------------------------------------------------

A:

8. Install SPHINXSE FOR MYSQL

http://www.sphinxsearch.com/downloads/mysql-5.0.45-sphinxse-0.9.8-win32.zip

Download it, extract, and overwrite your MySQL directory; that's all. (Note: the MySQL versions must match exactly.)

Enter mysql and run show engines; to check whether the sphinx engine type is present.

9. Create a table using the Sphinx storage engine

CREATE TABLE `sphinx` (
  `id` int(11) NOT NULL,
  `weight` int(11) NOT NULL,
  `query` varchar(255) NOT NULL,
  `group_id` int(11) NOT NULL,
  KEY `Query` (`query`)
) ENGINE=SPHINX CONNECTION='sphinx://localhost:3312/test1';

Unlike an ordinary MySQL table, the ENGINE=SPHINX CONNECTION='sphinx://localhost:3312/test1' clause tells MySQL to use the SPHINXSE engine for this table; the connection string is 'sphinx://localhost:3312/test1', where test1 is the index name.

According to the official Sphinx documentation, this table must have at least three columns. Their names are arbitrary, but the types must appear in the order integer, integer, varchar, holding the document ID, the match weight, and the query string respectively, and the query column must be indexed. The table may declare further columns, which must be of integer or TIMESTAMP type; these are bound to the Sphinx result set, so their names must match attribute names defined in sphinx.conf, otherwise they come back as NULL.
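As a sketch of how this table is queried in practice (PHP with mysqli; the documents table and test1 index come from the earlier examples, and the host/credentials are assumptions), note that SphinxSE accepts query options after a semicolon in the query string, e.g. mode=any:

<?php
// minimal sketch: search via the SphinxSE `sphinx` table created above,
// then join back to the source `documents` table for the actual rows
$db = new mysqli('localhost', 'root', '', 'test'); // assumed credentials

// SphinxSE takes options after ';' inside the query string (here: match mode)
$sql = "SELECT s.id, s.weight, d.title
        FROM sphinx AS s JOIN documents AS d ON d.id = s.id
        WHERE s.query = '中文;mode=any'";

if ($res = $db->query($sql)) {
    while ($row = $res->fetch_assoc()) {
        echo "{$row['id']} (weight {$row['weight']}): {$row['title']}\n";
    }
}
?>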

10. Querying the MySQL SphinxSE full-text storage engine with SQL

After installing the SphinxSE storage engine, first create a special search table declared with ENGINE=SPHINX, as follows:

CREATE TABLE ArticleFulltext (
  ID INTEGER NOT NULL,
  Weight INTEGER NOT NULL,
  Query VARCHAR(3072) NOT NULL,
  INDEX (Query)
) ENGINE=SPHINX CONNECTION="sphinx://localhost:3312/test";

The table and column names are arbitrary, but the first three columns must have types INT, INT, and VARCHAR. Additional columns of type INT or TIMESTAMP may be declared; their names must match the Sphinx configuration file, and they return extra information about each hit.

Once the table exists, you can run full-text searches from MySQL like this:

SELECT * FROM ArticleFulltext WHERE Query='full-text search condition';

The rows returned are the search results: document ID, weight, and, if ArticleFulltext declares more attributes, extra information about each hit.

A SQL join then makes fused retrieval easy, for example:

SELECT ID, Title
FROM Article, ArticleFulltext
WHERE ArticleFulltext.ID = Article.ID AND Query = '博客'
  AND PublishTime > '2007-03-01' AND ReferCount > 0
ORDER BY Weight * 0.5 + ReferCount * 0.5;

This query retrieves articles published since March 1, 2007 that contain the keyword '博客' (blog) and have been cited at least once, ranked by a blend of full-text weight and citation count.

As you can see, embedding a full-text engine into MySQL as a storage engine makes fused retrieval very convenient; despite the functional limitations, it is a clever and handy approach.

11. Register Sphinx as a Windows service:

searchd --install --config "csft.conf"

12. Start the service: net start searchd (or whatever service name you chose).
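For reference, a fuller invocation; --servicename is a standard searchd Windows switch, and the paths assume csft was extracted to C:\csft:

C:\csft\bin>searchd --install --config C:\csft\bin\csft.conf --servicename SphinxSearch
C:\csft\bin>net start SphinxSearch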

---------------------------------------------------------------

B:

Configure the sphinx.conf file to support Chinese:

charset_type = zh_cn.utf-8

charset_dictpath = D:\csft3.1\bin # directory containing the segmentation lexicon (uni.lib)

min_infix_len = 0
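Put together, a minimal Coreseek index block might look like the sketch below (the paths are assumptions based on the extraction directory; with mmseg doing the segmentation, the ngram_len/ngram_chars settings from the pure-Sphinx setup are not used):

index test1
{
    source           = src1
    path             = C:/csft/data/test1
    charset_type     = zh_cn.utf-8  # Coreseek's Chinese tokenizer
    charset_dictpath = C:/csft/bin  # directory that holds uni.lib
    min_infix_len    = 0
}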

Everything above was actually installed and tested; not every screenshot is included. For reference only.


Original article: http://www.cnblogs.com/yueryuermaomao/p/4658482.html
