
Full-Text Index 5: Fundamental Components

Date: 2016-06-26 19:49:04


In SQL Server 2012, full-text search provides fast lookups of single terms or phrases. It consists of the following fundamental components:

1. Service: SQL Full-text Filter Daemon Launcher

Service to launch full-text filter daemon process which will perform document filtering and word breaking for SQL Server full-text search. Disabling this service will make full-text search features of SQL Server unavailable.

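2. Word Breaker

A word breaker applies the lexical rules of a language to find word boundaries and so identify the words in a sentence. While splitting the text, it also records the position of each word in the string.

For example, the sentence "Kitty is a cute cat." is split into five words: Kitty, is, a, cute, cat. The positions of "Kitty" and "cat" are 1 and 5. Because the full-text index stores these positions, it can match terms that lie within a given distance of each other. The predicate contains(Column_Name, 'near((Kitty,cat),3)') means that the two words "Kitty" and "cat" must both occur, with at most 3 other terms between them. It returns the rows whose Column_Name contains such text, and the string "Kitty is a cute cat." satisfies the condition.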
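
The proximity search above can be tried end to end with a sketch like the following. The table dbo.Pets, its columns, the key index PK_Pets, and the catalog ft_demo are hypothetical names chosen for illustration, not objects from the original article.

```sql
-- Hypothetical demo objects; adjust the names to your own database.
CREATE TABLE dbo.Pets
(
    PetID int IDENTITY(1,1) NOT NULL,
    Note  nvarchar(200) NOT NULL,
    CONSTRAINT PK_Pets PRIMARY KEY (PetID)
);

CREATE FULLTEXT CATALOG ft_demo;

-- A full-text index needs a unique, single-column, non-nullable key index.
CREATE FULLTEXT INDEX ON dbo.Pets (Note LANGUAGE 1033)
    KEY INDEX PK_Pets ON ft_demo
    WITH CHANGE_TRACKING AUTO;

INSERT INTO dbo.Pets (Note) VALUES (N'Kitty is a cute cat.');

-- Match rows where "Kitty" and "cat" are separated by at most 3 other terms.
SELECT PetID, Note
FROM dbo.Pets
WHERE CONTAINS(Note, N'NEAR((Kitty, cat), 3)');
```

The index is populated asynchronously, so a query issued immediately after the insert may return no rows until population has finished.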

3. StopList

A list of stopwords, that is, words that are not indexed.

4. Stemmer and Thesaurus

A stemmer extracts the root form of a given word.

A thesaurus is a dictionary of synonyms.

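II. Word Breaker

A word breaker splits the string stored in a column into individual words at delimiters.

1. Use the sys.dm_fts_parser dynamic management function (DMF) to view the result of splitting a string.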

sys.dm_fts_parser(query_string, lcid, stoplist_id, accent_sensitivity)

Returns the final tokenization result after applying a given word breaker, thesaurus, and stoplist combination to a query string input. The tokenization result is equivalent to the output of the Full-Text Engine for the specified query string.

stoplist_id

ID of the stoplist, if any, to be used by the word breaker identified by lcid. stoplist_id is int. If you specify NULL, no stoplist is used. If you specify 0, the system STOPLIST is used.

For example, to view the words that the string "Kitty is a cute cat" is split into:

select *
from sys.dm_fts_parser(N'"Kitty is a cute cat"',1033,0,0) as p;

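display_term is the word produced by the split, and keyword is its hexadecimal representation; the two columns present the same term in different ways.

occurrence: the position of each word after the string is split. It indicates the order of each term in the parsing result.

special_term: a value of Noise Word means that the term is a stopword in the stoplist; Exact Match means that the term is an ordinary word produced by the split.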
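
To see the effect of the stoplist_id argument on these columns, the same phrase can be parsed twice, once with the system stoplist and once with no stoplist. This is a minimal sketch; note that sys.dm_fts_parser requires membership in the sysadmin fixed server role.

```sql
-- stoplist_id = 0: the system stoplist is applied, so common words
-- such as "is" and "a" are reported with special_term = 'Noise Word'.
SELECT p.occurrence, p.display_term, p.special_term
FROM sys.dm_fts_parser(N'"Kitty is a cute cat"', 1033, 0, 0) AS p
ORDER BY p.occurrence;

-- stoplist_id = NULL: no stoplist is applied, so no term is flagged
-- as a noise word.
SELECT p.occurrence, p.display_term, p.special_term
FROM sys.dm_fts_parser(N'"Kitty is a cute cat"', 1033, NULL, 0) AS p
ORDER BY p.occurrence;
```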


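III. StopList

A stoplist is a list of stopwords. Stopwords are common words such as "a" and "and" that are not useful to search for. When it builds a full-text index, SQL Server discards the words that are in the stoplist, which keeps the index from growing too large.

See "Configure and Manage Stopwords and Stoplists for Full-Text Search":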

To prevent a full-text index from becoming bloated, SQL Server has a mechanism that discards commonly occurring strings that do not help the search. These discarded strings are called stopwords. During index creation, the Full-Text Engine omits stopwords from the full-text index. This means that full-text queries will not search on stopwords.

1. Understanding Stopwords and Stoplists

A stopword can be a word with meaning in a specific language, or it can be a token that does not have linguistic meaning. For example, in the English language, words such as "a," "and," "is," and "the" are left out of the full-text index since they are known to be useless to a search. 

Although it ignores the inclusion of stopwords, the full-text index does take into account their position. For example, consider the phrase, "Instructions are applicable to these Adventure Works Cycles models". The following table depicts the position of the words in the phrase:

Word          Position
Instructions  1
are           2
applicable    3
to            4
these         5
Adventure     6
Works         7
Cycles        8
models        9

The stopwords "are", "to", and "these" that are in positions 2, 4, and 5 are left out of the full-text index. However, their positional information is maintained, thereby leaving the position of the other words in the phrase unaffected.

Stopwords are managed in databases using objects called stoplists. A stoplist is a list of stopwords that, when associated with a full-text index, is applied to full-text queries on that index.

2. Create a stoplist and add stopwords to it

Syntax for creating a stoplist and for adding stopwords:

CREATE FULLTEXT STOPLIST stoplist_name
[ FROM { [ database_name.]source_stoplist_name } | SYSTEM STOPLIST ]
[ AUTHORIZATION owner_name ]
;
ALTER FULLTEXT STOPLIST stoplist_name
{ 
   ADD [N] 'stopword' LANGUAGE language_term  
  | DROP 
    {
        'stopword' LANGUAGE language_term 
      | ALL LANGUAGE language_term 
      | ALL
     }
};

Example: create a stoplist and add a stopword to it:

create fulltext stoplist stop_list_test;

alter fulltext stoplist stop_list_test
add N'cat' language 1033;
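
The other branches of the syntax above can be sketched as follows. The stoplist name stop_list_from_system and the table dbo.Documents are hypothetical; the last statement assumes that dbo.Documents already has a full-text index.

```sql
-- Create a stoplist that starts as a copy of the system stoplist
-- (stop_list_from_system is a hypothetical name).
CREATE FULLTEXT STOPLIST stop_list_from_system FROM SYSTEM STOPLIST;

-- Add a stopword for English (LCID 1033), then remove it again.
ALTER FULLTEXT STOPLIST stop_list_from_system ADD N'cat' LANGUAGE 1033;
ALTER FULLTEXT STOPLIST stop_list_from_system DROP N'cat' LANGUAGE 1033;

-- Associate the stoplist with an existing full-text index
-- (dbo.Documents is a hypothetical, already full-text-indexed table).
ALTER FULLTEXT INDEX ON dbo.Documents SET STOPLIST = stop_list_from_system;
```

A stoplist only affects queries once it is associated with a full-text index, which is what the final statement does.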

3. View the stoplist

Use sys.fulltext_stoplists and sys.fulltext_stopwords to view user-defined stoplists and their stopwords, and use the sys.dm_fts_parser function to view the words after splitting.

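-- find the stoplist and note its stoplist_id
select *
from sys.fulltext_stoplists
where name=N'stop_list_test';

-- 5 is the stoplist_id of stop_list_test in this example; use the value returned above
select *
from sys.fulltext_stopwords
where stoplist_id=5;

-- parse the phrase with the user-defined stoplist applied
select *
from sys.dm_fts_parser(N'"Kitty is a cute cat"',1033,5,0) as p ;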
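
Because stoplist IDs differ from one database to another, it may be safer to look the ID up by name instead of hard-coding it. The following is a sketch that assumes the stoplist stop_list_test created earlier exists in the current database.

```sql
-- Look up the stoplist ID by name instead of hard-coding it.
DECLARE @stoplist_id int =
(
    SELECT stoplist_id
    FROM sys.fulltext_stoplists
    WHERE name = N'stop_list_test'
);

-- List the stopwords that belong to that stoplist.
SELECT stopword, language, language_id
FROM sys.fulltext_stopwords
WHERE stoplist_id = @stoplist_id;

-- Parse the phrase with that stoplist applied.
SELECT p.occurrence, p.display_term, p.special_term
FROM sys.dm_fts_parser(N'"Kitty is a cute cat"', 1033, @stoplist_id, 0) AS p
ORDER BY p.occurrence;
```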


4. View the system stoplist that SQL Server provides. For English, it contains 154 stopwords.

select *
from sys.fulltext_system_stopwords
where language_id=1033
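
To check the stopword count on a particular instance, an aggregate over the same catalog view is enough. The column alias is an arbitrary choice.

```sql
-- Count the system stopwords defined for English (LCID 1033).
SELECT COUNT(*) AS english_stopword_count
FROM sys.fulltext_system_stopwords
WHERE language_id = 1033;
```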


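IV. Stemmer and Thesaurus

1. A stemmer generates the inflectional forms of a word, all of which derive from the same root. For verbs this is conjugation: listing the forms of the verb by number, person, tense, and so on. In a CONTAINS clause, use FORMSOF(INFLECTIONAL, <simple_term>) to apply the stemmer.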

A stemmer takes a word and generates inflectional forms, or conjugations. The example in Books Online, and an easy one to understand, is "run". There are various forms of "run" that we would want to consider as equivalent when performing a search. For example, you would want to consider:

  • ran
  • running
  • runs
  • runner (perhaps)

The same could be said for “lay”. That would generate

  • lie
  • laying
  • lain
  • lays

This is one of the big advantages over the LIKE predicate in that stemmers can match these forms of the word being searched for. The index would relate all of these to the core, base word.
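
The forms that the stemmer actually generates can be inspected with sys.dm_fts_parser, and the same FORMSOF expression can then be used in a CONTAINS predicate. This is a sketch: the table dbo.Documents and its columns DocID and Body are hypothetical and assumed to be full-text indexed.

```sql
-- List the inflectional forms generated for "run".
-- Rows with expansion_type = 2 are inflectional expansions.
SELECT p.display_term, p.expansion_type
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, run)', 1033, 0, 0) AS p;

-- Match any inflectional form of "run" in a full-text indexed column
-- (dbo.Documents, DocID, and Body are hypothetical names).
SELECT DocID, Body
FROM dbo.Documents
WHERE CONTAINS(Body, N'FORMSOF(INFLECTIONAL, run)');
```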

2. A thesaurus is a dictionary of synonyms. For example, "database" and "DB" can be treated as synonyms, as can "Author", "Writer", and "journalist". SQL Server uses XML files to configure the thesaurus.

See "Configure and Manage Thesaurus Files for Full-Text Search":

In SQL Server, full-text queries can search for synonyms of user-specified terms through the use of a thesaurus. A SQL Server thesaurus defines a set of synonyms for a specific language. System administrators can define two forms of synonyms: expansion sets and replacement sets. By developing a thesaurus tailored to your full-text data, you can effectively broaden the scope of full-text queries on that data. Thesaurus matching occurs for all FREETEXT and FREETEXTTABLE queries and for any CONTAINS and CONTAINSTABLE queries that specify the FORMSOF THESAURUS clause.
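
After a thesaurus file has been edited, it has to be reloaded before queries pick up the change. The sketch below assumes that the English thesaurus file already defines an expansion set containing "database" and "DB"; without such an entry, the parser returns only the original term.

```sql
-- Reload the thesaurus file for English (LCID 1033).
EXEC sys.sp_fulltext_load_thesaurus_file 1033;

-- Show how "DB" is expanded by the thesaurus.
-- Rows with expansion_type = 4 come from thesaurus expansion or replacement.
SELECT p.display_term, p.expansion_type
FROM sys.dm_fts_parser(N'FORMSOF(THESAURUS, DB)', 1033, 0, 0) AS p;
```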

 

References:

sys.dm_fts_parser (Transact-SQL)

CREATE FULLTEXT STOPLIST (Transact-SQL)

ALTER FULLTEXT STOPLIST (Transact-SQL)

Configure and Manage Word Breakers and Stemmers for Search


Original article: http://www.cnblogs.com/ljhdo/p/5614054.html
