Fulltext Index Study3:Query

Date: 2016-06-27 00:00:48

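In a query, you can use the CONTAINS predicate to invoke a full-text index and get faster lookups than LIKE provides. CONTAINS supports exact matching of a term, prefix matching of a term, stemming queries based on word roots, synonym queries based on a user-defined thesaurus file, and proximity queries based on the distance and order of neighboring terms. Unlike LIKE, CONTAINS cannot do suffix matching. If a full-text index meets the business requirements, it is a very good choice: compared with LIKE it is faster, and it also supports querying formatted binary data.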

Contains Predicate

Searches for precise or fuzzy (less precise) matches to single words and phrases, words within a certain distance of one another, or weighted matches. CONTAINS is a predicate used in the WHERE clause of a Transact-SQL SELECT statement to perform SQL Server full-text search on full-text indexed columns containing character-based data types.

CONTAINS can search for: 

  • A word or phrase.

  • The prefix of a word or phrase.

  • A word near another word.

  • A word inflectionally generated from another (for example, the word drive is the inflectional stem of drives, drove, driving, and driven).

  • A word that is a synonym of another word using a thesaurus (for example, the word "metal" can have synonyms such as "aluminum" and "steel").

I. CONTAINS syntax

CONTAINS ( 
     { column_name | ( column_list ) | * } 
     , '<contains_search_condition>'
     [ , LANGUAGE language_term ]
   ) 
<contains_search_condition> ::= 
  { 
      <simple_term> 
    | <prefix_term> 
    | <generation_term> 
    | <generic_proximity_term> 
    | <custom_proximity_term> 
    | <weighted_term> 
    } 
  | 
    { ( <contains_search_condition> ) 
        [ { <AND> | <AND NOT> | <OR> } ] 
        <contains_search_condition> [ ...n ] 
  } 
<simple_term> ::= 
     { word | "phrase" }
<prefix_term> ::= 
  { "word*" | "phrase*" }

word  : Is a string of characters without spaces or punctuation.

phrase : Is one or more words with spaces between each word.

A phrase must be enclosed in double quotation marks, which group multiple words into a single phrase.

1. Matching a single term with CONTAINS

<simple_term>

Specifies a match for an exact word or a phrase. Examples of valid simple terms are "blue berry", blueberry, and "Microsoft SQL Server". Phrases should be enclosed in double quotation marks (""). Words in a phrase must appear in the same order as specified in <contains_search_condition> as they appear in the database column. The search for characters in the word or phrase is not case-sensitive. Noise words (or stopwords) (such as a, and, or the) in full-text indexed columns are not stored in the full-text index. If a noise word is used in a single word search, SQL Server returns an error message indicating that the query contains only noise words. SQL Server includes a standard list of noise words in the directory \Mssql\Binn\FTERef of each instance of SQL Server.

Punctuation is ignored. Therefore, CONTAINS(testing, "computer failure") matches a row with the value, "Where is my computer? Failure to find it would be expensive."

Matching a single word or a single phrase

--word
DECLARE @SearchWord nvarchar(30)
SET @SearchWord = N'performance'
SELECT Description 
FROM Production.ProductDescription 
WHERE CONTAINS(Description, @SearchWord);

--phrase: the double quotes are part of the search condition
DECLARE @SearchPhrase nvarchar(30)
SET @SearchPhrase = N'"performance tuning"'
SELECT Description 
FROM Production.ProductDescription 
WHERE CONTAINS(Description, @SearchPhrase);

Use the AND, AND NOT, or OR logical operators to match multiple words or multiple phrases

SELECT Name
FROM Production.Product
WHERE CONTAINS(Name, 'Mountain OR Road')

SELECT Name
FROM Production.Product
WHERE CONTAINS(Name, '"Mountain" OR "Road"')
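AND NOT and parenthesized groups follow the same pattern as OR. The following is a minimal sketch, assuming the same AdventureWorks sample table with a full-text index on Name; the word "Frame" is chosen only for illustration.

```sql
-- Rows whose Name contains the word "Mountain" but not the word "Frame"
SELECT Name
FROM Production.Product
WHERE CONTAINS(Name, N'"Mountain" AND NOT "Frame"');

-- Parentheses group sub-conditions inside the search string
SELECT Name
FROM Production.Product
WHERE CONTAINS(Name, N'("Mountain" OR "Road") AND NOT "Frame"');
```

Note that the logical operators live inside the single-quoted search condition; they are parsed by the full-text engine, not by the T-SQL WHERE clause.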

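2. Prefix matching with CONTAINS works much like LIKE 'prefix%', except that CONTAINS uses "*" as the wildcard; "*" matches zero, one, or more characters. A prefix match is written as '"prefix*"'. A full-text index supports wildcards only for prefix matching.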

<prefix_term>

Specifies a match of words or phrases beginning with the specified text. Enclose a prefix term in double quotation marks ("") and add an asterisk (*) before the ending quotation mark, so that all text starting with the simple term specified before the asterisk is matched. The clause should be specified this way: CONTAINS (column, '"text*"'). The asterisk matches zero, one, or more characters (of the root word or words in the word or phrase). If the text and asterisk are not delimited by double quotation marks, so the predicate reads CONTAINS (column, 'text*'), full-text search considers the asterisk as a character and searches for exact matches to text*. The full-text engine will not find words with the asterisk (*) character because word breakers typically ignore such characters.

When <prefix_term> is a phrase, each word contained in the phrase is considered to be a separate prefix. Therefore, a query specifying a prefix term of "local wine*" matches any rows with the text of "local winery", "locally wined and dined", and so on.

SELECT Name
FROM Production.Product
WHERE CONTAINS(Name, '"Chain*"');

SELECT Name
FROM Production.Product
WHERE CONTAINS(Name, '"chain*" OR "full*"');
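One difference from LIKE is worth seeing side by side: LIKE 'prefix%' anchors the pattern to the start of the whole column value, while a CONTAINS prefix term matches any word in the column that begins with the prefix. A minimal sketch, assuming the same AdventureWorks table and full-text index as above:

```sql
-- LIKE: only rows whose Name value itself starts with "Chain"
SELECT Name
FROM Production.Product
WHERE Name LIKE N'Chain%';

-- CONTAINS: rows in which any word of Name starts with "chain",
-- wherever that word sits inside the value
SELECT Name
FROM Production.Product
WHERE CONTAINS(Name, N'"chain*"');
```

The two queries can therefore return different row sets on the same data, so one is not a drop-in replacement for the other.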

3. Querying synonyms (thesaurus) or word stems (stemmer)

<generation_term> ::=  FORMSOF ( { INFLECTIONAL | THESAURUS } , <simple_term> [ ,...n ] )

INFLECTIONAL               

Specifies that the language-dependent stemmer is to be used on the specified simple term. Stemmer behavior is defined based on stemming rules of each specific language. The neutral language does not have an associated stemmer. The column language of the columns being queried is used to refer to the desired stemmer. If language_term is specified, the stemmer corresponding to that language is used.

A given <simple_term> within a <generation_term> will not match both nouns and verbs.

THESAURUS              

Specifies that the thesaurus corresponding to the column full-text language, or the language specified in the query is used. The longest pattern or patterns from the <simple_term> are matched against the thesaurus and additional terms are generated to expand or replace the original pattern. If a match is not found for all or part of the <simple_term>, the non-matching portion is treated as a simple_term.

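Stemmer: under the rules of grammar, an English verb, for example, takes different forms depending on number (singular or plural), person, and tense; all of these forms derive from the same stem.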

--searches for all products with words of the form ride: "riding," "ridden," and so on.
SELECT Description
FROM Production.ProductDescription
WHERE CONTAINS(Description, 'FORMSOF (INFLECTIONAL, ride)');

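THESAURUS (synonyms): synonyms must be configured through an XML file. SQL Server provides a default thesaurus file, which is empty. If "Author", "Writer", and "journalist" are configured as synonyms in the thesaurus file, a full-text query on any one of them matches rows that contain any of the synonyms.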

SELECT Description
FROM Production.ProductDescription
WHERE CONTAINS(Description, 'FORMSOF (THESAURUS, author)');
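To check whether a thesaurus entry is actually being applied, one option is to reload the thesaurus file and inspect how the parser expands the term. The sketch below assumes an English thesaurus (LCID 1033) that has already been edited to contain the author/writer/journalist expansion set, and a login with the permissions these system objects require.

```sql
-- Reload the thesaurus file for LCID 1033 (English) after editing it
EXEC sys.sp_fulltext_load_thesaurus_file 1033;

-- Show the terms the full-text parser generates for the search condition.
-- Arguments: query string, LCID, stoplist id (0 = system stoplist),
-- accent sensitivity (0 = insensitive).
-- Per the documentation, expansion_type 4 marks a thesaurus expansion.
SELECT display_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'FORMSOF(THESAURUS, author)', 1033, 0, 0);
```

If only "author" comes back, the thesaurus file for that language is probably still empty or was not reloaded.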

4. proximity_term: use the NEAR keyword (or the ~ operator) to find rows in which the words occur near one another

<generic_proximity_term> ::= 
  { <simple_term> | <prefix_term> } { { { NEAR | ~ } 
     { <simple_term> | <prefix_term> } } [ ...n ] }

<custom_proximity_term> ::= 
  NEAR ( 
     {
        { <simple_term> | <prefix_term> } [ ,…n ]
     |
        ( { <simple_term> | <prefix_term> } [ ,…n ] ) 
      [, <maximum_distance> [, <match_order> ] ]
     }
       )

<maximum_distance> ::= { integer | MAX }
<match_order> ::= { TRUE | FALSE }

NEAR | ~                  

Indicates that the word or phrase on each side of the NEAR or ~ operator must occur in a document for a match to be returned. You must specify two search terms. A given search term can be either a single word or a phrase that is delimited by double quotation marks ("phrase").

Several proximity terms can be chained, as in a NEAR b NEAR c or a ~ b ~ c. Chained proximity terms must all be in the document for a match to be returned.

For example, CONTAINS(column_name, 'fox NEAR chicken') and CONTAINSTABLE(table_name, column_name, 'fox ~ chicken') would both return any documents in the specified column that contain both "fox" and "chicken". In addition, CONTAINSTABLE returns a rank for each document based on the proximity of "fox" and "chicken". For example, if a document contains the sentence, "The fox ate the chicken," its ranking would be high because the terms are closer to one another than in other documents.
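The ranking behavior described above can be sketched with CONTAINSTABLE, which returns a KEY column (the full-text key of the matching row) and a RANK column. This sketch assumes the AdventureWorks Production.ProductReview table with a full-text index on Comments whose key is ProductReviewID; the search words are illustrative.

```sql
-- Join the full-text result back to the base table and
-- list the closest "bike"/"control" matches first
SELECT pr.ProductReviewID, pr.Comments, ft.[RANK]
FROM CONTAINSTABLE(Production.ProductReview, Comments,
                   N'bike NEAR control') AS ft
JOIN Production.ProductReview AS pr
    ON pr.ProductReviewID = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```

RANK values are only meaningful relative to other rows of the same query, so they are suited to ordering results rather than to comparison across queries.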

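Use the NEAR operator to require that several words all be present.

The NEAR operator does not imply any order. For example, CONTAINS(column_name, 'fox NEAR chicken') means that the two words "fox" and "chicken" both occur in the same row; it does not mean that "fox" comes before "chicken". Both "fox eats chicken" and "Chicken runs when it sees a weak fox" match.

Use the NEAR function form to specify the maximum distance between the words and the match order. NEAR((term1,term2,term3),5) means that the terms must lie within a distance of 5 of one another; NEAR((term1,term2,term3),5,TRUE) additionally requires that they appear in the string in the order term1, term2, term3.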

<custom_proximity_term>

Specifies a match of words or phrases, and optionally, the maximum distance allowed between search terms. You can also specify that search terms must be found in the exact order in which you specify them (<match_order>).

--regardless of the intervening distance and regardless of order
CONTAINS(column_name, 'NEAR(term1,"term3 term4")')

--searches for "AA" and "BB", in either order, within a maximum distance of five
CONTAINS(column_name, 'NEAR((AA,BB),5)')

--in the specified order, regardless of the distance
CONTAINS(column_name, 'NEAR ((Monday, Tuesday, Wednesday), MAX, TRUE)')

For NEAR((term1,term2,term3),5,TRUE), at most 5 terms may occur between term1 and term3, not counting the inner search term "term2". For example:

CONTAINS(column_name, 'NEAR((AA,BB,CC),5)')

This query would match the following string, in which the total distance is five. Notice that the inner search term, "CC", is not counted.

BB one two CC three four five AA

The following example searches the Production.ProductReview table for all comments that contain the word "bike" within 10 terms of the word "control" and in the specified order (that is, where "bike" precedes "control").

SELECT Comments
FROM Production.ProductReview
WHERE CONTAINS(Comments, 'NEAR((bike,control), 10, TRUE)');

II. Comparison of LIKE to Full-Text Search

In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned and their size. Another consideration is that LIKE performs only a simple pattern scan of an entire table. A full-text query, in contrast, is language aware, applying specific transformations at index and query time, such as filtering stopwords and making thesaurus and inflectional expansions. These transformations help full-text queries improve their recall and the final ranking of their results.

Appendix: FREETEXT syntax

When the FREETEXT predicate is used for a full-text search, SQL Server automatically applies word breaking, stemming, and thesaurus matching, so FREETEXT cannot be used for exact-match queries.

Full-text queries using FREETEXT are less precise than those full-text queries using CONTAINS. The SQL Server full-text search engine identifies important words and phrases.

This predicate searches for values that match the meaning and not just the exact wording of the words in the search condition. When FREETEXT is used, the full-text query engine internally performs the following actions on the freetext_string, assigns each term a weight, and then finds the matches: 

  • Separates the string into individual words based on word boundaries (word-breaking).

  • Generates inflectional forms of the words (stemming).

  • Identifies a list of expansions or replacements for the terms based on matches in the thesaurus.

FREETEXT ( { column_name | (column_list) | * } 
          , 'freetext_string' [ , LANGUAGE language_term ] )

freetext_string 

Is text to search for in the column_name. Any text, including words, phrases or sentences, can be entered. Matches are generated if any term or the forms of any term is found in the full-text index.

Unlike in the CONTAINS and CONTAINSTABLE search condition where AND is a keyword, when used in freetext_string the word 'and' is considered a noise word, or stopword, and will be discarded.

Use of WEIGHT, FORMSOF, wildcards, NEAR and other syntax is not allowed. freetext_string is wordbroken, stemmed, and passed through the thesaurus.
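To make the contrast with CONTAINS concrete, the sketch below puts a FREETEXT query next to a CONTAINS query that spells out part of what FREETEXT does implicitly. It assumes the AdventureWorks Production.ProductDescription table with a full-text index on Description; the search words are illustrative, and the two queries are only roughly comparable because FREETEXT also applies thesaurus expansion and term weighting.

```sql
-- FREETEXT: the string is word-broken, stemmed, and run through the thesaurus
SELECT Description
FROM Production.ProductDescription
WHERE FREETEXT(Description, N'riding bikes');

-- CONTAINS: each expansion has to be requested explicitly
SELECT Description
FROM Production.ProductDescription
WHERE CONTAINS(Description,
      N'FORMSOF(INFLECTIONAL, ride) OR FORMSOF(INFLECTIONAL, bike)');
```

CONTAINS leaves the choice of operators (AND, OR, NEAR, prefix terms) to the query author, which is why it is the more precise of the two.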

A. Using FREETEXT to search for words containing specified character values

The following example searches for all documents containing the words related to vital, safety, components.

SELECT Title
FROM Production.Document
WHERE FREETEXT (Document, 'vital safety components');

B. Using FREETEXT with variables

DECLARE @SearchWord nvarchar(30);
SET @SearchWord = N'high-performance';
SELECT Description 
FROM Production.ProductDescription 
WHERE FREETEXT(Description, @SearchWord);

 

Original article: http://www.cnblogs.com/ljhdo/p/5540518.html
