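-- Returns 1 when the Full-Text Search component is installed
SELECT SERVERPROPERTY('IsFullTextInstalled');
Full-text indexes support the following data types: CHAR, VARCHAR, NCHAR, NVARCHAR, TEXT, NTEXT, IMAGE, XML, and VARBINARY(MAX).
Filters (iFilters) extract the text from a document and strip its formatting. To list the installed filters:
EXEC sys.sp_help_fulltext_system_components 'filter';
-- or
SELECT document_type, path FROM sys.fulltext_document_types;
You can install additional filters, for example the filter pack for Microsoft Office 2010 documents: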
http://www.microsoft.com/en-us/download/details.aspx?id=17062
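After installing the filter pack, register it in SQL Server with the following statement:
EXEC sys.sp_fulltext_service 'load_os_resources', 1;
A restart may be needed; afterwards, use sys.sp_help_fulltext_system_components to check whether the installation succeeded.
Word breakers and stemmers perform linguistic analysis on all full-text indexed data. Linguistic analysis involves finding word boundaries (word breaking) and conjugating verbs (stemming). Word breakers and stemmers are language specific, and the rules of linguistic analysis differ from language to language. For a given language, a "word breaker" identifies individual words by determining where word boundaries lie according to the lexical rules of that language. Each word (also called a "token") is inserted into the full-text index in a compressed representation to reduce its size. A "stemmer" generates the inflectional forms of a particular word according to the rules of that language (for example, "running", "ran", and "runner" are different forms of the word "run").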
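To see what the word breaker and the stemmer actually produce, one option is the sys.dm_fts_parser function (covered in more detail further down). The following is a minimal sketch, assuming the English word breaker (LCID 1033) is installed; the sample sentence is made up for illustration, and the stoplist argument 0 selects the system stoplist.

```sql
-- Word breaking: one row per token found in the quoted phrase.
-- Arguments: query string, LCID (1033 = English),
-- stoplist ID (0 = system stoplist), accent sensitivity (0 = insensitive).
SELECT occurrence, display_term, special_term
FROM sys.dm_fts_parser(N'"The runner was running"', 1033, 0, 0);

-- Stemming: the inflectional forms generated for a single word.
SELECT display_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, run)', 1033, 0, 0);
```

In the first result, special_term shows whether a token was treated as an exact match, a noise word, or an end-of-sentence marker, which makes it easy to see which words would be dropped from the index.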
The following statement lists the languages that SQL Server supports for full-text search:
SELECT lcid , name FROM sys.fulltext_languages ORDER BY name;
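On a non-localized installation the default language is English. Use the stored procedure sys.sp_configure to change the default full-text language:
USE AdventureWorks2012;
GO
EXEC sp_configure 'show advanced options', 1;
GO
RECONFIGURE
GO
-- 1043 is the LCID for Dutch
EXEC sp_configure 'default full-text language', 1043;
GO
RECONFIGURE
GO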
Some noise words do not help queries; you can create stoplists of stopwords to exclude them.
The following statements show the stoplists and stopwords used in the current database:
SELECT stoplist_id, name FROM sys.fulltext_stoplists;
SELECT stoplist_id, stopword, language FROM sys.fulltext_stopwords;
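As a small illustration of how a stoplist is maintained, the sketch below copies the system stoplist and then adds and removes one word. The stoplist name DemoStopList and the stopword 'lorem' are hypothetical and chosen only for this example.

```sql
-- Start from a copy of the system stoplist instead of an empty list.
CREATE FULLTEXT STOPLIST DemoStopList FROM SYSTEM STOPLIST;
GO
-- Add a stopword for English, then remove it again.
ALTER FULLTEXT STOPLIST DemoStopList ADD N'lorem' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST DemoStopList DROP N'lorem' LANGUAGE 'English';
GO
-- Remove the demo stoplist.
DROP FULLTEXT STOPLIST DemoStopList;
GO
```

Copying the system stoplist keeps the built-in noise words for every language, so only the domain-specific additions have to be managed by hand.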
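Full-text search also supports synonym queries. The synonyms are kept in thesaurus files, and each language has its own XML thesaurus file. The files are in the following directory: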
SQL_Server_install_path\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\FTDATA\
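diacritics_sensitive: whether accents are distinguished; 0 = accent-insensitive, 1 = accent-sensitive.
expansion: search with additional terms. For example, if you add "author" and "writer", a search for "writer" also searches for "author".
replacement: when a given term is searched for, replace it with another term. For example, replace "Win 2k8" with "Windows 2008".
After editing a thesaurus file you still have to load it: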
EXEC sys.sp_fulltext_load_thesaurus_file 1033
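A full-text catalog is a virtual object that does not belong to any filegroup; it is a logical container for full-text indexes. The CREATE FULLTEXT INDEX syntax also has an additional option, STATISTICAL_SEMANTICS (semantic search).
Statistical semantic search provides deeper insight into unstructured documents stored in SQL Server by extracting and indexing statistically relevant "key phrases". It then uses these key phrases to identify and index "similar or related documents". You query these semantic indexes with three Transact-SQL rowset functions, which return the results as structured data.
Semantic search extends full-text search and lets you query the meaning of documents. For example, you can build a document classification from the key-phrase index, or find the job descriptions that match a résumé. Semantic search lets you build your own text-mining solutions.
We first create a demo table and fill it with some text, then create a stoplist with stopwords, and finally search the document contents.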
column name | Data type | nullability | remarks |
id | INT | NOT NULL | IDENTITY, PRIMARY KEY |
title | NVARCHAR(100) | NOT NULL | Name of the documents you are going to import |
doctype | NCHAR(4) | NOT NULL | Type of the documents you are going to import |
docexcerpt | NVARCHAR(1000) | NOT NULL | Excerpt of the documents you are going to import |
doccontent | VARBINARY(MAX) | NOT NULL | Documents you are going to import |
-- Table for documents
CREATE TABLE dbo.Documents
(
  id INT IDENTITY(1,1) NOT NULL,
  title NVARCHAR(100) NOT NULL,
  doctype NCHAR(4) NOT NULL,
  docexcerpt NVARCHAR(1000) NOT NULL,
  doccontent VARBINARY(MAX) NOT NULL,
  CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED(id)
);
GO
-- Insert data
INSERT INTO dbo.Documents (title, doctype, docexcerpt, doccontent)
SELECT N'Columnstore Indices and Batch Processing', N'docx',
  N'You should use a columnstore index on your fact tables, putting all columns of a fact table in a columnstore index. In addition to fact tables, very large dimensions could benefit from columnstore indices as well. Do not use columnstore indices for small dimensions. ',
  bulkcolumn
FROM OPENROWSET(BULK 'C:\TK461\ColumnstoreIndicesAndBatchProcessing.docx', SINGLE_BLOB) AS doc;
INSERT INTO dbo.Documents (title, doctype, docexcerpt, doccontent)
SELECT N'Introduction to Data Mining', N'docx',
  N'Using Data Mining is becoming more a necessity for every company and not an advantage of some rare companies anymore. ',
  bulkcolumn
FROM OPENROWSET(BULK 'C:\TK461\IntroductionToDataMining.docx', SINGLE_BLOB) AS doc;
INSERT INTO dbo.Documents (title, doctype, docexcerpt, doccontent)
SELECT N'Why Is Bleeding Edge a Different Conference', N'docx',
  N'During high level presentations attendees encounter many questions. For the third year, we are continuing with the breakfast Q&A session. It is very popular, and for two years now, we could not accommodate enough time for all questions and discussions! ',
  bulkcolumn
FROM OPENROWSET(BULK 'C:\TK461\WhyIsBleedingEdgeADifferentConference.docx', SINGLE_BLOB) AS doc;
INSERT INTO dbo.Documents (title, doctype, docexcerpt, doccontent)
SELECT N'Additivity of Measures', N'docx',
  N'Additivity of measures is not exactly a data warehouse design problem. However, you have to realize which aggregate functions you will use in reports for which measure, and which aggregate functions you will use when aggregating over which dimension.',
  bulkcolumn
FROM OPENROWSET(BULK 'C:\TK461\AdditivityOfMeasures.docx', SINGLE_BLOB) AS doc;
GO
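Previously, the contents of document properties could not be distinguished from the contents of the document body. This limitation restricted full-text queries to generic searches over whole documents. Now, for supported document types stored in varbinary, varbinary(max) (including FILESTREAM), or image binary data columns, you can configure a full-text index to support property-scoped searches on specific properties such as Author and Title. This form of searching is called "property search".
-- Search property list
CREATE SEARCH PROPERTY LIST WordSearchPropertyList;
GO
ALTER SEARCH PROPERTY LIST WordSearchPropertyList
  ADD 'Authors'
  WITH (PROPERTY_SET_GUID = 'F29F85E0-4FF9-1068-AB91-08002B27B3D9',
        PROPERTY_INT_ID = 4,
        PROPERTY_DESCRIPTION = 'System.Authors - authors of a given item.');
GO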
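To confirm that the property was registered, the catalog views for search property lists can be queried. This is a sketch that assumes the WordSearchPropertyList created above exists in the current database.

```sql
-- List every property registered in each search property list.
SELECT l.name AS property_list_name,
       p.property_name,
       p.property_set_guid,
       p.property_int_id,
       p.property_description
FROM sys.registered_search_property_lists AS l
INNER JOIN sys.registered_search_properties AS p
  ON p.property_list_id = l.property_list_id
ORDER BY l.name, p.property_name;
```

The GUID and integer ID shown here must match the values the document's filter emits for that property; otherwise the property is indexed as ordinary body text.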
-- Stopwords list
CREATE FULLTEXT STOPLIST SQLStopList;
GO
ALTER FULLTEXT STOPLIST SQLStopList ADD 'SQL' LANGUAGE 'English';
GO
-- Check the Stopwords list
SELECT w.stoplist_id, l.name, w.stopword, w.language
FROM sys.fulltext_stopwords AS w
INNER JOIN sys.fulltext_stoplists AS l
  ON w.stoplist_id = l.stoplist_id;
GO
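sys.dm_fts_parser returns the final tokenization result after applying a given combination of word breaker, thesaurus, and stoplist to a query string input. This tokenization result is equivalent to the output of the Full-Text Engine for the specified query string.
-- Arguments: 1) the string to parse, 2) the language ID,
-- 3) the stoplist ID (look it up with the previous query), 4) accent sensitivity
SELECT *
FROM sys.dm_fts_parser
(N'"Additivity of measures is not exactly a data warehouse design problem. However, you have to realize which aggregate functions you will use in reports for which measure, and which aggregate functions you will use when aggregating over which dimension."', 1033, 5, 0);
SELECT *
FROM sys.dm_fts_parser
('FORMSOF(INFLECTIONAL,'+ 'function' + ')', 1033, 5, 0);
If the following statement returns no rows, the semantic language statistics database has to be installed first; the installer is SemanticLanguageDatabase.msi in the x64\Setup folder.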
SELECT * FROM sys.fulltext_semantic_language_statistics_database;
CREATE DATABASE semanticsdb ON
  (FILENAME = 'C:\Program Files\Microsoft Semantic Language Database\semanticsdb.mdf'),
  (FILENAME = 'C:\Program Files\Microsoft Semantic Language Database\semanticsdb_log.ldf')
FOR ATTACH;
EXEC sp_fulltext_semantic_register_language_statistics_db @dbname = N'semanticsdb';
GO
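After registration it can be useful to check that the database is visible to the instance and which languages it covers. The following is a sketch that assumes the registration above succeeded.

```sql
-- One row is returned once a semantic language statistics database is registered.
SELECT DB_NAME(database_id) AS semantic_db, register_date, version
FROM sys.fulltext_semantic_language_statistics_database;

-- Languages for which statistical semantic indexing is available.
SELECT lcid, name
FROM sys.fulltext_semantic_languages
ORDER BY name;
```

Only the languages listed by the second query can be used with STATISTICAL_SEMANTICS on a full-text index column.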
CREATE FULLTEXT CATALOG DocumentsFtCatalog;
-- Full-text index
CREATE FULLTEXT INDEX ON dbo.Documents
(
  docexcerpt Language 1033,
  doccontent TYPE COLUMN doctype Language 1033 STATISTICAL_SEMANTICS
)
KEY INDEX PK_Documents
ON DocumentsFtCatalog
WITH STOPLIST = SQLStopList,
     SEARCH PROPERTY LIST = WordSearchPropertyList,
     CHANGE_TRACKING AUTO;
GO
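The index is populated asynchronously, so queries issued immediately after CREATE FULLTEXT INDEX may return nothing. A minimal sketch for checking progress, using the catalog and table names from above:

```sql
-- 0 means the population is idle (finished); other values mean work in progress.
SELECT FULLTEXTCATALOGPROPERTY(N'DocumentsFtCatalog', 'PopulateStatus') AS catalog_populate_status,
       OBJECTPROPERTYEX(OBJECT_ID(N'dbo.Documents'), 'TableFulltextPopulateStatus') AS table_populate_status;

-- Peek at what was indexed: the terms that appear in the most documents.
SELECT TOP (10) display_term, column_id, document_count
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID(N'dbo.Documents'))
ORDER BY document_count DESC;
```

Once both status values are 0, the queries that follow should see all four documents.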
CONTAINS and FREETEXT are both predicates used in the WHERE clause to search columns of character-based data types. CONTAINS searches for precise or fuzzy matches of single words or phrases. With FREETEXT, the full-text query engine first processes the input words, phrases, or sentences internally (word breaking, inflectional forms, and the expansions or replacements defined in the thesaurus) and then runs the query, so it finds values that match the meaning of the search terms rather than their exact wording.
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'data');
-- Logical operators - OR
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'data OR index');
-- Logical operators - AND NOT
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'data AND NOT mining');
-- Logical operators - parentheses
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'data OR (fact AND warehouse)');
-- Phrase
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'"data warehouse"');
-- Prefix
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'"add*"');
Find the rows in which the words "problem" and "data" occur near each other, without specifying order or distance:
-- Simple proximity
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'NEAR(problem, data)');
Set a maximum distance:
-- Proximity with max distance
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'NEAR((problem, data),5)');
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'NEAR((problem, data),1)');
Require the terms to appear in the specified order:
-- Proximity with max distance and order
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'NEAR((problem, data),5, TRUE)');
-- Inflectional forms
-- The next query does not return any rows
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'presentation');
-- The next query returns a row
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'FORMSOF(INFLECTIONAL, presentation)');
GO
<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <expansion>
      <sub>Internet Explorer</sub>
      <sub>IE</sub>
      <sub>IE5</sub>
    </expansion>
    <replacement>
      <pat>NT5</pat>
      <pat>W2K</pat>
      <sub>Windows 2000</sub>
    </replacement>
    <expansion>
      <sub>run</sub>
      <sub>jog</sub>
    </expansion>
    <expansion>
      <sub>need</sub>
      <sub>necessity</sub>
    </expansion>
  </thesaurus>
</XML>
EXEC sys.sp_fulltext_load_thesaurus_file 1033;
-- Synonyms
-- The next query does not return any rows
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'need');
-- The next query returns a row
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'FORMSOF(THESAURUS, need)');
-- Property search
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(PROPERTY(doccontent,'Authors'), 'Dejan');
-- FREETEXT
SELECT id, title, doctype, docexcerpt
FROM dbo.Documents
WHERE FREETEXT(docexcerpt, N'data presentation need');
Full-text search can rank documents by weight (the rank calculation is complex and depends on factors such as term frequency, counts, NEAR clauses, ISABOUT clauses, and the number of indexed rows). Use the CONTAINSTABLE and FREETEXTTABLE functions for ranked results. In addition, semantic queries use three table-valued functions: SEMANTICKEYPHRASETABLE, SEMANTICSIMILARITYDETAILSTABLE, and SEMANTICSIMILARITYTABLE.
-- Rank with CONTAINSTABLE
SELECT D.id, D.title, CT.[RANK], D.docexcerpt
FROM CONTAINSTABLE(dbo.Documents, docexcerpt, N'data OR level') AS CT
INNER JOIN dbo.Documents AS D
  ON CT.[KEY] = D.id
ORDER BY CT.[RANK] DESC;
-- Rank with FREETEXTTABLE
SELECT D.id, D.title, FT.[RANK], D.docexcerpt
FROM FREETEXTTABLE (dbo.Documents, docexcerpt, N'data level') AS FT
INNER JOIN dbo.Documents AS D
  ON FT.[KEY] = D.id
ORDER BY FT.[RANK] DESC;
-- Weighted terms
SELECT D.id, D.title, CT.[RANK], D.docexcerpt
FROM CONTAINSTABLE (dbo.Documents, docexcerpt, N'ISABOUT(data weight(0.8), level weight(0.2))') AS CT
INNER JOIN dbo.Documents AS D
  ON CT.[KEY] = D.id
ORDER BY CT.[RANK] DESC;
-- Proximity term
SELECT D.id, D.title, CT.[RANK]
FROM CONTAINSTABLE (dbo.Documents, doccontent, N'NEAR((data, row), 30)') AS CT
INNER JOIN dbo.Documents AS D
  ON CT.[KEY] = D.id
ORDER BY CT.[RANK] DESC;
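When only the best matches are needed, CONTAINSTABLE and FREETEXTTABLE accept an optional final argument, top_n_by_rank, that limits the result to the highest-ranked rows. A sketch against the same table; the limit of 2 is an arbitrary value chosen for illustration.

```sql
-- Return at most the two highest-ranked matches.
SELECT D.id, D.title, CT.[RANK]
FROM CONTAINSTABLE(dbo.Documents, docexcerpt, N'data OR level', 2) AS CT
INNER JOIN dbo.Documents AS D
  ON CT.[KEY] = D.id
ORDER BY CT.[RANK] DESC;
```

Limiting inside the function lets the full-text engine discard low-ranked rows before the join, which tends to matter more on large tables than on this four-row demo.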
-- 1. Top 20 semantic key phrases
SELECT TOP (20) D.id, D.title, SKT.keyphrase, SKT.score
FROM SEMANTICKEYPHRASETABLE (dbo.Documents, doccontent) AS SKT
INNER JOIN dbo.Documents AS D
  ON SKT.document_key = D.id
ORDER BY SKT.score DESC;
-- 2. Documents that are similar to document 1
SELECT SST.matched_document_key, D.title, SST.score
FROM SEMANTICSIMILARITYTABLE (dbo.Documents, doccontent, 1) AS SST
INNER JOIN dbo.Documents AS D
  ON SST.matched_document_key = D.id
ORDER BY SST.score DESC;
-- 3. Semantic search key phrases that are common to two documents
SELECT SSDT.keyphrase, SSDT.score
FROM SEMANTICSIMILARITYDETAILSTABLE (dbo.Documents, doccontent, 1, doccontent, 4) AS SSDT
ORDER BY SSDT.score DESC;
GO
-- 4. Clean up
DROP TABLE dbo.Documents;
DROP FULLTEXT CATALOG DocumentsFtCatalog;
DROP SEARCH PROPERTY LIST WordSearchPropertyList;
DROP FULLTEXT STOPLIST SQLStopList;
GO
Word breakers and stemmers
http://blog.csdn.net/tianlianchao1982/article/details/5806193
SQL Server -- Removing common strings from a full-text index
http://blog.csdn.net/fan158/article/details/6273813
CREATE FULLTEXT INDEX (Transact-SQL)
https://msdn.microsoft.com/zh-cn/library/ms187317.aspx
Semantic Search (SQL Server)
https://msdn.microsoft.com/zh-cn/library/gg492075.aspx#find1
Search Document Properties with Search Property Lists
https://msdn.microsoft.com/zh-SG/library/ee677637
Find Property Set GUIDs and Property Integer IDs for Search Properties
https://msdn.microsoft.com/en-us/library/ee677618.aspx
Windows Properties
https://msdn.microsoft.com/library/dd561977.aspx
Full-text search: comparing the usage of CONTAINS and FREETEXT
Original article: http://www.cnblogs.com/haseo/p/4307379.html