[IR] Information Extraction

Posted: 2016-11-08 20:25:41

Tags: its graph ref depend idf oba spider sim search

Interim summary

Boolean retrieval

Term search

【Qword1 and Qword2】 O(x+y)

【Qword1 and Qword2】- improvement: Galloping Search O(2a*log₂(b/a))

【Qword1 and not Qword2】 O(m*log₂n)

【Qword1 or not Qword2】 O(m+n)

【Qword1 and Qword2 and Qword3 and ...】 O(Total_Length * log₂k)
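To make the AND costs above concrete, here is a minimal Python sketch of the two-list intersection: a linear merge, which is O(x+y), and a galloping variant that walks the shorter list and doubles its step in the longer one. The function names and the sample postings are illustrative assumptions, not taken from the original notes.

```python
from bisect import bisect_left

def intersect_linear(p1, p2):
    """Merge-style intersection of two sorted postings lists, O(x + y)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_galloping(short, long):
    """For each docID in the shorter list, gallop through the longer list."""
    out = []
    lo = 0                      # everything in long[:lo] is < current docID
    n = len(long)
    for doc in short:
        step = 1
        hi = lo
        # Double the step until we reach a value >= doc or run off the end.
        while hi < n and long[hi] < doc:
            lo = hi + 1
            hi += step
            step *= 2
        # Binary search inside the bracket found by galloping.
        pos = bisect_left(long, doc, lo, min(hi + 1, n))
        if pos < n and long[pos] == doc:
            out.append(doc)
        lo = pos
    return out

short_list = [3, 8, 21]
long_list = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect_linear(short_list, long_list))
print(intersect_galloping(short_list, long_list))
```

Galloping pays off mainly when one list is much shorter than the other; for lists of similar length the linear merge is usually the simpler choice.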

Phrase search

1. Biword Indexes

2. Positional Index --> Proximity Queries
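A positional index stores, for each term and document, the positions at which the term occurs, which is what makes proximity queries possible. The sketch below is a toy in-memory version; the whitespace tokenizer, the function names, and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: {doc_id: text}. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def within_k(index, t1, t2, k):
    """Doc IDs in which t1 and t2 occur within k positions of each other."""
    postings1 = index.get(t1, {})
    postings2 = index.get(t2, {})
    hits = []
    for doc_id in sorted(postings1.keys() & postings2.keys()):
        # Brute-force position comparison; fine for a toy example.
        if any(abs(p1 - p2) <= k
               for p1 in postings1[doc_id]
               for p2 in postings2[doc_id]):
            hits.append(doc_id)
    return hits

docs = {1: "to be or not to be", 2: "be happy not sad to"}
index = build_positional_index(docs)
print(within_k(index, "to", "be", 1))
```

A strict phrase query is the special case where the second term must sit exactly one position after the first.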

Index Construction

Exploring sorting during index construction:

Blocked sort-based indexing (BSBI)
Single-pass in-memory indexing (SPIMI)
Dynamic indexing
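As a rough illustration of the block-based, sort-based idea, the sketch below collects (term, docID) pairs into fixed-size blocks, sorts each block, and then merges the sorted blocks into postings lists. Blocks stay in memory here, whereas a real build would write each sorted block to disk; `bsbi` and `block_size` are hypothetical names.

```python
import heapq
from itertools import groupby

def bsbi(doc_stream, block_size):
    """doc_stream yields (doc_id, text) with increasing doc_id."""
    blocks, pairs = [], []
    for doc_id, text in doc_stream:
        for term in text.lower().split():
            pairs.append((term, doc_id))
        if len(pairs) >= block_size:
            # Sort the block; a real system would flush it to disk here.
            blocks.append(sorted(set(pairs)))
            pairs = []
    if pairs:
        blocks.append(sorted(set(pairs)))

    # Multi-way merge of the sorted blocks, grouped by term.
    index = {}
    merged = heapq.merge(*blocks)
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
        index[term] = postings
    return index

docs = [(1, "a b"), (2, "b c"), (3, "a c")]
print(bsbi(docs, block_size=3))
```

SPIMI differs mainly in that each block builds its own dictionary and postings directly instead of sorting term-docID pairs.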

Compression

Heaps’ law: M = kT^b

Zipf’s law: cf_i = K/i
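Both laws are easy to evaluate numerically. In the sketch below, M is the vocabulary size, T the number of tokens, and cf_i the collection frequency of the i-th most frequent term. The parameter values in the usage lines (k = 44, b = 0.49, K = 1000) are illustrative assumptions; real values have to be fitted to a specific collection.

```python
def heaps_vocabulary_size(num_tokens, k, b):
    """Heaps' law: M = k * T^b."""
    return k * num_tokens ** b

def zipf_collection_frequency(rank, K):
    """Zipf's law: cf_i = K / i."""
    return K / rank

# Estimated vocabulary size for a one-million-token collection.
print(round(heaps_vocabulary_size(1_000_000, k=44, b=0.49)))

# Under Zipf's law the top term is twice as frequent as the second one.
print(zipf_collection_frequency(1, K=1000) / zipf_collection_frequency(2, K=1000))
```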

Dictionary compression

Postings list compression
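The notes do not name a specific postings compression scheme. One common choice is to store gaps between docIDs and encode each gap with variable byte (VB) codes, sketched below under that assumption. Each byte carries 7 payload bits, and the high bit marks the last byte of a number.

```python
def vb_encode_number(n):
    """Variable byte code for one non-negative integer."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128          # set the continuation bit on the final byte
    return out

def vb_encode(doc_ids):
    """Encode a sorted docID list as VB-coded gaps."""
    stream = []
    prev = 0
    for doc_id in doc_ids:
        stream.extend(vb_encode_number(doc_id - prev))
        prev = doc_id
    return bytes(stream)

def vb_decode(data):
    """Inverse of vb_encode: rebuild docIDs from VB-coded gaps."""
    doc_ids = []
    n = 0
    prev = 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            n = 128 * n + (byte - 128)
            prev += n
            doc_ids.append(prev)
            n = 0
    return doc_ids

postings = [824, 829, 215406]
encoded = vb_encode(postings)
print(len(encoded), vb_decode(encoded))
```

Small gaps, which dominate for frequent terms, fit in a single byte, which is where the saving comes from.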

Approach: start with basic querying, then index construction, then compression.

Tolerant Retrieval & Spelling Correction & Language Model

WILD-CARD QUERIES

prefix　
suffix
"mon*ing"
"Permuterm vocabulary"
K-gram indexes
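A permuterm index stores every rotation of `term$`, so a query such as "mon*ing" can be rotated until the `*` is at the end and answered as a prefix lookup. The sketch below handles a single `*` only and scans a plain dict; a real index would use a B-tree or sorted array for the prefix search. All names and the sample vocabulary are hypothetical.

```python
def permuterm_rotations(term):
    """All rotations of term + '$' (the end-of-word marker)."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(vocab):
    """Map each rotation back to its original term."""
    return {rot: term for term in vocab for rot in permuterm_rotations(term)}

def wildcard_lookup(permuterm, query):
    """Answer a query with exactly one '*', e.g. 'mon*ing', 'mon*', '*ing'."""
    head, tail = query.split("*")
    # Rotate so the wildcard is at the end: X*Y becomes the prefix Y$X.
    prefix = tail + "$" + head
    return sorted({term for rot, term in permuterm.items()
                   if rot.startswith(prefix)})

vocab = ["morning", "monitoring", "money", "evening"]
index = build_permuterm(vocab)
print(wildcard_lookup(index, "mon*ing"))
print(wildcard_lookup(index, "*ing"))
```

K-gram indexes trade this blow-up in dictionary size for a post-filtering step, since matching all k-grams of the query does not guarantee a match on the full pattern.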

Spelling Correction

(1) Error detection

(2) Error correction
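The notes do not say which correction method is used. A common baseline, assumed here, is isolated-term correction by Levenshtein edit distance: a word missing from the vocabulary is flagged as an error, and the closest vocabulary term is proposed. The function names and the sample vocabulary are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def correct(word, vocab):
    """Return word if it is known, else the closest vocabulary term."""
    if word in vocab:
        return word                                     # no error detected
    return min(vocab, key=lambda term: edit_distance(word, term))

vocab = ["giraffe", "grape", "graph"]
print(edit_distance("kitten", "sitting"))
print(correct("graffe", vocab))
```

Scanning the whole vocabulary is only workable for small dictionaries; a k-gram index is one way to narrow the candidate set first.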

Language Model

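Query likelihood model --> mixture model: Jelinek-Mercer method

Compute the probability of the query under the document model M_d, then rank documents by that probability.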
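A minimal sketch of Jelinek-Mercer smoothing follows: each query term's probability is a linear mix of its maximum-likelihood estimate in the document and in the whole collection. The mixing weight `lam`, the whitespace tokenizer, and the two sample documents are assumptions for illustration.

```python
from collections import Counter

def jm_query_likelihood(query, doc, collection, lam=0.5):
    """P(q | d) = product over query terms of
    lam * P(t | M_d) + (1 - lam) * P(t | M_c)."""
    doc_counts = Counter(doc.lower().split())
    coll_counts = Counter(t for text in collection
                          for t in text.lower().split())
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    score = 1.0
    for term in query.lower().split():
        p_doc = doc_counts[term] / doc_len
        p_coll = coll_counts[term] / coll_len
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

d1 = "Xyzzy reports a profit but revenue is down"
d2 = "Quorus narrows quarter loss but revenue decreases further"
collection = [d1, d2]
for d in collection:
    print(jm_query_likelihood("revenue down", d, collection))
```

Without the collection term, d2 would score zero because it never contains "down"; the mixture keeps it rankable while still placing d1 first.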

Probabilistic Model

Binary Independence Model (BIM)

For a given query, should a given term appear in the document?

When a new doc arrives, the relation between every term and that doc is computed, yielding C_i.
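For reference, the term weight in the standard textbook formulation of the Binary Independence Model can be written as follows. The notation is an assumption, since the notes do not define their symbols: p_i = P(x_i = 1 | R = 1, q) is the probability that term i occurs in a relevant document, r_i = P(x_i = 1 | R = 0, q) the probability that it occurs in a non-relevant one, and c_i is taken to correspond to the C_i mentioned above.

```latex
\mathrm{RSV}_d = \sum_{i:\, x_i = q_i = 1} c_i,
\qquad
c_i = \log \frac{p_i \,(1 - r_i)}{r_i \,(1 - p_i)}
```

Each c_i is a log odds ratio, and a document's retrieval status value is the sum of c_i over the query terms it contains.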

Link Analysis

The number of pages with in-degree i is proportional to 1/i^α, for example α = 2.1.

1. Number of In Degree.

2. "Flow" Model

- small graphs.
- large graphs. (asymptotic behavior of Markov chains)

  - Spider traps
  - Dead Ends
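The usual remedy for spider traps and dead ends in the flow model is random teleportation. The sketch below runs power iteration with a damping factor `beta` and redistributes all leaked rank mass (teleport plus dead-end loss) uniformly. The value beta = 0.85, the iteration count, and the toy graph are assumptions.

```python
def pagerank(links, beta=0.85, iters=100):
    """links: {node: [out-neighbours]}; every node must appear as a key."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: 0.0 for v in nodes}
        for v in nodes:
            outs = links[v]
            if outs:
                # Follow an out-link with probability beta.
                share = beta * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
        # Mass lost to teleporting and to dead ends is spread uniformly.
        leaked = 1.0 - sum(new.values())
        for v in nodes:
            new[v] += leaked / n
        rank = new
    return rank

# "c" links only to itself (spider trap); "d" has no out-links (dead end).
graph = {"a": ["b", "d"], "b": ["c"], "c": ["c"], "d": []}
for node, score in sorted(pagerank(graph).items()):
    print(node, round(score, 4))
```

Without teleportation the spider trap would absorb all rank, and the dead end would make the total rank drain to zero.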

Ranking - top k

Exact method:

Cosine similarity: tf-idf
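A small sketch of cosine similarity over tf-idf vectors follows. The weighting used here, (1 + log10 tf) * log10(N / df), is one common variant and is an assumption, as are the whitespace tokenizer and the sample documents.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (1 + math.log10(c)) * math.log10(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["cat sat mat", "cat sat hat", "dog ran far"]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```

Length normalization is what keeps long documents from winning simply because they contain more terms.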

Exact speedups:

Use Quick Select: O(n + k log k), that is, "find top k" + "sort top k"
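The "find top k, then sort top k" split can be sketched as below: a randomized quickselect isolates the k largest scores in expected O(n) time, and only those k are sorted. The list-based partitioning is chosen for readability rather than speed, and the function names are hypothetical.

```python
import random

def select_top(items, k):
    """Return the k largest items in arbitrary order (expected O(n))."""
    if k <= 0:
        return []
    if k >= len(items):
        return list(items)
    pivot = random.choice(items)
    greater = [x for x in items if x > pivot]
    equal = [x for x in items if x == pivot]
    if k <= len(greater):
        return select_top(greater, k)
    if k <= len(greater) + len(equal):
        return greater + equal[: k - len(greater)]
    less = [x for x in items if x < pivot]
    return greater + equal + select_top(less, k - len(greater) - len(equal))

def top_k(items, k):
    """Find the top k, then sort only those k (O(k log k))."""
    return sorted(select_top(items, k), reverse=True)

scores = [0.2, 0.9, 0.5, 0.7, 0.1, 0.9]
print(top_k(scores, 3))
```

In everyday Python code, `heapq.nlargest(k, items)` gives the same result with O(n log k) cost and is usually the more practical choice.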

Threshold Methods - MaxScore Method

Approximate (inexact) speedups:

Index Elimination (heuristic function)

3 of 4 query terms

Champion List

Cluster Pruning Method

Evaluation

Evaluation of unranked retrieval results
Evaluation of ranked retrieval results
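The notes do not list specific metrics. As an illustration, the sketch below computes precision, recall, and F1 for an unranked result set, and average precision for a ranked list; these are assumed to be the intended measures, and the sample IDs are made up.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based measures for an unranked result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def average_precision(ranking, relevant):
    """Mean of precision@i over the ranks i where a relevant doc appears,
    divided by the total number of relevant docs."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for i, doc_id in enumerate(ranking, 1):
        if doc_id in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

print(precision_recall_f1([1, 2, 3, 4], {2, 4, 5}))
print(average_precision([1, 2, 3, 4], {2, 4, 5}))
```

Averaging `average_precision` over a set of queries gives mean average precision (MAP).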

Coarse-grained goal --> fine-grained goal

• Text Categorization:
　　– Classify an entire document

• Information Extraction (IE):
　　– Identify and classify small units within documents

segmentation: extract terms (NE); syntactic level
classification: identify terms (type, chunking); semantic level
association: cluster terms

• Named Entity Extraction (NE):
　　– A subset of IE
　　– Identify and classify proper names: "People, locations, organizations"

Main tasks
• Named Entity Recognition
• Relation Extraction

Pattern-based Relation Extraction

– Relation extraction and its difficulties

– Use of POS Tags
– Use of Constituent Parse
– Use of Dependency Parse
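As a toy illustration of a pattern over POS tags, the sketch below looks for the surface pattern "NNP+ was born in NNP+" in an already tagged sentence and emits a (subject, relation, object) triple. The `born_in` relation, the hand-written tags, and the example sentence are all hypothetical; a real pipeline would get its tags from a POS tagger.

```python
def proper_nouns_before(tagged, end):
    """Consecutive NNP tokens ending just before index `end`."""
    start = end
    while start > 0 and tagged[start - 1][1] == "NNP":
        start -= 1
    return [word for word, _ in tagged[start:end]]

def proper_nouns_from(tagged, start):
    """Consecutive NNP tokens beginning at index `start`."""
    end = start
    while end < len(tagged) and tagged[end][1] == "NNP":
        end += 1
    return [word for word, _ in tagged[start:end]]

def extract_born_in(tagged):
    """tagged: list of (word, pos_tag). Returns (subject, relation, object)."""
    words = [word.lower() for word, _ in tagged]
    relations = []
    for i in range(len(words) - 2):
        if words[i:i + 3] == ["was", "born", "in"]:
            subject = proper_nouns_before(tagged, i)
            obj = proper_nouns_from(tagged, i + 3)
            if subject and obj:
                relations.append((" ".join(subject), "born_in", " ".join(obj)))
    return relations

sentence = [("Marie", "NNP"), ("Curie", "NNP"), ("was", "VBD"),
            ("born", "VBN"), ("in", "IN"), ("Warsaw", "NNP"), (".", ".")]
print(extract_born_in(sentence))
```

Surface patterns like this break as soon as words intervene between the arguments, which is one reason constituent and dependency parses are used instead.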

Original article: http://www.cnblogs.com/jesse123/p/6044106.html