A Simple Implementation of an Inverted Index

Posted: 2015-05-04 18:12:12

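An inverted index is a data structure commonly used in search engines, mainly to implement full-text search. It maps each keyword to the documents that contain it, and many more powerful features are built on top of this mapping. For a detailed description of the inverted index, see Wikipedia. Below I implement one in my own way, purely to get a feel for how the data structure works.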

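TODO: to search for an exact phrase such as "what is it", search for each word separately and keep only the results where the words appear in the same file at consecutive positions. This is a set operation over the per-word results.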
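
The phrase-search idea in the TODO can be sketched roughly as follows. This is only an illustration under assumptions, not part of the implementation below: the class name PhraseSearchSketch, the method searchPhrase, and the nested word -> file -> positions map are all hypothetical choices made for this example. Positions are stored as integers so that "consecutive" can be checked with simple arithmetic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: phrase search on top of a positional inverted index.
class PhraseSearchSketch {

    // word -> (file name -> positions of the word in that file)
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    void addFile(String fileName, String content) {
        String[] words = content.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (words[i].isEmpty()) {
                continue; // empty content yields one empty token
            }
            index.computeIfAbsent(words[i], w -> new HashMap<>())
                 .computeIfAbsent(fileName, f -> new TreeSet<>())
                 .add(i);
        }
    }

    // Returns "(file,start)" for every place where the phrase words
    // occur in the same file at consecutive positions.
    Set<String> searchPhrase(String phrase) {
        Set<String> hits = new TreeSet<>();
        String[] words = phrase.trim().split("\\s+");
        Map<String, Set<Integer>> first = index.get(words[0]);
        if (first == null) {
            return hits;
        }
        for (Map.Entry<String, Set<Integer>> e : first.entrySet()) {
            String file = e.getKey();
            for (int start : e.getValue()) {
                boolean match = true;
                // the k-th word of the phrase must sit at position start + k
                for (int k = 1; k < words.length && match; k++) {
                    Map<String, Set<Integer>> byFile = index.get(words[k]);
                    Set<Integer> positions = byFile == null ? null : byFile.get(file);
                    match = positions != null && positions.contains(start + k);
                }
                if (match) {
                    hits.add("(" + file + "," + start + ")");
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PhraseSearchSketch idx = new PhraseSearchSketch();
        idx.addFile("file1", "tell me what is it exactly");
        idx.addFile("file2", "it is what it is");
        System.out.println(idx.searchPhrase("what is it")); // [(file1,2)]
        System.out.println(idx.searchPhrase("it is"));      // [(file2,0), (file2,3)]
    }
}
```

Starting from the positions of the first word and probing the later words keeps the check cheap; a real engine would more likely merge sorted position lists, but the lookup version is easier to follow.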

package mythought.invertedindex;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InvertedIndex {

    // keyword -> set of "(fileName,position)" entries
    private Map<String, Set<String>> indexs = new HashMap<String, Set<String>>();

    // Assumes the files are small enough to be read into memory in full.
    // Words are split on single spaces, so consecutive spaces yield empty tokens.
    public void addFile(String fileName, String content) {
        String[] words = content.split(" ");
        for (int i = 0; i < words.length; i++) {
            String word = words[i];
            // record every position at which the word appears
            Set<String> wordIndex = indexs.get(word);
            if (wordIndex == null) {
                wordIndex = new HashSet<String>();
                indexs.put(word, wordIndex);
            }
            wordIndex.add("(" + fileName + "," + i + ")");
        }
    }

    // Reads the whole file, then indexes it under its file name.
    public void addFile(String fileName) throws Exception {
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        try {
            StringBuilder sb = new StringBuilder();
            String line = br.readLine();
            while (line != null) {
                sb.append(line);
                sb.append(" "); // join lines with a single space
                line = br.readLine();
            }
            this.addFile(fileName, sb.toString());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            br.close();
        }
    }

    public Set<String> search(String keyword) {
        Set<String> results = indexs.get(keyword);
        // an unknown keyword gives an empty set instead of a NullPointerException
        if (results == null) {
            return new HashSet<String>();
        }
        // return a copy so callers cannot modify the index
        return new HashSet<String>(results);
    }

    public static void main(String[] args) throws Exception {
        InvertedIndex test = new InvertedIndex();
        test.addFile("file1", "hello fuck world todotodo");
        test.addFile("file2", "go to get it if you want it");
        test.addFile("C:/data/hello.txt");
        System.out.println(test.search("it"));
        System.out.println(test.search("you"));
        System.out.println(test.search("vonzhou"));
    }
}

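Output (entries within a line may print in a different order, since HashSet does not guarantee iteration order):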

[(file2,7), (file2,3)]

[(file2,5)]

[(C:/data/hello.txt,4)]
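
A simpler use of the set operations mentioned in the TODO is finding the files that contain all of several keywords: look up each keyword separately and intersect the resulting sets of files. The sketch below assumes a simplified index that keeps only file names and drops positions. The names AllKeywordsSketch, filesByWord and searchAll are hypothetical and do not appear in the code above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: multi-keyword search as a set intersection.
class AllKeywordsSketch {

    // keyword -> names of the files that contain it (positions are dropped here)
    private final Map<String, Set<String>> filesByWord = new HashMap<>();

    void addFile(String fileName, String content) {
        for (String word : content.split(" ")) {
            if (word.isEmpty()) {
                continue; // skip empty tokens produced by consecutive spaces
            }
            filesByWord.computeIfAbsent(word, w -> new HashSet<String>()).add(fileName);
        }
    }

    // Files that contain every keyword: the intersection of the per-keyword sets.
    Set<String> searchAll(List<String> keywords) {
        Set<String> result = null;
        for (String keyword : keywords) {
            Set<String> files = filesByWord.get(keyword);
            if (files == null) {
                files = new HashSet<String>(); // unknown keyword matches no file
            }
            if (result == null) {
                result = new TreeSet<String>(files); // copy, so the index is not modified
            } else {
                result.retainAll(files);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    public static void main(String[] args) {
        AllKeywordsSketch idx = new AllKeywordsSketch();
        idx.addFile("file1", "go to get it");
        idx.addFile("file2", "go to get it if you want it");
        System.out.println(idx.searchAll(Arrays.asList("it", "you"))); // [file2]
        System.out.println(idx.searchAll(Arrays.asList("go", "it")));  // [file1, file2]
    }
}
```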

References:

1. http://en.wikipedia.org/wiki/Inverted_index

Original article: http://blog.csdn.net/vonzhoufz/article/details/45482663
