A Summary of Sensitive-Word Detection in Java

Date: 2014-09-02 00:28:14


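In a previous project, the client asked for sensitive-word detection on the feature that converts text to speech and sends it, so that users cannot submit speech containing sensitive words. After going through various sources, I put together roughly three approaches: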

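1. Load the sensitive-word list into a cache at startup (one big map, with each sensitive word as the key and any value). Segment the incoming request text into words, iterate over the tokens, and look each one up in the map; if a token is found, the text contains a sensitive word.
2. Join the whole sensitive-word list into one large regular expression and match it directly against the text.
3. Use a DFA (deterministic finite automaton).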
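As a rough illustration of approach 2, the sketch below joins a word list into a single alternation and scans the text with it. The class name `RegexBanWordsSketch` and its two helper methods are hypothetical names chosen for this example, and the sketch assumes a non-empty word list. Each word is passed through `Pattern.quote` so that regex metacharacters inside a word are matched literally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating approach 2; not part of the original post's code.
final class RegexBanWordsSketch {

    /** Builds one pattern of the form word1|word2|..., with every word quoted literally. */
    static Pattern compile(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(w));
        }
        return Pattern.compile(sb.toString(),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    /** Returns every non-overlapping match of the pattern in the text. */
    static List<String> find(Pattern pattern, String text) {
        List<String> hits = new ArrayList<String>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern p = compile(Arrays.asList("小广告", "啦啦"));
        System.out.println(find(p, "我就打小广告，气死版主")); // prints [小广告]
    }
}
```

The pattern should be compiled once and reused; recompiling a large alternation on every request would add avoidable cost.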

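To choose between them, I looked at a lot of other people's code online. The simplest is approach 2, the regular expression, but it reportedly becomes very inefficient once the text gets long. As for approach 3, the DFA algorithm, I did not pay attention in my algorithms and compiler courses at school (embarrassing = =||), so I skipped it outright. In the end I went with approach 1.
Approach 1 still leaves a lot of room for improvement. I later followed the method described in reply #12 of this thread, which uses an index array plus associative arrays. It makes lookups faster and even removes the word-segmentation step. The full implementation follows.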
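For readers curious about approach 3, which this post does not use, the sketch below shows the simplified form it often takes in practice: a plain trie that is walked again from every start position in the text. This is an assumption about what a minimal version could look like, not code from any of the sources mentioned here; the class name `TrieBanWordsSketch` is hypothetical. A true single-pass automaton would also need failure links, as in the Aho-Corasick algorithm, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical trie-based matcher; a simplified stand-in for the DFA approach.
final class TrieBanWordsSketch {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<Character, Node>();
        boolean terminal; // true when the path from the root spells a complete word
    }

    private final Node root = new Node();

    /** Inserts one word into the trie, creating nodes as needed. */
    void add(String word) {
        Node n = root;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            Node child = n.next.get(c);
            if (child == null) {
                child = new Node();
                n.next.put(c, child);
            }
            n = child;
        }
        n.terminal = true;
    }

    /** Walks the trie from every start position and collects every word found. */
    List<String> search(String text) {
        List<String> hits = new ArrayList<String>();
        for (int start = 0; start < text.length(); start++) {
            Node n = root;
            for (int end = start; end < text.length(); end++) {
                n = n.next.get(text.charAt(end));
                if (n == null) {
                    break; // no word continues with this character
                }
                if (n.terminal) {
                    hits.add(text.substring(start, end + 1));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TrieBanWordsSketch trie = new TrieBanWordsSketch();
        trie.add("小广告");
        trie.add("就爱凤姐");
        System.out.println(trie.search("我就打小广告，气死版主")); // prints [小广告]
    }
}
```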

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;

import java.io.IOException;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;


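/**
* User: eternity
* Date: 2014/8/11
* Time: 16:17
* Sensitive-word detection class.
*
* Initialization rules:
* Load the sensitive words from the word list and build one hash table per word
* length (2 characters, 3 characters, 4 characters, 5 characters, and so on).
* These tables form the array banWordsList, where the array index is the length
* of the words stored in that table:
* banWordsList[2] = {某马:true,屏蔽:true,啦啦:true};
* banWordsList[3] = {某个马:true,三个字:true,啦啦啦:true,小广告:true};
* banWordsList[4] = {某个坏银:true,四个字符:true,哈哈哈哈:true,就爱凤姐:true};
* banWordsList[5] = {某个大法好:true,五个敏感字:true};
* From the tables above, the following index is generated automatically.
* Each key is the first character of a sensitive word, and each value is an int.
* In the binary form of that int, bit i is 1 if the 4 tables above contain a word
* of length i starting with that character, and 0 otherwise
* (for example, 0x10 is binary 10000, meaning length 4).
* wordIndex = {屏:0x04,三:0x08,四:0x10,五:0x20,某:0x3c,啦:0x0c,哈:0x10,小:0x08,就:0x10};
*
* Check rules:
* 1. Go through the text character by character and test whether the character is in wordIndex.
* 2. If it is not, move on to the next character.
* 3. If it is, use the index value for that key to take this character plus the characters
*    that follow it, and look the result up in banWordsList[word length].
*
* Worked example
* Check whether the following text contains a sensitive word:
* “我就打小广告，气死版主”
* -- Check “我”
*    |- not in the index
* -- Check “就”
*    |- in the index
*    |- the index value of “就” is 0x10, so there are 4-character sensitive words starting with “就”
*    |- take “就” and the characters after it, 4 in total: “就打小广”
*    |- not in the 4-character table, continue
* -- Check “打”
*    |- not in the index
* -- Check “小”
*    |- in the index
*    |- the index value is 0x08, so there are 3-character sensitive words
*    |- take “小” and the characters after it, 3 in total: “小广告”
*    |- “小广告” is in the 3-character table, so the post contains a sensitive word and is rejected
*/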
public class BanWordsUtil {
    // public Logger logger = Logger.getLogger(this.getClass());
    public static final int WORDS_MAX_LENGTH = 10;
    public static final String BAN_WORDS_LIB_FILE_NAME = "banWords.txt";

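    // Sensitive-word tables; banWordsList[n] holds the words of length n
    public static Map<String, String>[] banWordsList = null;

    // Index from the first character of a word to a bitmask of word lengths
    public static Map<String, Integer> wordIndex = new HashMap<String, Integer>();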

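    /*
    * Initialize the sensitive-word tables and the first-character index.
    */
    @SuppressWarnings("unchecked")
    public static void initBanWordsList() throws IOException {
        if (banWordsList == null) {
            // One table per length 0..WORDS_MAX_LENGTH, so the array index equals the word length
            banWordsList = new Map[WORDS_MAX_LENGTH + 1];

            for (int i = 0; i < banWordsList.length; i++) {
                banWordsList[i] = new HashMap<String, String>();
            }
        }

        // Location of the word list: a text file on the classpath, one sensitive word per line
        String path = BanWordsUtil.class.getClassLoader()
                                        .getResource(BAN_WORDS_LIB_FILE_NAME)
                                        .getPath();
        System.out.println(path);

        // Assumption: the word list is UTF-8 encoded; adjust the charset name if it is not
        List<String> words = FileUtils.readLines(FileUtils.getFile(path), "UTF-8");

        for (String line : words) {
            if (StringUtils.isNotBlank(line)) {
                String w = line.trim().toLowerCase();

                // Skip words that have no table; otherwise the array access below would fail
                if ((w.length() == 0) || (w.length() > WORDS_MAX_LENGTH)) {
                    continue;
                }

                // Store the word in the table for its length
                banWordsList[w.length()].put(w, "");

                // Set bit w.length() in the index entry for the word's first character
                String first = w.substring(0, 1);
                Integer index = wordIndex.get(first);

                if (index == null) {
                    index = 0;
                }

                wordIndex.put(first, index | (1 << w.length()));
            }
        }
    }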

    /**
    * Search the text for sensitive words.
    * @param content the text to check
    * @return every sensitive word found, in order of appearance
    */
    public static List<String> searchBanWords(String content) {
        if (banWordsList == null) {
            try {
                initBanWordsList();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        List<String> result = new ArrayList<String>();

        for (int i = 0; i < content.length(); i++) {
            Integer index = wordIndex.get(content.substring(i, i + 1).toLowerCase());

            if (index == null) {
                continue;
            }

            // Bit p of index is set when some sensitive word of length p starts with this character.
            // Stop once no longer lengths remain or the candidate would run past the end of the text.
            for (int p = 1; ((index >> p) != 0) && ((i + p) <= content.length()); p++) {
                if (((index >> p) & 1) == 1) {
                    String sub = content.substring(i, i + p);

                    if (banWordsList[p].containsKey(sub.toLowerCase())) {
                        result.add(sub);

                        // System.out.println("Found sensitive word: " + sub);
                    }
                }
            }
        }

        return result;
    }

    public static void main(String[] args) throws IOException {
        // Test sentence; it means "A test sentence containing sensitive words."
        String content = "含有敏感词的测试语句。";
        BanWordsUtil.initBanWordsList();
        List<String> banWordList = BanWordsUtil.searchBanWords(content);
        for (String s : banWordList) {
            System.out.println("Found sensitive word: " + s);
        }
    }
}

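The test text above does not actually contain any sensitive words (I am afraid of getting blocked too, XD). When I added a few real ones during testing, they were all detected. This gives a simple and fast sensitive-word check. If you need more involved detection logic, for example for a sentence like “弹吉他妈妈真漂亮” ("Mom looks beautiful playing the guitar"), where a plain substring match can hit characters that straddle two separate words, you still need a word-segmentation tool to split the text first.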
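To make the first-character index from the class comment concrete, the sketch below rebuilds it for a few of the example words and prints each bitmask in hexadecimal. `WordIndexSketch` and `buildIndex` are hypothetical names used only for this illustration. The sketch uses a shift (`1 << length`) to set the bit for each word length, and skips words longer than 30 characters so the bit position stays inside a positive `int`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo of the first-character bitmask index described in the class comment.
final class WordIndexSketch {

    /** Maps the first character of each word to a bitmask whose bit i means "a word of length i exists". */
    static Map<String, Integer> buildIndex(List<String> words) {
        Map<String, Integer> index = new LinkedHashMap<String, Integer>();
        for (String w : words) {
            if (w.isEmpty() || w.length() > 30) {
                continue; // bit positions must fit in a positive int
            }
            String first = w.substring(0, 1);
            Integer mask = index.get(first);
            index.put(first, (mask == null ? 0 : mask) | (1 << w.length()));
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "某马", "某个马", "某个坏银", "某个大法好",
                "啦啦", "啦啦啦", "小广告", "就爱凤姐");
        for (Map.Entry<String, Integer> e : buildIndex(words).entrySet()) {
            System.out.println(e.getKey() + " -> 0x" + Integer.toHexString(e.getValue()));
        }
        // prints: 某 -> 0x3c, 啦 -> 0xc, 小 -> 0x8, 就 -> 0x10 (one entry per line)
    }
}
```

Reading a mask back is the reverse step: for “某”, 0x3c is binary 111100, so bits 2 through 5 are set and only candidates of 2, 3, 4 and 5 characters need to be looked up.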

This is my first time writing in Markdown, haha :)

PS: Thanks to everyone in the discussion thread for their patient answers. :) http://www.oschina.net/question/1010578_164557


Original article: http://my.oschina.net/u/1010578/blog/308904
