标签:doget nbsp 读取 children 设置 private row 核心 print
.*(关键词1|关键词2|关键词3).*
@WebServlet(name = "PatternControl", urlPatterns = {"/p"}) public class PatternControl extends HttpServlet { private static final Pattern pattern = initPattern(); private static Pattern initPattern() { List<String> stringList = null; try { stringList = Files.readAllLines(Paths.get("/Users/hans/Documents/word.txt")); } catch (IOException e) { e.printStackTrace(); } StringBuilder stringBuilder = new StringBuilder(".*("); stringBuilder.append(stringList.get(0)); for (int i = 1; i < stringList.size(); i++) { stringBuilder.append("|" + stringList.get(i)); } stringBuilder.append(").*"); Pattern pattern = Pattern.compile(stringBuilder.toString()); return pattern; } protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { if (pattern == null) { response.sendError(500, "pattern is null"); return; } if (request.getParameter("word") == null) { response.sendError(500, "word is null"); return; } boolean isMatcher = pattern.matcher(request.getParameter("word")).matches(); if (isMatcher) { response.sendError(209); } else { response.sendError(409); } } protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { doPost(request, response); } } |
关键词共有28448个,将其编译成上述的正则表达式
CPU | 2.2GHz Intel i7四核 |
---|---|
内存 |
16GB 1600 MHz DDR3 |
阶段 | 耗时(ms) |
---|---|
初始化 |
读取敏感词:38 编译正则表达式:41 |
每次匹配 | 47 |
阶段 | 消耗内存(MB) |
---|---|
初始化 编译正则表达式 | 11 |
每次匹配 | 极小 |
cpu和堆运行时情况图
利用正则表达式过滤敏感词效果较好
循环总共关键词个数次,判断时候在待匹配字符串中 是否包含 本次循环的关键词
@WebServlet(name = "PatternControl", urlPatterns = {"/p"}) public class PatternControl extends HttpServlet { private static final List<String> stringList = initStringList(); private static List<String> initStringList() { List<String> stringList = null; try { stringList = Files.readAllLines(Paths.get("/Users/hans/Documents/word.txt")); } catch (IOException e) { e.printStackTrace(); } return stringList; } private boolean matchKeyWord(String text) { for (String markWord : stringList) { if (text.contains(markWord)) { return true; } } return false; } protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { if (request.getParameter("word") == null) { response.sendError(500, "word is null"); return; } boolean isMatcher = matchKeyWord(request.getParameter("word")); if (isMatcher) { response.sendError(209); } else { response.sendError(409); } } protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { doPost(request, response); } } |
阶段 | 耗时(ms) |
---|---|
初始化 |
读取敏感词:38 |
每次匹配 | 10 |
阶段 | 消耗内存(MB) |
---|---|
初始化 | 3 |
每次匹配 | 极小 |
利用暴力匹配的效果更好
将所有的敏感词构成一棵路径树,每个节点只有一个字符
public class TestAcWithoutFail { public void insert(String str, Node root) { Node p = root; for (char c : str.toCharArray()) { if (p.childrenNodes.get(c) == null) { p.childrenNodes.put(c, new Node()); } p = p.childrenNodes.get(c); } p.isEnd = true; } public boolean isContainSensitiveWord(String text, Node root) { for (int i = 0; i < text.length(); i++) { Node nowNode = root; for (int j = i; j < text.length(); j++) { char word = text.charAt(j); nowNode = nowNode.childrenNodes.get(word); if (nowNode != null) { if (nowNode.isEnd) { return true; } } else { break; } } } return false; } public String containSensitiveWord(String text, Node root) { for (int i = 0; i < text.length(); i++) { Node nowNode = root; for (int j = i; j < text.length(); j++) { char word = text.charAt(j); nowNode = nowNode.childrenNodes.get(word); if (nowNode != null) { if (nowNode.isEnd) { return text.substring(i, j + 1); } } else { break; } } } return ""; } public static void main(String[] args) throws IOException, InterruptedException { List<String> stringList = Files.readAllLines(Paths.get("/Users/hans/Documents/word.txt")); TestAcWithoutFail acNoFail = new TestAcWithoutFail(); Node root = new Node(); for (String text : stringList) { acNoFail.insert(text, root); } String string = "tiresdfsdffffffffffaaaaaaaaaa"; Thread.sleep(10 * 1000); for (int i = 0; i < 1; i++) { System.out.println(acNoFail.isContainSensitiveWord(string, root)); } } } class Node { Map<Character, Node> childrenNodes = new HashMap<>(); boolean isEnd; } |
阶段 | 耗时(ms) |
---|---|
初始化 |
读取敏感词:38 构建比较树:36 |
每次匹配 | 0.01量级(执行1000遍34ms、执行10000遍130ms) |
阶段 | 消耗内存(MB) |
---|---|
初始化 |
读取字符串:3 构建比较树:24 |
每次匹配 | 极小 |
在该业务中和暴力匹配效果哪个好值得商榷
标签:doget nbsp 读取 children 设置 private row 核心 print
原文地址:http://www.cnblogs.com/ironPhoenix/p/6048692.html