一个有意思的需求——中文匹配度

时间：2014-07-13 19:13:57 阅读：266 评论：0 收藏：0 [点我收藏+]

引言

　　最近LZ带头在做一个互联网项目，互联网的东西总是那么新鲜，这也难怪大部分猿友都喜欢互联网。这个互联网项目不仅让LZ开发了一个HBase大数据应用，近期的一次需求讨论会上，又出来一个小需求，蛮有意思的。这些需求在之前枯燥的企业内部应用开发中，还是很难见到的，毕竟内部应用更多的是业务流程的体现。

　　具体的需求这里不方便透露，但简单的描述一下需求，就是如何判断两个公司名是一个。这其实就是Java当中字符串的相等判断，最简单的当然是用equals来判断。但是由于实际情况是，公司名是由客户手动输出的，难免有小小的偏差，因此equals自然就不适用了。比如“博客园科技发展有限公司”和“博客园科技发展（北京）有限公司”，这两个显然应该是一个公司，但是如果用equals，自然结果会是false。

方案提出

　　会议上，LZ简单提出了一个解决方案，就是利用分词来计算两者的匹配度，如果匹配度达到一定数值，我们就认为两个字符串是相等的。不过不论算法如何高明，准确率都不可能达到100%，这点也是LZ一再给业务同事强调的，否则到时候匹配错了，他们找LZ的茬可咋办。不过好在业务同事只是希望有一个提示功能，并不做严格的判断标准，所以系统需要做的只是初步的判断，因此准确率要求并不是特别的高，只能说越高越好。

　　由于这个项目是LZ以PM介入开发的，而且LZ一直都兼任SM，所以技术方案自然是LZ说了算了。没有讨论，没有异议，方案就这么在会议上定了。

小研究

　　由于LZ最近负责的事情比较多，所以上班的时候自然是没有时间研究这些东西，只能趁着周末小小研究一下。其实LZ对于分词并不是特别了解，这些东西与算法的关联度比较高，而LZ的算法基础只能说“呵呵”。不过不管怎样，作为一个项目的领头人，总得向前冲。如果到时候算法的时间、空间成本太高，或者准确率太低，等着后续寻找大神优化也不晚。

　　这件事的思路比较清晰，LZ需要做的就是两件事，第一件事就是“分词”，第二件事就是“匹配”。

分词

　　分词是比较简单的一步，有很多现成的类库可以使用，只要选择一个使用就可以了。Lucene是LZ第一个想到的，毕竟LZ也算是Apache的脑残粉，因此话不多说，第一时间就下载Lucene，开始了探究之旅。

　　以下是LZ在网络上找到的示例，稍微进行了一点修改。

package com.creditease.borrow.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class App 
{
    public static void main( String[] args ) throws IOException
    {
        String text = "博客园科技发展（北京）有限公司";
        SmartChineseAnalyzer smartChineseAnalyzer = new SmartChineseAnalyzer(Version.LUCENE_47);
        TokenStream tokenStream = smartChineseAnalyzer.tokenStream("field", text);
        CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.print(charTermAttribute.toString() + "  ");
        }
        tokenStream.end();
        tokenStream.close();
        smartChineseAnalyzer.close();
    }
    
}

　　输出结果为以下内容。

博  客  园  科技  发展  北京  有限公司

　　这种分词结果勉强可以接受了，最理想的应该是将“博客”放在一起，不过看来Lucene的字典里没有这个词。没关系，这对我们的计划并不影响，我们接下来开始着手匹配的事。

匹配

　　有了分词，我们要做的，就是匹配两个字符串数组，并得到一个匹配度。既然是匹配度，那么肯定就有分母和分子。我们可以挑其中一个数组的长度为分母，以另外一个数组中的元素找到匹配分词的个数作为分子。为了减少程序的复杂度，我们采用Set集合的去重特点来进行计算。就像以下程序这样。

package com.creditease.borrow.lucene;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class App 
{
    
    private static SmartChineseAnalyzer smartChineseAnalyzer = new SmartChineseAnalyzer(Version.LUCENE_47);
    
    public static void main( String[] args ) throws IOException
    {
        String text1 = "博客园科技发展（北京）有限公司";
        String text2 = "博客园科技发展有限公司";
        System.out.println(oneWayMatch(text1, text2));
    }
    
    public static double oneWayMatch(String text1,String text2) {
        try {
            Set<String> set = new HashSet<String>(10);
            TokenStream tokenStream = smartChineseAnalyzer.tokenStream("field", text1);
            CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                set.add(charTermAttribute.toString());
            }
            int denominator = set.size();
            tokenStream.end();
            tokenStream.close();
            tokenStream = smartChineseAnalyzer.tokenStream("field", text2);
            charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                set.add(charTermAttribute.toString());
            }
            int numerator = set.size() - denominator;
            double unmatchRate = ((double)numerator)/denominator;
            tokenStream.end();
            tokenStream.close();
            return unmatchRate;
        } catch (IOException e) {
            return 1D;
        }
        
    }
    
}

　　输出的结果为0，也就是说匹配度为100%。这显然是不对的，两个字符串很明显不是一模一样，得到匹配度为100%说明我们的算法还是有问题。

　　仔细分析一下，问题就出在我们是拿text2中不匹配的分词作为分子，而text2中的分词在text1中全部都包含，因此分子最终会是0，那么不匹配度自然就是0（unmatchRate）。为了弥补这一缺陷，LZ想来想去最终还是决定采取“双向”（twoWay）的办法解决这个问题，可以看到LZ给上面那个方法取的名字为“单向”匹配（oneWay）。

　　双向匹配其实就是将两者顺序颠倒，再进行一次匹配而已，因此我们的程序可以简单的更改为以下形式。

package com.creditease.borrow.lucene;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class App 
{
    
    private static SmartChineseAnalyzer smartChineseAnalyzer = new SmartChineseAnalyzer(Version.LUCENE_47);
    
    public static void main( String[] args ) throws IOException
    {
        String text1 = "博客园科技发展（北京）有限公司";
        String text2 = "博客园科技发展有限公司";
        System.out.println(twoWayMatch(text1, text2));
    }
    
    public static double twoWayMatch(String text1,String text2) {
        return (oneWayMatch(text1, text2) + oneWayMatch(text2, text1));
    }
    
    public static double oneWayMatch(String text1,String text2) {
        try {
            Set<String> set = new HashSet<String>(10);
            TokenStream tokenStream = smartChineseAnalyzer.tokenStream("field", text1);
            CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                set.add(charTermAttribute.toString());
            }
            int denominator = set.size();
            tokenStream.end();
            tokenStream.close();
            tokenStream = smartChineseAnalyzer.tokenStream("field", text2);
            charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                set.add(charTermAttribute.toString());
            }
            int numerator = set.size() - denominator;
            double unmatchRate = ((double)numerator)/denominator;
            tokenStream.end();
            tokenStream.close();
            return unmatchRate;
        } catch (IOException e) {
            return 1D;
        }
        
    }
    
}

　　该程序的输出结果为0.1666666666...，也就是0.17，也就是说两者的匹配度为83%。这个值显然更加接近实际的情况。可以看到，双向匹配以单向匹配为基础，将顺序颠倒的结果相加，就能得到不匹配度。

　　事情原本到此就可以结束了，但是还有一个更大的难点没有处理。那就是匹配度到底设置为多少合适。这个问题到目前为止，LZ还没有更好的办法。按照上面的小例子来讲，自然是设为80%就可以。

　　但是看下下面这个例子，就会发现这个数值很不正确，或者说匹配算法还是有问题。

public static void main( String[] args ) throws IOException
    {
        String text1 = "博客园科技发展（北京）有限公司";
        String text2 = "博客园有限公司";
        System.out.println(twoWayMatch(text1, text2));
    }

　　运行的结果为0.75，也就是说匹配度大约只有25%。但是很明显，上面两个公司实际上匹配度是很高的，因为最重要的三个字是匹配的。

　　再仔细分析一下，问题就变成权重了。也就是说，每个词的权重应该是不一样的，这样的话，匹配的准确度可能会更高一点。我们稍微改善一下程序。（以下统一使用后面加双斜线的方式表示程序主要改变的地方）

package com.creditease.borrow.lucene;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class App 
{
    
    private static SmartChineseAnalyzer smartChineseAnalyzer = new SmartChineseAnalyzer(Version.LUCENE_47);
    
    private static List<String> smallWeightWords = Arrays.asList("公司","有限公司","科技","发展","股份");
    
    private static double smallWeight = 0.3D;
    
    public static void main( String[] args ) throws IOException
    {
        String text1 = "博客园科技发展（北京）有限公司";
        String text2 = "博客园有限公司";
        System.out.println(twoWayMatch(text1, text2));
    }
    
    public static double twoWayMatch(String text1,String text2) {
        return (oneWayMatch(text1, text2) + oneWayMatch(text2, text1));
    }
    
    public static double oneWayMatch(String text1,String text2) {
        try {
            Set<String> set = new HashSet<String>(10);
            TokenStream tokenStream = smartChineseAnalyzer.tokenStream("field", text1);
            CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                set.add(charTermAttribute.toString());
            }
            int denominator = set.size();
            tokenStream.end();
            tokenStream.close();
            tokenStream = smartChineseAnalyzer.tokenStream("field", text2);
            charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            int smallWeightWordsCount = 0;//
            while (tokenStream.incrementToken()) {
                String word = charTermAttribute.toString();//
                int tempSize = set.size();//
                set.add(word);//
                if (tempSize + 1 == set.size() && smallWeightWords.contains(word)) {//
                    smallWeightWordsCount++;//
                }//
            }
            int numerator = set.size() - denominator;
            double unmatchRate = (smallWeightWordsCount * smallWeight + numerator - ((double)smallWeightWordsCount))/denominator;//
            tokenStream.end();
            tokenStream.close();
            return unmatchRate;
        } catch (IOException e) {
            return 1D;
        }
        
    }
    
}

　　现在程序的输出结果为0.4，匹配度为60%。从结果来看，依然有点不尽人意。仔细分析一下程序，会发现，我们计算不匹配度的时候，是交叉计算的。也就是说，我们使用一个数组中不匹配的数目去除以另外一个数组的大小，这可能会造成“极端”数值。

　　我们需要调整程序，让数组自己与自己计算，这样就不会出现那种情况。如下。

package com.creditease.borrow.lucene;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class App 
{
    
    private static SmartChineseAnalyzer smartChineseAnalyzer = new SmartChineseAnalyzer(Version.LUCENE_47);
    
    private static List<String> smallWeightWords = Arrays.asList("公司","有限公司","科技","发展","股份");
    
    private static double smallWeight = 0.3D;
    
    public static void main( String[] args ) throws IOException
    {
        String text1 = "博客园科技发展（北京）有限公司";
        String text2 = "博客园有限公司";
        System.out.println(twoWayMatch(text1, text2));
    }
    
    public static double twoWayMatch(String text1,String text2) {
        return (oneWayMatch(text1, text2) + oneWayMatch(text2, text1));
    }
    
    public static double oneWayMatch(String text1,String text2) {
        try {
            Set<String> set = new HashSet<String>(10);
            TokenStream tokenStream = smartChineseAnalyzer.tokenStream("field", text1);
            CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                set.add(charTermAttribute.toString());
            }
            int originalCount = set.size();//
            tokenStream.end();
            tokenStream.close();
            tokenStream = smartChineseAnalyzer.tokenStream("field", text2);
            charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            int smallWeightWordsCount = 0;
            int denominator = 0;//
            while (tokenStream.incrementToken()) {
                denominator++;//
                String word = charTermAttribute.toString();
                int tempSize = set.size();
                set.add(word);
                if (tempSize + 1 == set.size() && smallWeightWords.contains(word)) {
                    smallWeightWordsCount++;
                }
            }
            int numerator = set.size() - originalCount;
            double unmatchRate = (smallWeightWordsCount * smallWeight + numerator - ((double)smallWeightWordsCount))/denominator;//
            tokenStream.end();
            tokenStream.close();
            return unmatchRate;
        } catch (IOException e) {
            return 1D;
        }
        
    }
    
}