2. Opinion Extraction and Clustering: Code Walkthrough

Date: 2019-01-16 21:36:11


1. Introduction to pyhanlp and basic usage

2. Opinion extraction and clustering: code walkthrough

1. Introduction

This post shows how to do simple opinion extraction and clustering on text without any supervision.

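2. Opinion Extraction

Opinion extraction works on the dependency parse: it pulls out the parts of a sentence that match a fixed set of dependency structures and uses them to stand for the main meaning of the whole sentence.

In my view, the three relations that matter most are verb-complement ("动补结构"), verb-object ("动宾关系"), and preposition-object ("介宾关系"). The less important ones are attributive ("定中关系"), adverbial ("状中结构"), and subject-verb ("主谓关系"). Extraction starts from the head word, ROOT, and works outward.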
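
The idea can be sketched in a few lines. This is a simplified illustration, not the author's method: the sample rows are hand-written in the 10-column, tab-separated layout that the extraction code reads (word in column 2, head index in column 7, relation in column 8) and are not real parser output, and `parse_rows`, `simple_opinion`, and the root label "核心关系" are assumptions made for the example.

```python
# Hand-written sample rows (NOT real parser output), one token per line.
SAMPLE = "\n".join([
    "1\t手机\t手机\tn\tn\t_\t2\t主谓关系\t_\t_",
    "2\t是\t是\tv\tv\t_\t0\t核心关系\t_\t_",
    "3\t正品\t正品\tn\tn\t_\t2\t动宾关系\t_\t_",
])

# The three relations the article treats as important.
KEEP_RELATIONS = {"动补结构", "动宾关系", "介宾关系"}


def parse_rows(conll_text):
    """Split the text into rows and shift indices to be 0-based (root head = -1)."""
    rows = []
    for line in conll_text.strip().split("\n"):
        cols = line.split("\t")
        rows.append({
            "index": int(cols[0]) - 1,
            "word": cols[1],
            "head": int(cols[6]) - 1,
            "relation": cols[7],
        })
    return rows


def simple_opinion(rows, root_relation="核心关系"):
    """Keep the root plus every word reachable from it through KEEP_RELATIONS."""
    kept = {r["index"] for r in rows if r["relation"] == root_relation}
    changed = True
    while changed:  # repeat until no new word can be attached
        changed = False
        for r in rows:
            if (r["index"] not in kept and r["head"] in kept
                    and r["relation"] in KEEP_RELATIONS):
                kept.add(r["index"])
                changed = True
    # Return the kept words in their original sentence order.
    return [r["word"] for r in rows if r["index"] in kept]


print(simple_opinion(parse_rows(SAMPLE)))  # ['是', '正品']
```

In this sketch the subject 手机 is dropped because subject-verb is not one of the three kept relations. The full method also adds the keyword and whatever its `addToResult` helper collects, so its output can differ.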

The main extraction method is shown below; for the complete code, see GitHub.

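import operator  # needed for operator.itemgetter when sorting the result

'''
Keyword-based opinion extraction. Given a keyword key, find the root path that contains the keyword and extract the opinion under that root. The extraction itself is basically the same as in parseSentence.
Sentences with multiple roots are supported.
'''
def parseSentWithKey(self, sentence, key=None):
    # key is the keyword. If it is given, only sentences that contain key are analyzed; if not, no check is made.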
    if key:
        keyIndex = 0
        if key not in sentence:
            return []
    rootList = []
    parse_result = str(self.hanlp.parseDependency(sentence)).strip().split('\n')
    # Shift indices by 1 to make them 0-based, because pyhanlp's indices start at 1 (a head of 0 becomes -1).
    for i in range(len(parse_result)):
        parse_result[i] = parse_result[i].split('\t')
        parse_result[i][0] = int(parse_result[i][0]) - 1
        parse_result[i][6] = int(parse_result[i][6]) - 1
        if key and parse_result[i][1] == key:
            keyIndex = i

    for i in range(len(parse_result)):
        self_index = int(parse_result[i][0])
        target_index = int(parse_result[i][6])
        relation = parse_result[i][7]
        if relation in self.main_relation:
            if self_index not in rootList:
                rootList.append(self_index)
        # Find multiple roots: a word in a coordinate relation (并列关系) with a root is also a root
        elif relation == "并列关系" and target_index in rootList:
            if self_index not in rootList:
                rootList.append(self_index)


        if len(parse_result[target_index]) == 10:
            parse_result[target_index].append([])

        # Add an 11th field to the head's row: the indices of the other words that point to it
        if target_index != -1 and not (relation == "并列关系" and target_index in rootList):
            parse_result[target_index][10].append(self_index)
    
    # Find the root path that the key is on
    if key:
        rootIndex = 0
        if len(rootList) > 1:
            target = keyIndex
            while True:
                if target in rootList:
                    rootIndex = rootList.index(target)
                    break
                next_item = parse_result[target]
                target = int(next_item[6])
        loopRoot = [rootList[rootIndex]]
    else:
        loopRoot = rootList

    result = {}
    related_words = set()
    for root in loopRoot:
        # Add the key and the root to result
        if key:
            self.addToResult(parse_result, keyIndex, result, related_words)
        self.addToResult(parse_result, root, result, related_words)

    # Select opinion words based on '动补结构', '动宾关系', '介宾关系'
    for item in parse_result:
        relation = item[7]
        target = int(item[6])
        index = int(item[0])
        if relation in self.reverse_relation and target in result and target not in related_words:
            self.addToResult(parse_result, index, result, related_words)

    # Add the keyword itself
    for item in parse_result:
        word = item[1]
        if word == key:
            result[int(item[0])] = word

    # Sort the words already in result by their original order in the sentence
    sorted_keys = sorted(result.items(), key=operator.itemgetter(0))
    selected_words = [w[1] for w in sorted_keys]
    return selected_words

This method gives us the opinion for each sentence. Next, we cluster all the opinions.

2.1 Opinion extraction results

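Original sentence	Opinion
这个手机是正品吗？ (Is this phone genuine?)	手机是正品 (phone is genuine)
礼品是一些什么东西？ (What kind of things are the gifts?)	礼品是什么东西 (gifts are what things)
现在都送什么礼品啊 (What gifts are being given now?)	都送什么礼品 (give what gifts)
直接付款是怎么付的啊 (How does direct payment work?)	付款是怎么付 (payment is paid how)
如果不满意也可以退货的吧 (I can return it if I'm not satisfied, right?)	不满意可以退货 (not satisfied, can return)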

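3. Opinion Clustering

There are a few ways to cluster opinions:

Compute the similarity between two opinions directly. (This is the method I use.)
Convert the opinions to vectors and compare their cosine distance.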
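
For the vector-based option, a minimal sketch could look like the following. The article does not say how opinions would be vectorized, so the character-count vectors and the name `char_cosine` are assumptions made for illustration.

```python
import collections
import math


def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two strings."""
    va, vb = collections.Counter(a), collections.Counter(b)
    # Dot product over the characters of a; characters missing from b count as 0.
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm = (math.sqrt(sum(n * n for n in va.values()))
            * math.sqrt(sum(n * n for n in vb.values())))
    return dot / norm if norm else 0.0


# Same characters in a different order: the vectors are identical.
print(char_cosine("花呗可以分期", "可以花呗分期"))
# No characters in common: the vectors are orthogonal.
print(char_cosine("可以优惠多少", "是不是翻新机"))
```

Character counts ignore word order, so this measure treats reordered opinions as identical, whereas a sequence-based comparison such as difflib does not.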

My method uses difflib to compare every pair of opinions. The time complexity is high, \(O(n^2)\), so I use a small trick to speed it up. The code is below:

def extractor(self):
    de = DependencyExtraction()
    opinionList = OpinionCluster()
    for sent in self.sentences:
        keyword = ""
        if not self.keyword:
            keyword = ""
        else:
            checkSent = []
            for word in self.keyword:
                if sent not in checkSent and word in sent:
                    keyword = word
                    checkSent.append(sent)
                    break

        opinion = "".join(de.parseSentWithKey(sent, keyword))
        if self.filterOpinion(opinion):
            opinionList.addOpinion(Opinion(sent, opinion, keyword))


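    '''
        Two thresholds are used here. The small threshold first cuts the large data set into small chunks; because it is small, opinions that belong to the same cluster mostly still land in the same chunk.
        Each small chunk is then clustered on its own, which is much faster: thresholds=[0.2, 0.6] is about 30 times faster than thresholds=[0.6].
        However, [0.2, 0.6] and [0.6] do not give identical results; the staged version splits some identical opinions apart.
    '''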
    thresholds = self.json_config["thresholds"]
    clusters = [opinionList]
    for threshold in thresholds:
        newClusters = []
        for cluster in clusters:
            newClusters += self.clusterOpinion(cluster, threshold)
        clusters = newClusters

    resMaxLen = {}
    for oc in clusters:
        if len(oc.getOpinions()) >= self.json_config["minClusterLen"]:
            summaryStr = oc.getSummary(self.json_config["freqStrLen"])
            resMaxLen[summaryStr] = oc.getSentences()

    return self.sortRes(resMaxLen)
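
The `clusterOpinion` helper called above is not shown in this post. As a rough sketch of how a staged, threshold-based pass might work, assuming difflib's `SequenceMatcher.ratio()` as the similarity measure and a greedy "join the first cluster whose first member is similar enough" rule (both are assumptions, as are the names `cluster_once` and `cluster_staged`):

```python
import difflib


def similarity(a, b):
    """Similarity in [0, 1] between two strings, via difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def cluster_once(opinions, threshold):
    """Greedy single pass: compare each opinion with the first member of each cluster."""
    clusters = []
    for op in opinions:
        for cluster in clusters:
            if similarity(cluster[0], op) >= threshold:
                cluster.append(op)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([op])
    return clusters


def cluster_staged(opinions, thresholds):
    """Apply the thresholds in order, re-clustering inside each chunk from the previous stage."""
    clusters = [list(opinions)]
    for threshold in thresholds:
        next_clusters = []
        for cluster in clusters:
            next_clusters.extend(cluster_once(cluster, threshold))
        clusters = next_clusters
    return clusters


opinions = ["可以优惠多少", "能优惠多少", "是不是翻新机", "会不会是翻新机"]
for group in cluster_staged(opinions, [0.2, 0.6]):
    print(group)
```

The loose first threshold is cheap to satisfy, so it mostly separates unrelated opinions into chunks. The stricter second pass then only compares opinions within the same chunk, which is where the speedup would come from.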

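3.1 Opinion summary

For each cluster of opinions, we pick one opinion that represents the whole cluster well.

My method counts character frequencies over all opinions in the cluster, builds a string from the high-frequency characters, computes its similarity to every opinion, and takes the opinion with the highest similarity as the summary of the whole cluster.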

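import collections  # needed for collections.Counter below

def getSummary(self, freqStrLen):
    opinionStrs = []
    for op in self._opinions:
        opinion = op.opinion
        opinionStrs.append(opinion)

    # Count character frequencies
    word_counter = collections.Counter(list("".join(opinionStrs))).most_common()

    # Keep the characters that occur at least freqStrLen times
    freqStr = ""
    for item in word_counter:
        if item[1] >= freqStrLen:
            freqStr += item[0]

    maxSim = -1
    maxOpinion = ""
    for opinion in opinionStrs:
        # similarity is a helper that is not shown in this post
        sim = similarity(freqStr, opinion)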
        if sim > maxSim:
            maxSim = sim
            maxOpinion = opinion

    return maxOpinion
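
To try the summary step in isolation, here is a standalone sketch. The `similarity` helper is not shown in this post, so difflib's ratio stands in for it as an assumption; `summarize`, `min_count`, and the toy cluster are also made up for the example.

```python
import collections
import difflib


def similarity(a, b):
    # Assumption: stand-in for the similarity helper that the post does not show.
    return difflib.SequenceMatcher(None, a, b).ratio()


def summarize(opinions, min_count):
    """Pick the opinion closest to the string of frequent characters."""
    counts = collections.Counter("".join(opinions)).most_common()
    # Frequent characters, ordered from most to least common.
    freq_str = "".join(ch for ch, n in counts if n >= min_count)
    # max() returns the first opinion with the highest similarity.
    return max(opinions, key=lambda op: similarity(freq_str, op))


cluster = ["能送充电器", "送充电器", "买送充电器"]  # toy data, not from the article
print(summarize(cluster, min_count=2))  # 送充电器
```

In this toy cluster the characters 送, 充, 电, 器 each appear three times, so the frequent-character string is exactly 送充电器 and that opinion wins with a similarity of 1.0.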

3.2 Opinion summary results

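Cluster summary	All opinions in the cluster
手机是全新正品 (phone is brand-new and genuine)	手机是全新正品 / 手机是全新 / 手机是不是正品 / 保证是全新手机
能送无线充电器 (can give a wireless charger)	能送无线充电器 / 人家送无线充电器 / 送无线充电器 / 买能送无线充电器
可以优惠多少 (how much discount is possible)	可以优惠多少 / 你好可优惠多少 / 能优惠多少 / 可以优惠多少
是不是翻新机 (is it a refurbished phone)	是不是翻新机 / 不会是翻新机 / 手机是还是翻新 / 会不会是翻新机
花呗可以分期 (Huabei can pay in installments)	花呗不够可以分期 / 花呗分期可以 / 可以花呗分期 / 花呗可以分期
没有给发票 (no invoice was given)	我没有发票 / 发票有开给我 / 没有给发票 / 你们有给发票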

4. Summary

These are some simple opinion extraction and clustering techniques I put together; they are suitable for simple scenarios.


Original article: https://www.cnblogs.com/huangyc/p/10279254.html
