Sentence Segmentation for Chinese and Japanese Corpora in Python 3

Date: 2018-04-27 13:23:24


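0. Background

I have recently been looking at sentence alignment and word alignment for parallel corpora, and alignment requires sentence segmentation first.
I first wrote a method based on regular expressions and a quote open/close flag. Partway through I thought of a small trick that makes the code simpler and more general, so I am sharing this short piece of code.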

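1. Approach

In some cases punctuation is itself a fairly good feature, so the goal here is to split on it as correctly as possible.
The main issues considered are:

Keeping the delimiters attached to their sentences
Sentences inside quotation marks
Several punctuation marks in a row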

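Once it is decided not to split inside quotation marks, a small trick makes the approach very clear:
each span inside quotes or brackets is saved as a whole in a queue, and a placeholder marker is left in its place.
After splitting, the markers are replaced with the original spans.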
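
The pack-and-restore idea described above can be sketched in isolation as follows. This is a minimal illustration rather than the author's code: the names `PLACEHOLDER`, `QUOTED`, `pack` and `unpack` are hypothetical, only two kinds of paired marks are covered, and the placeholder is assumed never to occur in the input.

```python
import re

# Hypothetical names for illustration; the placeholder must not occur in the input.
PLACEHOLDER = "$PACK$"
# Only two kinds of paired marks are covered to keep the sketch short.
QUOTED = re.compile(r"“.+?”|（.+?）")


def pack(text):
    """Replace each quoted/bracketed span with PLACEHOLDER; return text and spans."""
    spans = QUOTED.findall(text)  # spans in left-to-right order
    # A function replacement avoids any escape processing of the placeholder.
    packed = QUOTED.sub(lambda m: PLACEHOLDER, text)
    return packed, spans


def unpack(text, spans):
    """Put the saved spans back, one placeholder at a time, in order."""
    for span in spans:
        text = text.replace(PLACEHOLDER, span, 1)
    return text


original = "他说：“你好。再见！”然后走了（很快）。"
packed, spans = pack(original)
print(packed)
print(unpack(packed, spans) == original)
```

Because the sentence-final marks inside the quotes are hidden behind the placeholder, a later splitting step cannot cut there.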

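2. Code

Note: the split point used here is a zero-width regular expression. In the Python version used when this was written, re.split() could not split on such a pattern and raised ValueError (support for splitting on empty matches was added in Python 3.7), so the code walks the string with re.search() instead.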
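
One way to cut at zero-width positions without relying on re.split() is to iterate over the matches with re.finditer() and slice the string by hand. The sketch below is an illustration under that assumption; `BOUNDARY` and `split_at_boundaries` are hypothetical names, and only the three marks 。？！ are treated as sentence-final.

```python
import re

# Zero-width boundary: just after a sentence-final mark, not before another one.
BOUNDARY = re.compile(r"(?<=[。？！])(?![。？！])")


def split_at_boundaries(text):
    """Cut text at every zero-width match of BOUNDARY, keeping the punctuation."""
    pieces = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.start()
        if end > start:
            pieces.append(text[start:end])
        start = end
    if start < len(text):
        pieces.append(text[start:])  # trailing text without a final mark
    return pieces


print(split_at_boundaries("真的吗？！好的。走吧"))
```

The lookbehind keeps each mark attached to the sentence it ends, and the negative lookahead keeps a run such as ？！ together instead of splitting between the two marks.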

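import re


def my_split(string):
    """
    Treat quoted or bracketed text as a unit: save it in a queue and restore it later.
    Ellipses are not handled for now.
    # todo: consider splitting inside reported speech,
    # e.g. ‘xxx：“xxx。”xx，xxxx。’
    # which could be split further.
    """
    SPLIT_SIGN = '%%%%'  # the input string must not already contain this separator

    # Placeholder used for packed spans: $PACK$
    SIGN = '$PACK$'
    search_pattern = re.compile(r'\$PACK\$')
    # Paired quotes and brackets whose contents must not be split
    pack_pattern = re.compile(r'(“.+?”|（.+?）|《.+?》|〈.+?〉|\[.+?\]|【.+?】|‘.+?’|「.+?」|『.+?』|".+?"|\'.+?\')')
    pack_queue = re.findall(pack_pattern, string)
    string = re.sub(pack_pattern, SIGN, string)

    # Zero-width split point: after 。？！ and not before another 。？！
    pattern = re.compile('(?<=[。？！])(?![。？！])')
    result = []
    while string != '':
        s = re.search(pattern, string)
        if s is None:
            # No further split point: the rest is one sentence
            result.append(string)
            break
        loc = s.span()[0]
        result.append(string[:loc])
        string = string[loc:]

    # Restore the packed spans in order, one placeholder at a time
    result_string = SPLIT_SIGN.join(result)
    while pack_queue:
        pack = pack_queue.pop(0)
        loc = re.search(search_pattern, result_string).span()
        result_string = result_string[:loc[0]] + pack + result_string[loc[1]:]

    return result_string.split(SPLIT_SIGN)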
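
For comparison, the same idea can be condensed if one assumes Python 3.7 or later, where re.split() accepts a pattern that matches the empty string. This is a sketch under that assumption, not a replacement for the function above: `split_sentences`, `PACK` and `BOUNDARY` are hypothetical names, the set of paired marks is reduced for brevity, and the placeholder is assumed not to appear in the input.

```python
import re

SIGN = "$PACK$"  # assumed not to occur in the input text
# Reduced set of paired marks, chosen for brevity in this sketch.
PACK = re.compile(r"“.+?”|「.+?」|『.+?』|（.+?）|《.+?》")
BOUNDARY = re.compile(r"(?<=[。？！])(?![。？！])")


def split_sentences(text):
    """Split text into sentences, never cutting inside a packed span."""
    packs = iter(PACK.findall(text))
    masked = PACK.sub(lambda m: SIGN, text)
    # Requires Python 3.7+: re.split() on a zero-width pattern.
    # Empty pieces (for example after a final mark) are dropped.
    parts = [p for p in BOUNDARY.split(masked) if p]
    # Restore spans in order; next(packs) is called once per placeholder.
    return [re.sub(re.escape(SIGN), lambda m: next(packs), p) for p in parts]


sentences = split_sentences("彼は「もう帰る。」と言った。本当に？！私は《雪国》を読んだ。")
for sentence in sentences:
    print(sentence)
```

The function replacement passed to re.sub() inserts each saved span literally, so backslashes or other special characters inside quoted text are not reinterpreted.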

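References

Implementing Chinese sentence segmentation in Python (使用 Python 实现中文分句)
 github address (I did not delete the brute-force version either. It reminds me of an algorithm problem I once solved, but I cannot recall which one.)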


Original article: https://www.cnblogs.com/Comero/p/8932349.html
