Today I was studying a Perl script and found several of its regular expressions quite puzzling:
# non-period end-of-sentence markers (? or !) followed by sentence starters
$text =~ s/([?!]) +([\'\"\(\[\¿\¡\p{IsPi}]*[\p{IsUpper}])/$1\n$2/g;
# multi-dots followed by sentence starters
$text =~ s/(\.[\.]+) +([\'\"\(\[\¿\¡\p{IsPi}]*[\p{IsUpper}])/$1\n$2/g;
# add breaks for sentences that end with some sort of punctuation inside a quote or parenthetical and are followed by a possible sentence starter punctuation and upper case
$text =~ s/([?!\.][\ ]*[\'\"\)\]\p{IsPf}]+) +([\'\"\(\[\¿\¡\p{IsPi}]*[\ ]*[\p{IsUpper}])/$1\n$2/g;
# add breaks for sentences that end with some sort of punctuation and are followed by a sentence starter punctuation and upper case
$text =~ s/([?!\.]) +([\'\"\(\[\¿\¡\p{IsPi}]+[\ ]*[\p{IsUpper}])/$1\n$2/g;
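To see what the four substitutions above do in practice, here is a minimal sketch that wraps them in a helper and runs them on a few made-up strings. The name `split_sentences` and the sample inputs are my own assumptions, not part of the script, and the sentence-starter class is trimmed to the ASCII openers plus `\p{IsPi}` to keep the example in pure ASCII source.

```perl
use strict;
use warnings;
binmode(STDOUT, ':encoding(UTF-8)');   # samples contain curly quotes

# Hypothetical helper (not from the original script): applies the four
# substitutions in the same order and returns the result.
sub split_sentences {
    my ($text) = @_;
    # 1. "?" or "!" followed by an optional opener and an uppercase letter
    $text =~ s/([?!]) +([\'\"\(\[\p{IsPi}]*[\p{IsUpper}])/$1\n$2/g;
    # 2. two or more dots followed by an optional opener and an uppercase letter
    $text =~ s/(\.[\.]+) +([\'\"\(\[\p{IsPi}]*[\p{IsUpper}])/$1\n$2/g;
    # 3. end punctuation plus a closing quote/bracket, then an uppercase start
    $text =~ s/([?!\.][\ ]*[\'\"\)\]\p{IsPf}]+) +([\'\"\(\[\p{IsPi}]*[\ ]*[\p{IsUpper}])/$1\n$2/g;
    # 4. end punctuation, then at least one opener and an uppercase letter
    $text =~ s/([?!\.]) +([\'\"\(\[\p{IsPi}]+[\ ]*[\p{IsUpper}])/$1\n$2/g;
    return $text;
}

# Made-up sample inputs; \x{201C} and \x{201D} are the curly double quotes.
my @samples = (
    "Really? Yes! \x{201C}Sure.\x{201D}",    # rule 1
    "Wait... Then go.",                      # rule 2
    "\x{201C}Stop.\x{201D} He did.",         # rule 3
    "He left. \x{201C}Why?\x{201D}",         # rule 4
    "Mr. Smith left. He ran.",               # none of the four rules applies
);

for my $s (@samples) {
    print "IN:  $s\n";
    print "OUT: ", join(' | ', split /\n/, split_sentences($s)), "\n\n";
}
```

Note that a single period followed directly by a capital letter (the last sample) is not split by any of these four rules: rule 4 requires at least one opening punctuation character before the uppercase letter.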
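In these patterns, the name that follows \p denotes a Unicode property. Every Unicode code point carries a set of properties (its general category, whether it is uppercase, and so on), and Perl lets us match characters by those properties. Here \p{IsPi} matches initial (opening) quotation punctuation such as “ or «, \p{IsPf} matches final (closing) quotation punctuation such as ” or », and \p{IsUpper} matches uppercase letters in any script.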
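As a quick check of how these property classes behave, the sketch below tests a handful of characters against the three properties used in the script. The helper `props_of` and the list of characters are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;
binmode(STDOUT, ':encoding(UTF-8)');   # we print non-ASCII characters

# Hypothetical helper: returns a comma-separated list of which of the
# three properties used in the script a single character has.
sub props_of {
    my ($ch) = @_;
    my @hits;
    push @hits, 'IsPi'    if $ch =~ /\p{IsPi}/;      # initial quote punctuation
    push @hits, 'IsPf'    if $ch =~ /\p{IsPf}/;      # final quote punctuation
    push @hits, 'IsUpper' if $ch =~ /\p{IsUpper}/;   # uppercase letter
    return join(',', @hits);
}

# Curly quotes, single angle quotes, Latin and Greek letters, ASCII quote.
my @chars = (
    "\x{201C}", "\x{201D}",   # left / right double quotation mark
    "\x{2039}", "\x{203A}",   # single left / right angle quotation mark
    "A", "a",
    "\x{394}", "\x{3B4}",     # Greek capital / small delta
    '"',                      # ASCII double quote (plain punctuation)
);

for my $ch (@chars) {
    printf "U+%04X %s: %s\n", ord($ch), $ch, props_of($ch) || '(none)';
}
```

The ASCII double quote has neither the initial nor the final property, which is presumably why the script lists `\'` and `\"` explicitly in its character classes alongside `\p{IsPi}` and `\p{IsPf}`.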
For an introduction to Unicode properties, see:
http://shouce.jb51.net/perl/PatternMatching.html
http://blog.csdn.net/wushuai1346/article/details/7206749
http://perldoc.perl.org/perluniprops.html
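Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission.
Original URL: http://blog.csdn.net/lampqiu/article/details/47317951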