python-re: Matching Chinese Text

Posted: 2014-08-25 22:37:14


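#coding=utf-8
# Python 2 code (print statement, u"" literals)
import re
import chardet  # module for detecting the encoding of a web page

# Plain ASCII: find every run of digits
p = re.compile(r'\d+')
print p.findall('one1two2three3four4')

# Two groups: each match is returned as a tuple
a = "rewfd231321ewq21weqeqw"
p = re.compile(r"(\d+)\D+(\d+)", re.S)
b = p.findall(a)
print b

# Chinese text: use a unicode string for both the pattern and the subject
a = u"我爱@糗百，你呢"
print a
b = re.findall(u"(.+?)@糗百(.+)", a, re.S)
print b
for i in b:
    for j in i:
        print j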

Output:

['1', '2', '3', '4']
[('231321', '21')] # with several groups, findall returns a list of tuples, [(), ()]; with a single group it returns a plain list of strings, ["", ""]
我爱@糗百，你呢
[(u'\u6211\u7231', u'\uff0c\u4f60\u5462')]
我爱
，你呢
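
The note on the shape of findall's result can be checked with a small sketch. It is written for Python 3 (the listings in this post are Python 2), and the variable names are made up for illustration.

```python
import re

text = "rewfd231321ewq21weqeqw"

# No capturing group: a list of whole matches (strings).
no_group = re.findall(r"\d+", text)

# Exactly one capturing group: still a list of strings,
# each holding the text captured by that group.
one_group = re.findall(r"(\d+)\D+", text)

# Two or more capturing groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\d+)\D+(\d+)", text)

print(no_group)
print(one_group)
print(two_groups)
```

So the nested loop in the listing above is only needed when the pattern has two or more groups; with one group, a single loop over the list is enough.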

------------------------------------------

If the encoding of the Chinese text is unknown, which is usually the case for text scraped from the web:

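#coding=utf-8
# Python 2 code; the coding line is required because the source contains Chinese literals
import re
import chardet  # module for detecting the encoding of a web page

a = "我爱@糗百，你呢"  # a byte string (str), encoding assumed unknown
if not isinstance(a, unicode):
    # Let chardet guess the encoding, then decode the bytes to unicode
    codesty = chardet.detect(a)
    a = a.decode(codesty['encoding'])
print a
b = re.findall(u"(.+?)@糗百(.+)", a, re.S)
print b
for i in b:
    for j in i:
        print j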

Here the chardet module is used to detect the encoding of the text, which is then converted to unicode.

Output:

我爱@糗百，你呢
[(u'\u6211\u7231', u'\uff0c\u4f60\u5462')]
我爱
，你呢
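
When chardet is not available, one possible fallback is to try a short list of likely encodings in turn. This is a different technique from the detection used above, not a replacement for it. The sketch below is Python 3 (where text is `str` and undecoded data is `bytes`); the helper name `to_text` and the candidate list are assumptions made for illustration.

```python
import re


def to_text(data, candidates=("utf-8", "gbk")):
    """Return data as str, decoding bytes with the first candidate that works."""
    if isinstance(data, str):
        return data  # already text, nothing to decode
    for encoding in candidates:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Pretend these bytes came from a web page of unknown encoding (here: GBK).
raw = "我爱@糗百，你呢".encode("gbk")
text = to_text(raw)
pairs = re.findall(r"(.+?)@糗百(.+)", text, re.S)
print(pairs)
```

UTF-8 is tried first because it is strict and rejects most byte sequences that are not UTF-8, whereas GBK accepts many byte sequences and so works better as a later candidate.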

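Of course, if you want to run the .py file by double-clicking it on Windows, each resulting string should additionally be encoded before printing, with j.encode("GBK").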
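
What that extra encode step does to the data can be sketched as follows. This is Python 3, where print normally encodes for the console on its own, so the snippet only illustrates the conversion itself; the variable names are illustrative.

```python
text = "我爱"

gbk_bytes = text.encode("gbk")     # the kind of bytes j.encode("GBK") produces
utf8_bytes = text.encode("utf-8")  # the same text in UTF-8, for comparison

# GBK uses 2 bytes per Chinese character here, UTF-8 uses 3.
print(len(gbk_bytes), len(utf8_bytes))

# Decoding with the matching codec restores the original text.
print(gbk_bytes.decode("gbk") == text)
```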

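Note: convert Chinese text to unicode before processing it; do not run a regular expression directly on the raw, undecoded byte string. How to convert ASCII to Unicode is left for when the need comes up.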
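
A short sketch of why matching on undecoded bytes goes wrong. It uses Python 3, where the two kinds of string are `str` and `bytes`, and assumes the bytes are UTF-8; the variable names are illustrative.

```python
import re

text = "我爱@糗百"
data = text.encode("utf-8")  # the undecoded form of the same text

# On text, "." consumes one character at a time.
first_char = re.match(r".", text, re.S).group()

# On bytes, "." consumes one byte, which is only part of a Chinese character.
first_byte = re.match(rb".", data, re.S).group()

print(repr(first_char))
print(repr(first_byte))
print(len(text), len(data))
```

Because each Chinese character spans several bytes, quantifiers and character classes applied to the byte string count bytes rather than characters, which is why decoding first is the safer habit.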


Original article: http://www.cnblogs.com/fkissx/p/3935875.html
