
Problems with chardet

Date: 2018-12-16 16:45:17


1. The response to a web request included a file name:

   file_name = b'\xba\xe3\xcb\xb3\xd6\xda\x95N(300208)_\xcf\xd6\xbd\xf0\xc1\xf7\xc1\xbf\xb1\xed.xls'

   Running chardet.detect on it reported 'GB2312', but decoding with that encoding failed:

>>> file_name.decode('GB2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode byte 0x95 in position 6: illegal multibyte sequence

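   The failing byte pair is \x95\x4e (the \x4e shows up as N in the bytes literal). Looking up the corresponding character at https://bianma.supfree.net/chaye.asp?id=6607 shows that the name is encoded in 'GBK', which is a superset of 'GB2312'. Decoding with 'GBK' works and no longer raises an error: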

>>> file_name.decode('gbk', errors='ignore')
'恒顺众昇(300208)_现金流量表.xls'
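To see why GB2312 fails while GBK succeeds, it helps to find which characters of the decoded name fall outside GB2312. The following is only a small sketch: the helper name `chars_outside` is made up for illustration, and the bytes are copied from the file name above.

```python
def chars_outside(text, codec):
    """Return the characters in `text` that `codec` cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# Bytes copied from the file name in this post.
file_name = b'\xba\xe3\xcb\xb3\xd6\xda\x95N(300208)_\xcf\xd6\xbd\xf0\xc1\xf7\xc1\xbf\xb1\xed.xls'

# Decode with the wider codec first, then ask what GB2312 cannot represent.
text = file_name.decode('gbk')
outside_gb2312 = chars_outside(text, 'gb2312')

print(outside_gb2312)
print([ch.encode('gbk') for ch in outside_gb2312])
```

A single character outside GB2312 is enough to make a strict GB2312 decode fail for the whole name, even though every other character is fine.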

2. If the case above is still acceptable, the next one is harder to justify.

>>> import chardet
>>> file_bname = b'\xc2\xf5\xc8\xf0\xd2\xbd\xc1\xc6(300760)_\xc0\xfb\xc8\xf3\xb1\xed.xls'
>>> chardet.detect(file_bname)['encoding']
>>> print(chardet.detect(file_bname)['encoding'])
None
>>> file_bname.decode('gbk')
'迈瑞医疗(300760)_利润表.xls'
>>> file_bname.decode('gb2312')
'迈瑞医疗(300760)_利润表.xls'

 As shown above, chardet returns no encoding at all in the Python shell. In Scrapy, however, it reports the following encoding:

2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: CP932 Japanese prober hit error at byte 43
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: EUC-TW Taiwan prober hit error at byte 27
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: utf-8 not active
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: CP932 not active
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: EUC-JP Japanese confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: GB2312 Chinese confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: EUC-KR Korean confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: CP949 Korean confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: Big5 Chinese confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: EUC-TW not active
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1251 Russian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: KOI8-R Russian confidence = 0.11814918824024898
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-5 Russian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: MacCyrillic Russian confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: IBM866 Russian confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: IBM855 Russian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-7 Greek confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1253 Greek confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-5 Bulgairan confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1251 Bulgarian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: TIS-620 Thai confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-9 Turkish confidence = 0.35989894691932234
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1251 Russian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: KOI8-R Russian confidence = 0.11814918824024898
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-5 Russian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: MacCyrillic Russian confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: IBM866 Russian confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: IBM855 Russian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-7 Greek confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1253 Greek confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-5 Bulgairan confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1251 Bulgarian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: TIS-620 Thai confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-9 Turkish confidence = 0.35989894691932234
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
encoding: ISO-8859-9

As a result, decode produces garbled text, even though these bytes are actually encoded in 'GBK'.
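The garbling can be reproduced, and undone, with the standard library alone. This is a sketch under one assumption: that the text was decoded as ISO-8859-9 (Python codec name `iso8859_9`) and has not been modified since. The bytes are the `file_bname` value from the session above.

```python
file_bname = b'\xc2\xf5\xc8\xf0\xd2\xbd\xc1\xc6(300760)_\xc0\xfb\xc8\xf3\xb1\xed.xls'

# Decoding with the wrongly detected codec does not raise an error.
# It silently yields accented Latin letters instead of Chinese text.
garbled = file_bname.decode('iso8859_9')

# Each of these bytes maps to exactly one character in iso8859_9,
# so encoding the garbled text back recovers the original bytes.
recovered = garbled.encode('iso8859_9')

# Decoding the recovered bytes with the real encoding gives the file name.
repaired = recovered.decode('gbk')

print(garbled)
print(repaired)
```

The same round trip does not help when the wrong decode used errors='ignore' or errors='replace', because the original bytes are already lost at that point.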

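These two examples show that chardet's detection can be well off the mark, especially on short inputs such as file names. Keep this in mind when using it in practice.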
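One way to act on this is to treat the detector's answer as a hint rather than a fact. The sketch below is not part of chardet: `safe_decode`, `min_confidence`, the 0.9 threshold and the fallback list are all assumptions chosen for illustration. The `detect` argument is any callable that, like chardet.detect, takes bytes and returns a dict with 'encoding' and 'confidence' keys. The two stand-in detectors replay the results seen in this post (the 0.99 confidence for GB2312 is made up, the ISO-8859-9 value comes from the log).

```python
def safe_decode(data, detect=None, min_confidence=0.9,
                fallbacks=('utf-8', 'gb18030')):
    """Decode bytes, using the detector's guess only as a hint.

    Returns a tuple (text, codec_name_used). `fallbacks` must not be empty.
    """
    guess = detect(data) if detect is not None else {}
    candidates = []
    # Trust the guess only when the detector itself is confident.
    if guess.get('encoding') and (guess.get('confidence') or 0.0) >= min_confidence:
        candidates.append(guess['encoding'])
    candidates.extend(fallbacks)

    for name in candidates:
        try:
            return data.decode(name), name  # strict decode
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # Last resort: decode with replacement characters instead of raising.
    return data.decode(fallbacks[-1], errors='replace'), fallbacks[-1]


# File names copied from this post.
name_1 = b'\xba\xe3\xcb\xb3\xd6\xda\x95N(300208)_\xcf\xd6\xbd\xf0\xc1\xf7\xc1\xbf\xb1\xed.xls'
name_2 = b'\xc2\xf5\xc8\xf0\xd2\xbd\xc1\xc6(300760)_\xc0\xfb\xc8\xf3\xb1\xed.xls'


def says_gb2312(data):
    # Stand-in for case 1; the confidence value is an assumption.
    return {'encoding': 'GB2312', 'confidence': 0.99}


def says_iso8859_9(data):
    # Stand-in for case 2; the confidence value is taken from the log above.
    return {'encoding': 'ISO-8859-9', 'confidence': 0.35989894691932234}


text_1, used_1 = safe_decode(name_1, detect=says_gb2312)
text_2, used_2 = safe_decode(name_2, detect=says_iso8859_9)
print(used_1, text_1)
print(used_2, text_2)
```

With chardet installed, the real detector can be passed as detect=chardet.detect. Putting gb18030 in the fallback list is a design choice that suits sites known to serve Chinese text, since it covers both GB2312 and GBK; other sources would need a different list.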


Original article: https://www.cnblogs.com/zmiao/p/10126788.html
