Garbled text when fetching Chinese web pages in Python 3

Date: 2015-02-02 12:19:17


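When reading a web page in Python 3, the Chinese text can come out garbled, and opening the file directly in text mode fails with an error: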

Traceback (most recent call last):
  File "E:/Source_Code/python34/HTMLParser_in_3.py", line 81, in <module>
    context = f.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 175: illegal multibyte sequence
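The 'gbk' in the traceback suggests that open() fell back to the platform's default encoding rather than the encoding the page was saved in. A minimal sketch of passing encoding= explicitly follows. It assumes the page is UTF-8, which the original does not state; the file name page.html and its contents are hypothetical stand-ins.

```python
import os
import tempfile

# Hypothetical stand-in for the saved web page; assumed to be UTF-8.
page = "<title>寒冷</title>"
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "wb") as f:
    f.write(page.encode("utf-8"))

# Naming the encoding means the result no longer depends on the
# platform default (gbk on the machine in the traceback above).
with open(path, "r", encoding="utf-8") as f:
    context = f.read()

# The wrong codec either raises UnicodeDecodeError or silently
# produces garbled text, depending on the bytes involved.
try:
    with open(path, "r", encoding="gbk") as f:
        wrong = f.read()
except UnicodeDecodeError:
    wrong = None

print(context)
print(wrong)
```

If the page really is GBK-encoded, the same call with encoding="gbk" applies instead.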

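Opening the file in binary mode ('rb') avoids the error, but later processing then fails because bytes and str are incompatible types. Forcing a type conversion directly does not help either: the subsequent processing then finds nothing.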
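A small sketch of both symptoms, assuming the "forced type conversion" was a plain str(data) call (the original does not show the exact code). In that case str() returns the printable representation of the bytes object, not the decoded text, which would explain why later searches find nothing.

```python
# bytes as they would come back from a file opened with 'rb'
data = "寒冷\n".encode("utf-8")

# Mixing str and bytes raises TypeError in Python 3.
try:
    "寒" in data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# str(data) yields the repr text "b'\xe5\xaf...'" with literal
# backslashes, so the Chinese characters are not in it.
as_repr = str(data)
found_in_repr = "寒" in as_repr

# decode() turns the bytes into real text.
text = data.decode("utf-8")
found_in_text = "寒" in text

print(mixed_ok, found_in_repr, found_in_text)
```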

The reason is a change in Python 3: every str is Unicode text, so str no longer has a decode method.

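After looking it up online, I found that the bytes object returned by a binary read does have a decode method, which converts it into a str that can be processed normally.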
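As an illustration, bytes.decode() defaults to UTF-8 and accepts a codec name for other encodings. The sample strings below are made up for the sketch; for a real page the codec has to match however the page was actually encoded.

```python
utf8_bytes = "寒冷".encode("utf-8")
gbk_bytes = "寒冷".encode("gbk")

# With no argument, decode() uses UTF-8.
text_a = utf8_bytes.decode()

# GBK-encoded content needs the codec named explicitly.
text_b = gbk_bytes.decode("gbk")

# errors="replace" substitutes U+FFFD for undecodable bytes instead
# of raising, which keeps a script running at the cost of losing text.
lenient = gbk_bytes.decode("utf-8", errors="replace")

print(text_a, text_b, lenient)
```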

/tmp/ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f1 = open("unicode.txt", 'r').read()
>>> print(f1)
寒冷

>>> f2 = open("unicode.txt", 'rb').read() # open in binary mode
>>> print(f2)
b'\xe5\xaf\x92\xe5\x86\xb7\n'
>>> f2.decode()
'寒冷\n'
>>> f1.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>>
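When the encoding of a page is not known in advance, one possible approach is to try a short list of candidate codecs in order. The helper decode_with_fallback below is hypothetical and not part of the original post; the candidate list (UTF-8 first, then GB18030, a superset of GBK) is an assumption suited to Chinese pages.

```python
def decode_with_fallback(raw, candidates=("utf-8", "gb18030")):
    """Return (text, codec) for the first candidate that decodes raw."""
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Nothing matched: decode leniently with the first candidate.
    return raw.decode(candidates[0], errors="replace"), candidates[0]


text_utf8, codec_utf8 = decode_with_fallback("寒冷".encode("utf-8"))
text_gbk, codec_gbk = decode_with_fallback("寒冷".encode("gbk"))
print(text_utf8, codec_utf8)
print(text_gbk, codec_gbk)
```

UTF-8 is tried first because its strict structure makes it unlikely to accept bytes written in another encoding, whereas a GBK-family codec may accept UTF-8 bytes and return garbled text.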


Original article: http://www.cnblogs.com/jokervwer/p/4267218.html
