
requests: a 'Content-Type': 'text/html' header causes the wrong encoding 'ISO-8859-1' for Chinese text

Date: 2017-10-26 16:56:10


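1. References

Code analysis: the Chinese encoding problem in the Python requests library (代码分析Python requests库中文编码问题)

What is ISO-8859-1? It is also known as Latin-1, or the "Western European" character set.

Patch: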

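import requests

def monkey_patch():
    # Keep a handle on the original `content` property so it can be wrapped.
    prop = requests.models.Response.content

    def content(self):
        _content = prop.fget(self)
        # ISO-8859-1 here usually means "text/* with no charset in the header".
        if self.encoding == 'ISO-8859-1':
            # Prefer the charset declared in the page itself (<meta ... charset=...>).
            encodings = requests.utils.get_encodings_from_content(_content)
            if encodings:
                self.encoding = encodings[0]
            else:
                # Otherwise fall back to detection from the body.
                self.encoding = self.apparent_encoding
            # Re-encode the body as UTF-8 bytes and cache it.
            _content = _content.decode(self.encoding, 'replace').encode('utf8', 'replace')
            self._content = _content
        return _content

    requests.models.Response.content = property(content)

monkey_patch()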

2. Cause

In [291]: r = requests.get('http://cn.python-requests.org/en/latest/')

In [292]: r.headers.get('content-type')
Out[292]: 'text/html; charset=utf-8'

In [293]: r.encoding
Out[293]: 'utf-8'


In [294]: rc = requests.get('http://python3-cookbook.readthedocs.io/zh_CN/latest/index.html')

In [296]: rc.headers.get('content-type')
Out[296]: 'text/html'

In [298]: rc.encoding
Out[298]: 'ISO-8859-1'

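The response text is garbled: each byte of the UTF-8 encoded Chinese text is decoded as a separate Latin-1 character.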

In [312]: rc.text
Out[312]: u'\n\n<!DOCTYPE html>\n<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->\n<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->\n<head>\n  <meta charset="utf-8">\n  \n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  \n  <title>Python Cookbook 3rd Edition Documentation &mdash; python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3</title>\n  \n\n  \n  \n  \n  \n\n  \n\n  \n  \n    \n\n  \n\n  \n  \n\n  \n    <link rel="stylesheet" href="https://media.readthedocs.org/css/sphinx_rtd_theme.css" type="text/css" />\n  \n\n  \n        <link rel="index" title="\xe7\xb4\xa2\xe5\xbc\x95"\n              href="genindex.html"/>\n        <link rel="search" title="\xe6\x90\x9c\xe7\xb4\xa2" href="search.html"/>\n        <link rel="copyright" title="\xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89" href="copyright.html"/>\n    <link rel="top" title="python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3" href="#"/>\n        <link rel="next" title

In [313]: rc.content
Out[313]: '\n\n<!DOCTYPE html>\n<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->\n<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->\n<head>\n  <meta charset="utf-8">\n  \n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  \n  <title>Python Cookbook 3rd Edition Documentation &mdash; python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3</title>\n  \n\n  \n  \n  \n  \n\n  \n\n  \n  \n    \n\n  \n\n  \n  \n\n  \n    <link rel="stylesheet" href="https://media.readthedocs.org/css/sphinx_rtd_theme.css" type="text/css" />\n  \n\n  \n        <link rel="index" title="\xe7\xb4\xa2\xe5\xbc\x95"\n              href="genindex.html"/>\n        <link rel="search" title="\xe6\x90\x9c\xe7\xb4\xa2" href="search.html"/>\n        <link rel="copyright" title="\xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89" href="copyright.html"/>\n    <link rel="top" title="python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3" href="#"/>\n        <link rel="next" title=
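
To make the garbling concrete, here is a minimal sketch that uses only the standard library. It takes the six bytes `\xe6\x96\x87\xe6\xa1\xa3` that appear in the <title> above, decodes them once as UTF-8 and once as ISO-8859-1, and then undoes the wrong decoding. The variable names are illustrative only.

```python
# The six bytes that follow "python3-cookbook 2.0.0 " in the <title> above.
raw = b"\xe6\x96\x87\xe6\xa1\xa3"

# Correct decoding: the page declares <meta charset="utf-8">.
as_utf8 = raw.decode("utf-8")

# Decoding with the ISO-8859-1 fallback: every byte becomes one character,
# so six bytes turn into six characters instead of two.
as_latin1 = raw.decode("iso-8859-1")

# ISO-8859-1 maps bytes 0-255 one-to-one, so the wrong decoding is lossless
# and can be reversed by encoding back and decoding as UTF-8.
recovered = as_latin1.encode("iso-8859-1").decode("utf-8")

print(len(as_utf8), len(as_latin1), recovered == as_utf8)  # prints: 2 6 True
```

This is why `rc.text` and `rc.content` look identical in the Python 2 output above: the Latin-1 decoding keeps every byte value, only as a character instead of a byte.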

 

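requests falls back to 'ISO-8859-1' when three conditions hold at once: the response headers contain 'content-type', that value has no charset parameter, and the content type contains 'text'.

The referenced article says Python 3 is not affected, but in my tests it is.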

C:\Program Files\Anaconda2\Lib\site-packages\requests\utils.py

import cgi

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'

C:\Program Files\Anaconda2\Lib\site-packages\requests\adapters.py

class HTTPAdapter(BaseAdapter):
    def build_response(self, req, resp):
        # ... (excerpt; the lines that create `response` are omitted)
        # Set encoding.
        response.encoding = get_encoding_from_headers(response.headers)
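
The header rule in the excerpts above can be tried out without requests. The following is a rough sketch, not the library's code: `encoding_from_content_type` is a hypothetical helper, and it parses the header with `email.message.Message` from the standard library because the `cgi` module was removed in Python 3.13.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Mimic the three-condition rule for one Content-Type header value.

    Hypothetical helper for illustration; not part of requests.
    """
    if not content_type:
        return None

    # Let the stdlib header parser split "type/subtype; key=value" pairs.
    msg = Message()
    msg["content-type"] = content_type

    charset = msg.get_param("charset")
    if charset:
        return charset.strip("'\"")

    # No charset parameter: any content type containing "text" gets ISO-8859-1.
    if "text" in msg.get_content_type():
        return "ISO-8859-1"

    return None


print(encoding_from_content_type("text/html; charset=utf-8"))
print(encoding_from_content_type("text/html"))
```

The first call corresponds to the cn.python-requests.org response and the second to the readthedocs.io response shown earlier.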

3. Solution

Apply the patch from the referenced article, or check the encoding after each request:

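    if resp.encoding == 'ISO-8859-1':
        # get_encodings_from_content() searches the body with re.compile(r'<meta.*?charset
        # requests itself does not use this method when choosing the encoding.
        # On Python 3, pass resp.text instead, because resp.content is bytes.
        encodings = requests.utils.get_encodings_from_content(resp.content)
        if encodings:
            resp.encoding = encodings[0]
        else:
            # apparent_encoding (models.py) runs chardet.detect(self.content)['encoding'],
            # which is computationally expensive. resp.text only falls back to it
            # when self.encoding is None.
            resp.encoding = resp.apparent_encoding
        print('ISO-8859-1 changed to %s' % resp.encoding)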
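
The same idea can be sketched without requests, which makes it easy to test offline. Everything here is an assumption made for illustration: `decode_body` is a hypothetical helper, `raw` stands in for `resp.content`, `header_encoding` for `resp.encoding`, the regular expression is a simplified pattern rather than the one requests uses, and a fixed UTF-8 fallback replaces the chardet-based `apparent_encoding`.

```python
import re

# Simplified pattern for illustration; requests.utils uses its own patterns.
META_CHARSET = re.compile(br'<meta[^>]*?charset=["\']?([\w-]+)', re.I)


def decode_body(raw, header_encoding, fallback="utf-8"):
    """Decode raw body bytes, distrusting an ISO-8859-1 header default.

    Hypothetical helper: `raw` plays the role of resp.content and
    `header_encoding` the role of resp.encoding.
    """
    encoding = header_encoding
    if encoding is None or encoding.upper() == "ISO-8859-1":
        match = META_CHARSET.search(raw)
        # Prefer the charset the page declares; otherwise use the fallback.
        encoding = match.group(1).decode("ascii") if match else fallback
    return raw.decode(encoding, "replace")


page = (b'<html><head><meta charset="utf-8">'
        b'<title>\xe6\x96\x87\xe6\xa1\xa3</title></head></html>')
print(decode_body(page, "ISO-8859-1"))
```

A page that really is Latin-1 and carries no <meta> charset would be decoded with the fallback here, so the fallback choice matters for such pages.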

 


Original article: http://www.cnblogs.com/my8100/p/requests_encoding_bug.html
