Crawling HTTPS websites

Date: 2017-12-17 19:22:33

Python 2.7

import urllib2
import ssl

weburl = "https://www.douban.com/"
# Request headers that imitate a desktop browser.
webheader = {
    'Accept': 'text/html, application/xhtml+xml, */*',
    # 'Accept-Encoding': 'gzip, deflate',  # left out so the body arrives uncompressed
    'Accept-Language': 'zh-CN',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'DNT': '1',
    'Connection': 'Keep-Alive',
    'Host': 'www.douban.com'
}

# Context that skips certificate verification (urlopen accepts it from 2.7.9 on).
context = ssl._create_unverified_context()
req = urllib2.Request(url=weburl, headers=webheader)
webPage = urllib2.urlopen(req, context=context)
data = webPage.read().decode('utf-8')
print data
print type(data)
print type(webPage)
print webPage.geturl()   # final URL after redirects
print webPage.info()     # response headers
print webPage.getcode()  # HTTP status code

Python 3.6

import urllib.request
import ssl

weburl = "https://www.douban.com/"
# Request headers that imitate a desktop browser.
webheader = {
    'Accept': 'text/html, application/xhtml+xml, */*',
    # 'Accept-Encoding': 'gzip, deflate',  # left out so the body arrives uncompressed
    'Accept-Language': 'zh-CN',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'DNT': '1',
    'Connection': 'Keep-Alive',
    'Host': 'www.douban.com'
}

# Context that skips certificate verification.
context = ssl._create_unverified_context()
req = urllib.request.Request(url=weburl, headers=webheader)
webPage = urllib.request.urlopen(req, context=context)
data = webPage.read().decode('utf-8')

print(data)
print(type(webPage))
print(webPage.geturl())   # final URL after redirects
print(webPage.info())     # response headers
print(webPage.getcode())  # HTTP status code

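Crawling Douban failed with the error "SSL: CERTIFICATE_VERIFY_FAILED". Starting with Python 2.7.9 (and 3.4.3 on the Python 3 side, see PEP 476), urlopen verifies the server's SSL certificate by default whenever it opens an https link. The exception is raised when that verification fails, for example when the target site uses a self-signed certificate.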
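To make the difference concrete, the following minimal sketch compares the default (verifying) context with the unverified one used in this post. The helper name `describe` is made up for illustration; only `ssl.create_default_context` and `ssl._create_unverified_context` come from the standard library, and the latter is a private function.

```python
import ssl


def describe(context):
    """Return the two settings that decide whether a certificate is checked."""
    return {
        "verify_mode": context.verify_mode.name,
        "check_hostname": context.check_hostname,
    }


# Default context: the certificate chain and the host name are both checked.
verified = ssl.create_default_context()

# Unverified context: neither check is performed.
unverified = ssl._create_unverified_context()

print(describe(verified))
print(describe(unverified))
```

With the unverified context the connection is still encrypted, but nothing confirms that the peer really is the intended server, so it is best kept to testing or to hosts you already trust.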

There are two ways to work around it:

1) Use ssl to create an unverified context and pass it to urlopen through the context parameter

import ssl

context = ssl._create_unverified_context()

webPage = urllib.request.urlopen(req,context=context)

2) Disable certificate verification globally

import ssl

ssl._create_default_https_context = ssl._create_unverified_context
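The global override affects every later HTTPS request in the process. One possible way to limit its reach, sketched here as an assumption rather than something the original post does, is to wrap the assignment in a context manager that restores the previous factory afterwards. The name `unverified_https` is hypothetical.

```python
import contextlib
import ssl


@contextlib.contextmanager
def unverified_https():
    """Temporarily disable HTTPS certificate verification for urlopen."""
    original = ssl._create_default_https_context
    ssl._create_default_https_context = ssl._create_unverified_context
    try:
        yield
    finally:
        # Restore the previous factory even if the request raised.
        ssl._create_default_https_context = original
```

It could then be used as `with unverified_https(): webPage = urllib.request.urlopen(req)`, with verification switched back on once the block exits.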

In addition, if you use the get method of the requests module, it takes a verify parameter; setting it to False is enough.

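Fixing 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

'Accept-Encoding': 'gzip, deflate',

This header tells the server that the client can accept compressed data, so the server compresses large responses before sending them back. A browser such as IE decompresses the data locally once it has been received. The error occurs because your program never decompresses the response: the body is still gzip data, whose second byte is 0x8b, and that is not valid UTF-8. Removing this header line avoids the problem.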
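If you would rather keep the Accept-Encoding header, the alternative is to decompress the body yourself before decoding it. The sketch below is one way to do that with the standard library; `decode_body` and its parameters are illustrative names, not part of the original post.

```python
import gzip
import zlib


def decode_body(raw, content_encoding, charset="utf-8"):
    """Decompress raw bytes according to Content-Encoding, then decode to text."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            raw = zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)
    return raw.decode(charset)


# Offline demonstration: gzip data starts with the bytes 0x1f 0x8b,
# which is where the "byte 0x8b in position 1" in the error comes from.
sample = gzip.compress("豆瓣".encode("utf-8"))
print(sample[:2])
print(decode_body(sample, "gzip"))
```

With the Python 3.6 code above, this would be called as `decode_body(webPage.read(), webPage.info().get("Content-Encoding"))`.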

Original article: http://www.cnblogs.com/domestique/p/8052686.html
