Python 2.7
import urllib2
import ssl

weburl = "https://www.douban.com/"
webheader = {
    'Accept': 'text/html, application/xhtml+xml, */*',
    # 'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'DNT': '1',
    'Connection': 'Keep-Alive',
    'Host': 'www.douban.com'
}

context = ssl._create_unverified_context()
req = urllib2.Request(url=weburl, headers=webheader)
webPage = urllib2.urlopen(req, context=context)

data = webPage.read().decode('utf-8')
print data
print type(data)
print type(webPage)
print webPage.geturl()
print webPage.info()
print webPage.getcode()
Python 3.6
import urllib.request
import ssl

weburl = "https://www.douban.com/"
webheader = {
    'Accept': 'text/html, application/xhtml+xml, */*',
    # 'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'DNT': '1',
    'Connection': 'Keep-Alive',
    'Host': 'www.douban.com'
}

context = ssl._create_unverified_context()
req = urllib.request.Request(url=weburl, headers=webheader)
webPage = urllib.request.urlopen(req, context=context)

data = webPage.read().decode('utf-8')
print(data)
print(type(webPage))
print(webPage.geturl())
print(webPage.info())
print(webPage.getcode())
Scraping douban.com with this crawler raised the error "SSL: CERTIFICATE_VERIFY_FAILED". Python 2.7.9 introduced a new behavior: when urllib.urlopen opens an https URL, the server's SSL certificate is verified. When the target site uses a self-signed certificate, this exception is raised.
There are two solutions:
1) Use the ssl module to create an unverified context and pass it to urlopen via the context argument:
import ssl
context = ssl._create_unverified_context()
webPage = urllib.request.urlopen(req, context=context)
2) Disable certificate verification globally:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
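A minimal usage sketch (Python 3): once the default HTTPS context factory has been patched, urlopen no longer needs an explicit context argument.

import ssl
import urllib.request

# Replace the default HTTPS context factory with the unverified one
# before making any request.
ssl._create_default_https_context = ssl._create_unverified_context

# Certificate verification is now skipped for all urlopen calls.
webPage = urllib.request.urlopen("https://www.douban.com/")
print(webPage.getcode())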
Alternatively, if you use the requests module's get method, it takes a verify parameter; setting it to False disables certificate verification in the same way.
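For example (a minimal sketch; requests is a third-party package installed with pip install requests):

import requests

# verify=False skips SSL certificate verification; requests will emit
# an InsecureRequestWarning, which can be silenced via urllib3 if needed.
resp = requests.get("https://www.douban.com/", verify=False)
print(resp.status_code)
data = resp.text  # requests also decompresses gzip/deflate bodies automatically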
Fixing "'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte"
'Accept-Encoding': 'gzip, deflate',
This header tells the server that the client can accept compressed data; the server then compresses large responses before sending them back, and a browser such as IE decompresses the file locally after receiving it. The error occurs because the program above never decompresses the response (0x8b is the second byte of the gzip magic number), so deleting this header line makes the problem go away.
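If you would rather keep compressed transfers, you can decompress the body yourself instead of dropping the header. A minimal sketch (Python 3, reusing weburl, webheader, and context from the example above, with the Accept-Encoding line left in the headers; this handles gzip only, not deflate):

import gzip
import urllib.request

req = urllib.request.Request(url=weburl, headers=webheader)
webPage = urllib.request.urlopen(req, context=context)

raw = webPage.read()
# Only gunzip when the server actually compressed the body.
if webPage.info().get('Content-Encoding') == 'gzip':
    raw = gzip.decompress(raw)
data = raw.decode('utf-8')
print(data)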