This section covers the network-related parts of the Python language.
1. urlopen
The main code lives in the request.py module of the urllib package, which also supports access over SSL. Let's look at the module's main classes and functions, starting with the source code of urlopen:
def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False):
    global _opener
    if cafile or capath or cadefault:
        if not _have_ssl:
            raise ValueError('SSL support not available')
        context = ssl._create_stdlib_context(cert_reqs=ssl.CERT_REQUIRED,
                                             cafile=cafile,
                                             capath=capath)
        https_handler = HTTPSHandler(context=context, check_hostname=True)
        opener = build_opener(https_handler)
    elif _opener is None:
        _opener = opener = build_opener()
    else:
        opener = _opener
    return opener.open(url, data, timeout)
The urlopen function gives direct web access; the key argument to pass is the URL itself.
import urllib.request

if __name__ == '__main__':
    print('Main Thread Run :', __name__)
    ResponseData = urllib.request.urlopen('http://www.baidu.com/robots.txt')
    strData = ResponseData.read()
    strShow = strData.decode('utf-8')
    if False:
        print(ResponseData.geturl())
    if False:
        print(ResponseData.info())
    else:
        print(ResponseData.__sizeof__())
    print(strShow)
    ResponseData.close()
    print('\nMain Thread Exit :', __name__)
The output:
Main Thread Run : __main__
32
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

(identical Disallow blocks follow for MSNBot, Baiduspider-image, YoudaoBot,
Sogou web spider, Sogou inst spider, Sogou spider2, Sogou blog,
Sogou News Spider, Sogou Orion spider, ChinasoSpider, Sosospider,
yisouspider and EasouSpider)

User-agent: *
Disallow: /

Main Thread Exit : __main__
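The geturl() and info() calls are disabled by the if False branches in the sample above. As a small sketch of what else the response object offers (these are all standard methods of the returned response, nothing beyond the stdlib API):

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com/robots.txt')
print(response.geturl())   # final URL, after any redirects
print(response.getcode())  # HTTP status code, e.g. 200
print(response.info())     # response headers, as an email.message.Message
response.close()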
2. urlretrieve

The urlretrieve function fetches the content at a given URL and saves it to a local file. It returns a tuple of two values: the first is the name of the local file, and the second is the HTTP response headers returned by the web server.
def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.
A quick test:
import urllib.request

if __name__ == '__main__':
    print('Main Thread Run :', __name__)
    data = urllib.request.urlretrieve('http://www.baidu.com/robots.txt',
                                      'robots.txt')
    print('--filename--:', data[0])
    print('--response--:', data[1])
    print('\nMain Thread Exit :', __name__)
Main Thread Run : __main__
--filename--: robots.txt
--response--: Date: Mon, 22 Sep 2014 08:08:05 GMT
Server: Apache
P3P: CP=" OTI DSP COR IVA OUR IND COM "
Set-Cookie: BAIDUID=4FB847BEE916A0F72ABC5093271CD2BC:FG=1; expires=Tue, 22-Sep-15 08:08:05 GMT; max-age=31536000; path=/; domain=.baidu.com; version=1
Last-Modified: Thu, 17 Jul 2014 07:10:38 GMT
ETag: "91e-4fe5e56791780"
Accept-Ranges: bytes
Content-Length: 2334
Vary: Accept-Encoding,User-Agent
Connection: Close
Content-Type: text/plain

Main Thread Exit : __main__
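The signature above also lists a reporthook parameter that the text does not demonstrate. As a hedged sketch (the callback name is my own), urlretrieve calls the hook as reporthook(block_number, block_size, total_size), which is enough for a simple progress display:

import urllib.request

# Called once at the start and then after each block is read.
def show_progress(block_num, block_size, total_size):
    downloaded = block_num * block_size
    if total_size > 0:
        percent = min(100, downloaded * 100 // total_size)
        print('downloaded %d%%' % percent)

urllib.request.urlretrieve('http://www.baidu.com/robots.txt',
                           'robots.txt', reporthook=show_progress)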
3. request_host

The request_host function extracts the host address contained in a URL. It takes a single argument: a Request instance (the Request class itself is introduced in the next section). Here is the function's source code:
def request_host(request):
    """Return request-host, as defined by RFC 2965.

    Variation from RFC: returned value is lowercased, for convenient
    comparison.
    """
    url = request.full_url
    host = urlparse(url)[1]
    if host == "":
        host = request.get_header("Host", "")
    # remove port, if present
    host = _cut_port_re.sub("", host, 1)
    return host.lower()
import urllib.request

if __name__ == '__main__':
    print('Main Thread Run :', __name__)
    Req = urllib.request.Request('http://www.baidu.com/robots.txt')
    host = urllib.request.request_host(Req)
    print(host)
    print('\nMain Thread Exit :', __name__)
Output:
Main Thread Run : __main__
www.baidu.com

Main Thread Exit : __main__
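If you only need the host and would rather not call this internal helper, a similar result can be obtained with the public urllib.parse API (a sketch; urlsplit().hostname is already lowercased and has the port stripped, much like request_host's return value):

from urllib.parse import urlsplit

host = urlsplit('http://www.baidu.com/robots.txt').hostname
print(host)  # www.baidu.com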
4. The Request class

Next up is the module's main class, Request. Note that it starts with a capital R; don't get that wrong.

Source code first:
class Request:

    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):
        self.full_url = url
        self.headers = {}
        self.unredirected_hdrs = {}
        self._data = None
        self.data = data
        self._tunnel_host = None
        for key, value in headers.items():
            self.add_header(key, value)
        if origin_req_host is None:
            origin_req_host = request_host(self)
        self.origin_req_host = origin_req_host
        self.unverifiable = unverifiable
        if method:
            self.method = method

    @property
    def full_url(self):
        if self.fragment:
            return '{}#{}'.format(self._full_url, self.fragment)
        return self._full_url

    @full_url.setter
    def full_url(self, url):
        # unwrap('<URL:type://host/path>') --> 'type://host/path'
        self._full_url = unwrap(url)
        self._full_url, self.fragment = splittag(self._full_url)
        self._parse()

    @full_url.deleter
    def full_url(self):
        self._full_url = None
        self.fragment = None
        self.selector = ''

    @property
    def data(self):
        return self._data

    @data.setter
    def data(self, data):
        if data != self._data:
            self._data = data
            # issue 16464
            # if we change data we need to remove content-length header
            # (cause it's most probably calculated for previous value)
            if self.has_header("Content-length"):
                self.remove_header("Content-length")

    @data.deleter
    def data(self):
        self.data = None

    def _parse(self):
        self.type, rest = splittype(self._full_url)
        if self.type is None:
            raise ValueError("unknown url type: %r" % self.full_url)
        self.host, self.selector = splithost(rest)
        if self.host:
            self.host = unquote(self.host)

    def get_method(self):
        """Return a string indicating the HTTP request method."""
        default_method = "POST" if self.data is not None else "GET"
        return getattr(self, 'method', default_method)

    def get_full_url(self):
        return self.full_url

    def set_proxy(self, host, type):
        if self.type == 'https' and not self._tunnel_host:
            self._tunnel_host = self.host
        else:
            self.type = type
            self.selector = self.full_url
        self.host = host

    def has_proxy(self):
        return self.selector == self.full_url

    def add_header(self, key, val):
        # useful for something like authentication
        self.headers[key.capitalize()] = val

    def add_unredirected_header(self, key, val):
        # will not be added to a redirected request
        self.unredirected_hdrs[key.capitalize()] = val

    def has_header(self, header_name):
        return (header_name in self.headers or
                header_name in self.unredirected_hdrs)

    def get_header(self, header_name, default=None):
        return self.headers.get(
            header_name,
            self.unredirected_hdrs.get(header_name, default))

    def remove_header(self, header_name):
        self.headers.pop(header_name, None)
        self.unredirected_hdrs.pop(header_name, None)

    def header_items(self):
        hdrs = self.unredirected_hdrs.copy()
        hdrs.update(self.headers)
        return list(hdrs.items())
Note the key parameters of __init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None): url is the URL you want to access, data is the body to send with a POST, headers holds the extra fields to include in the HTTP request header, and method selects GET or POST. If method is omitted, the request defaults to GET, and switches to POST as soon as data is not None (see get_method in the source above).
Req = urllib.request.Request('http://www.baidu.com/robots.txt')
For example, to add a User-Agent field to the request header:
USER_AGENT = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; '
                            'rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
Req = urllib.request.Request(url='http://www.baidu.com/robots.txt',
                             headers=USER_AGENT)
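Since a non-None data argument is what turns the request into a POST, here is a minimal sketch of sending form data (the target URL is only a placeholder; the body must be bytes, hence the encode call):

import urllib.parse
import urllib.request

# Placeholder endpoint; substitute a URL that actually accepts POST data.
post_data = urllib.parse.urlencode({'key': 'value'}).encode('utf-8')
req = urllib.request.Request(url='http://www.example.com/post',
                             data=post_data,
                             headers=USER_AGENT)
print(req.get_method())  # 'POST', because data is not None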
To change the default timeout:
import socket
socket.setdefaulttimeout(10)  # 10 s
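Note that setdefaulttimeout changes the timeout globally for every socket in the process. urlopen also takes a per-call timeout argument, which only affects that one request:

import urllib.request

# Only this request uses the 5-second timeout.
response = urllib.request.urlopen('http://www.baidu.com/robots.txt',
                                  timeout=5)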
5. Using a proxy

The proxy handler and its address must be configured before any web access call is made. Example code:
import socket
import urllib.request

socket.setdefaulttimeout(10)  # 10 s

if __name__ == '__main__':
    print('Main Thread Run :', __name__)
    proxy = urllib.request.ProxyHandler({'http': 'http://www.baidu.com:8080'})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    content = urllib.request.urlopen('http://www.baidu.com/robots.txt').read()
    print('\nMain Thread Exit :', __name__)
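As an alternative to installing a global opener, the set_proxy method shown in the Request source above routes a single request through a proxy. A sketch, with a placeholder proxy address:

import urllib.request

req = urllib.request.Request('http://www.baidu.com/robots.txt')
# set_proxy rewrites the request's host and selector so it goes via the proxy.
req.set_proxy('proxy.example.com:8080', 'http')
content = urllib.request.urlopen(req).read()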
6. Error and exception handling

This part covers handling exceptions from Python's network functions, which mostly comes down to try and except blocks. One important point to remember: wrap each statement that can raise in its own try/except, so you always know exactly which call failed.

Example:
import urllib.request
from urllib.error import HTTPError, URLError

# HTTPError is a subclass of URLError, which is a subclass of OSError,
# so the except clauses must be ordered from most to least specific.
try:
    reqUrl = urllib.request.Request(url='http://www.baidu.com/robots.txt',
                                    headers=USER_AGENT)
except HTTPError:
    print('urllib.error.HTTPError')
except URLError:
    print('urllib.error.URLError')
except OSError:
    print('urllib.error.OSError')

try:
    responseData = urllib.request.urlopen(reqUrl)
except HTTPError:
    print('urllib.error.HTTPError')
except URLError:
    print('urllib.error.URLError')
except OSError:
    print('urllib.error.OSError')

try:
    pageData = responseData.read()
except HTTPError:
    responseData.close()
    print('urllib.error.HTTPError')
except URLError:
    responseData.close()
    print('urllib.error.URLError')
except OSError:
    print('urllib.error.OSError')

print(pageData)
responseData.close()
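When catching these exceptions it often helps to bind them to a name: HTTPError carries the HTTP status code and also behaves like a response object, while URLError only has a reason attribute. A short sketch (the path is chosen only to provoke an error):

import urllib.request
from urllib.error import HTTPError, URLError

try:
    responseData = urllib.request.urlopen('http://www.baidu.com/no-such-page')
except HTTPError as e:
    # e.code is the status (e.g. 404); e.read() returns the error page body
    print('HTTP error:', e.code, e.reason)
except URLError as e:
    # e.reason describes the failure (DNS error, connection refused, ...)
    print('URL error:', e.reason)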
7. Notes

The above covers the basics of the web-access functions and classes; of course, many other methods and functions can accomplish the same things. Call whichever suits your own preferences and needs. And remember, I am just learning the basics and have organized my notes here for newcomers and for my own later reference.
Original article: http://blog.csdn.net/microzone/article/details/39476893