Writing a Web Crawler in Python, from Scratch (1): Your First Web Crawler

Date: 2017-10-08 21:25:33


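This article starts with the simplest possible crawler and improves it step by step: detecting download errors, setting a user agent, and adding proxy support.
First, a note on how to use the code: it targets Python 2.7 and can be run either from the command line or from PyCharm. Each step defines a function, which you then call to fetch a page.
For example:

        download1('http://www.baidu.com')

        download2('http://www.baidu.com')

        download3('http://www.baidu.com')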




1. The simplest web crawler, in three lines of code

import urllib2
import urlparse


def download1(url):
    """Simple downloader"""
    return urllib2.urlopen(url).read()
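
The listing above relies on urllib2, which exists only in Python 2. For readers on Python 3, a minimal sketch of the same function might look like the following. It assumes Python 3, where the urllib2 functionality lives in urllib.request; this is a port for illustration, not part of the original article.

```python
# Python 3 sketch of download1 (assumption: the original targets Python 2.7).
from urllib.request import urlopen


def download1(url):
    """Simple downloader: return the raw response body as bytes."""
    return urlopen(url).read()
```

In Python 3 the result is a bytes object, so it usually needs to be decoded (for example with .decode('utf-8'), if the page is known to be UTF-8) before being treated as text.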

2. An upgrade: a crawler that handles download errors

def download2(url):
    """Download function that catches errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
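
Catching URLError is enough to cover HTTP status errors as well, because HTTPError is a subclass of URLError. The sketch below illustrates the difference between the two. It assumes Python 3 (urllib.error rather than urllib2), and the describe_error helper and the example errors are hypothetical, built by hand instead of coming from a real download.

```python
# Python 3 sketch; describe_error is a hypothetical helper for illustration.
from urllib.error import HTTPError, URLError


def describe_error(error):
    """Return a short description of a urllib download error."""
    # HTTPError must be checked first because it is a subclass of URLError.
    if isinstance(error, HTTPError):
        # The server responded, so an HTTP status code is available.
        return 'HTTP %d: %s' % (error.code, error.reason)
    # A plain URLError means no HTTP response was received at all.
    return 'URL error: %s' % error.reason


# Hand-built errors standing in for real download failures.
not_found = HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
unreachable = URLError('connection refused')

print(describe_error(not_found))
print(describe_error(unreachable))
```

Only the HTTPError carries a code attribute, which is why code that inspects the status has to check for its presence first.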
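3. 5XX errors usually originate on the server side. Add a check to the crawler: when the error code is at least 500 and below 600, retry the download, up to 2 more times.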

def download3(url, num_retries=2):
    """Download function that also retries 5XX errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download3(url, num_retries-1)
    return html
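
The retry rule can be tried out without touching the network. The sketch below is one way to do that, assuming Python 3: the download step is passed in as a fetch callable, and FlakyServer is a hypothetical stand-in that fails with a 503 a fixed number of times. Neither name appears in the original article.

```python
# Python 3 sketch; download_with_retries and FlakyServer are hypothetical names.
from urllib.error import HTTPError, URLError


def download_with_retries(url, fetch, num_retries=2):
    """Call fetch(url); retry only when the error carries a 5XX status code."""
    try:
        return fetch(url)
    except URLError as e:
        code = getattr(e, 'code', None)  # plain URLError has no status code
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return download_with_retries(url, fetch, num_retries - 1)
        return None


class FlakyServer(object):
    """Fake fetcher: raises a 503 for the first `failures` calls, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, url):
        self.calls += 1
        if self.calls <= self.failures:
            raise HTTPError(url, 503, 'Service Unavailable', None, None)
        return b'<html>ok</html>'


server = FlakyServer(failures=2)
html = download_with_retries('http://example.com', server)
print(server.calls, html)
```

With num_retries=2 the function makes at most three attempts in total: the first try plus two retries. Errors outside the 5XX range, such as a 404, are not retried, since repeating the request is unlikely to change the outcome.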

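4. Setting the user agent
By default, urllib2 identifies itself with the user agent Python-urllib/2.7, and some websites block this default crawler identity. The function below sets a user agent named "wswp" instead.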

def download4(url, user_agent='wswp', num_retries=2):
    """Download function that includes user agent support"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download4(url, user_agent, num_retries-1)
    return html
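
To confirm that the custom user agent really ends up on the request, the header can be read back from the Request object before anything is sent. This is a small sketch assuming Python 3 (urllib.request instead of urllib2); build_request is a hypothetical helper name.

```python
# Python 3 sketch; build_request is a hypothetical helper for illustration.
from urllib.request import Request


def build_request(url, user_agent='wswp'):
    """Create a Request object that carries a custom User-agent header."""
    return Request(url, headers={'User-agent': user_agent})


request = build_request('http://example.com')
# Read the header back from the request; nothing is sent over the network.
print(request.get_header('User-agent'))
```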

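5. Proxy support
Sometimes a website has to be accessed through a proxy. For example, a service such as Netflix may be unavailable from some countries. The function below adds proxy support using urllib2's ProxyHandler; the requests module offers a friendlier interface for the same task.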

import urllib2
import urlparse

def download5(url, user_agent='wswp', proxy=None, num_retries=2):
    """Download function with support for proxies"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download5(url, user_agent, proxy, num_retries-1)
    return html
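
The key step in download5 is the proxy_params dictionary, which maps the URL's scheme to the proxy address so that only requests of that scheme go through the proxy. The sketch below isolates that step. It assumes Python 3 (urllib.parse and urllib.request), build_proxy_opener is a hypothetical helper, and the proxy address 127.0.0.1:8080 is a placeholder, not a real proxy.

```python
# Python 3 sketch; build_proxy_opener and the proxy address are hypothetical.
from urllib.parse import urlparse
from urllib.request import ProxyHandler, build_opener


def build_proxy_opener(url, proxy):
    """Return (opener, proxy_params) routing url's scheme through proxy."""
    # Map the scheme of the target URL ('http' or 'https') to the proxy.
    proxy_params = {urlparse(url).scheme: proxy}
    opener = build_opener(ProxyHandler(proxy_params))
    return opener, proxy_params


opener, params = build_proxy_opener('http://example.com/page',
                                    'http://127.0.0.1:8080')
print(params)
```

Building the opener does not open any connection; the proxy is contacted only when opener.open() is called.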

Original article: http://www.cnblogs.com/mrruning/p/7638377.html
