A Downloader Middleware has three core methods:
process_request(request, spider)
process_response(request, response, spider)
process_exception(request, exception, spider)
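The return value of each method decides what Scrapy does next. The skeleton below is a minimal sketch of that contract. The class name SketchMiddleware is made up for illustration, and the methods do nothing except fall through.

```python
class SketchMiddleware:
    """Hypothetical no-op middleware that only documents the three hooks."""

    def process_request(self, request, spider):
        # Called for every request on its way to the downloader.
        # None     -> keep going (other middlewares, then the download)
        # Response -> skip the download and use this response instead
        # Request  -> stop here and schedule the returned request
        # Raising IgnoreRequest drops the request.
        return None

    def process_response(self, request, response, spider):
        # Called for every response on its way back to the spider.
        # Must return a Response (passed along) or a Request (rescheduled),
        # or raise IgnoreRequest.
        return response

    def process_exception(self, request, exception, spider):
        # Called when the download, or a process_request(), raises.
        # None -> let other middlewares handle the exception
        # Response or Request -> stop the exception chain and use that object
        return None
```

A middleware only has to define the hooks it needs. The example later in this post defines process_request and process_response and leaves process_exception out.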
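Method 1 for setting the User-Agent: edit the USER_AGENT variable in settings.py by adding one line, USER_AGENT = '....'
Method 2: edit middlewares.py so that each request gets a random User-Agent. Define a RandomUserAgentMiddleware class there and give it a process_request() method.
To modify the response, define a process_response() method in middlewares.py as well.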
scrapy startproject httpbintest
cd httpbintest && scrapy genspider httpbin httpbin.org
# -*- coding: utf-8 -*-
import scrapy


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        # print(response.text)
        self.logger.debug(response.text)
        self.logger.debug('status code: ' + str(response.status))
In the middleware below, process_request picks a random User-Agent for each request, and process_response changes the response status code to 201.
import random


class RandomUserAgentMiddleware:
    def __init__(self):
        # Pool of User-Agent strings to choose from
        self.user_agents = [
            'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2',
            'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1'
        ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header; returning None lets the request continue
        request.headers['User-Agent'] = random.choice(self.user_agents)

    def process_response(self, request, response, spider):
        # Rewrite the status code before the response reaches the spider
        response.status = 201
        return response
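Because process_request only touches request.headers, the random choice can be checked without starting a crawl. The following is a sketch under one assumption: FakeRequest is a hypothetical stand-in (not part of Scrapy) whose headers attribute is a plain dict. The middleware is repeated in shortened form so that the snippet runs on its own.

```python
import random


class RandomUserAgentMiddleware:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2',
            'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1',
        ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)


class FakeRequest:
    """Hypothetical stand-in for a Scrapy request; it only carries headers."""

    def __init__(self):
        self.headers = {}


def sample_user_agents(middleware, n=20):
    """Call process_request n times and collect the User-Agent values chosen."""
    seen = set()
    for _ in range(n):
        request = FakeRequest()
        middleware.process_request(request, spider=None)
        seen.add(request.headers['User-Agent'])
    return seen


if __name__ == '__main__':
    mw = RandomUserAgentMiddleware()
    for user_agent in sorted(sample_user_agents(mw)):
        print(user_agent)
```

In a real crawl, request.headers is a Scrapy Headers object instead of a dict, so this only checks the selection logic, not the header encoding.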
DOWNLOADER_MIDDLEWARES = {
    'httpbintest.middlewares.RandomUserAgentMiddleware': 543,
}
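httpbin.org/get echoes the request headers back as JSON, so the User-Agent picked by the middleware appears in the response body that the spider logs. The following is a sketch of pulling it out, assuming the body has a top-level "headers" object. The helper name extract_user_agent and the sample body are made up for illustration.

```python
import json


def extract_user_agent(body_text):
    """Return the User-Agent echoed in an httpbin-style JSON body, or None."""
    data = json.loads(body_text)
    return data.get('headers', {}).get('User-Agent')


if __name__ == '__main__':
    # Hand-written sample shaped like an httpbin.org/get response (assumption)
    sample_body = json.dumps({
        'headers': {
            'Host': 'httpbin.org',
            'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1',
        },
        'url': 'http://httpbin.org/get',
    })
    print(extract_user_agent(sample_body))
```

Inside parse(), this could be called as extract_user_agent(response.text) and the result logged next to the status code.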
Original article: https://www.cnblogs.com/regit/p/9406279.html