
Scrapy: Random User-Agent

Date: 2020-10-21 21:26:03


Scrapy's own source code shows how the built-in User-Agent is set:

scrapy.downloadermiddlewares.useragent.UserAgentMiddleware

"""Set User-Agent header per spider or use a default value from settings"""

from scrapy import signals

class UserAgentMiddleware:
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)
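
The precedence this middleware implements can be modelled without Scrapy. The sketch below is only an illustration: `resolve_user_agent` is a hypothetical helper, and a plain dict stands in for `request.headers`. It suggests that a header already set on the request wins, then the spider's `user_agent` attribute, then the `USER_AGENT` setting.

```python
# Plain-Python model of the precedence in UserAgentMiddleware.
# Assumptions: `resolve_user_agent` is a hypothetical helper and a dict
# stands in for request.headers; nothing here imports Scrapy.

def resolve_user_agent(request_headers, spider_user_agent=None,
                       setting_user_agent="Scrapy"):
    """Return the User-Agent the request ends up with."""
    # spider_opened: a spider-level attribute replaces the setting.
    # Here None means "the spider defines no user_agent attribute".
    if spider_user_agent is not None:
        user_agent = spider_user_agent
    else:
        user_agent = setting_user_agent

    # process_request: setdefault only fills the header when it is missing.
    if user_agent:
        request_headers.setdefault("User-Agent", user_agent)
    return request_headers.get("User-Agent")


print(resolve_user_agent({}))
print(resolve_user_agent({}, spider_user_agent="my-spider/1.0"))
print(resolve_user_agent({"User-Agent": "per-request/2.0"},
                         spider_user_agent="my-spider/1.0"))
```

Because the header is written with `setdefault`, any replacement middleware that wants to respect per-request headers can use the same call.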

The default configuration:

DOWNLOADER_MIDDLEWARES_BASE = {
    ...
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    ...
}

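First, disable the built-in User-Agent middleware and add our own User-Agent strings. Note that Scrapy normally expects USER_AGENT to be a single string; here it is repurposed as a list for the custom middleware to choose from.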

USER_AGENT = ['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36',
              'Mozilla/5.0 (X11; NetBSD) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36',
              'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
              'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
              'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36',
              'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/4E423F',
              'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/4E423F',
              'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36']


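# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in middleware and register our own in its slot
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'crawler.middlewares.RandomUserAgentMiddleware': 500,
}

# crawler/middlewares.py
from random import choice


class RandomUserAgentMiddleware:
    """Set a randomly chosen User-Agent header on each request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT is the list defined in settings.py above
        return cls(crawler.settings.getlist('USER_AGENT'))

    def process_request(self, request, spider):
        # setdefault keeps any User-Agent already set on the request
        request.headers.setdefault(b'User-Agent', choice(self.user_agents))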
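
The selection logic can be checked outside a crawl. The following is a minimal sketch under stated assumptions, not the author's code: `RandomUserAgentSketch` and `make_request` are hypothetical names, and a `SimpleNamespace` holding a plain dict stands in for a Scrapy `Request` (real Scrapy headers normalize keys and values, which this stand-in ignores).

```python
import random
from types import SimpleNamespace

# Two entries taken from the list above, to keep the example short.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/41.0.2227.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/37.0.2049.0 Safari/537.36",
]


class RandomUserAgentSketch:
    """Hypothetical stand-in mirroring the middleware's process_request."""

    def __init__(self, user_agents, rng=None):
        if not user_agents:
            raise ValueError("user_agents must not be empty")
        self.user_agents = list(user_agents)
        # An injectable random.Random makes the choice reproducible in tests.
        self.rng = rng or random.Random()

    def process_request(self, request, spider=None):
        # Only fill the header when the request does not already carry one.
        request.headers.setdefault(b"User-Agent", self.rng.choice(self.user_agents))


def make_request(headers=None):
    """Build a fake request object exposing a dict as .headers."""
    return SimpleNamespace(headers=dict(headers or {}))


middleware = RandomUserAgentSketch(USER_AGENTS)

plain = make_request()
middleware.process_request(plain)

preset = make_request({b"User-Agent": "custom/1.0"})
middleware.process_request(preset)

print(plain.headers[b"User-Agent"] in USER_AGENTS)
print(preset.headers[b"User-Agent"])
```

Failing early on an empty list is a design choice of this sketch; without it, `choice` would raise an `IndexError` on the first request instead.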

Alternatively, use a ready-made package:

pip install scrapy-fake-useragent

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the built-in middleware
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 500,  # enable the random User-Agent middleware
}


Original article: https://www.cnblogs.com/iFanLiwei/p/13853685.html
