Whole-Site Crawler

Date: 2018-11-27 22:07:59

Goal: crawl Lagou (www.lagou.com).

First, enter the virtual environment: workon ……

To start, here is a useful command that lists the available spider templates:

scrapy genspider --list
>>>
  basic
  crawl
  csvfeed
  xmlfeed

If no template is specified, basic is used by default.

Create a new spider for Lagou:

scrapy genspider -t crawl lagou www.lagou.com

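CrawlSpider is a generic Spider provided by Scrapy. In it we can specify crawl rules that drive page extraction; each rule is represented by a dedicated data structure, Rule. A Rule holds the configuration for extracting and following links, and the spider uses its Rules to decide which links on the current page should be crawled further and which method should parse the result of each crawled page.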

rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_job', follow=True),
    )

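Each rule is an instance of the Rule class, and the callback is the name of a method on the spider class, written as a string.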
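To make these two ideas concrete (a regular expression decides which links a rule applies to, and the callback is looked up by its string name), here is a minimal pure-Python sketch. It is a toy model under stated assumptions, not Scrapy's implementation: MiniSpider, dispatch, and the example.com URLs are hypothetical names introduced only for illustration.

```python
import re


class MiniSpider:
    """Toy stand-in for a spider (hypothetical, not a Scrapy class)."""

    # (allow pattern, callback name) pairs, loosely mirroring
    # Rule(LinkExtractor(allow=r'Items/'), callback='parse_job').
    rules = (
        (r"Items/", "parse_job"),
    )

    def parse_job(self, url):
        # A real callback would parse a response; here we only record the URL.
        return {"url": url, "callback": "parse_job"}

    def dispatch(self, urls):
        """Yield the callback result for every URL that matches a rule."""
        for url in urls:
            for pattern, callback_name in self.rules:
                if re.search(pattern, url):
                    # The string name is resolved to a bound method here.
                    callback = getattr(self, callback_name)
                    yield callback(url)
                    break  # the first matching rule wins


spider = MiniSpider()
urls = ["https://example.com/Items/1", "https://example.com/about"]
results = list(spider.dispatch(urls))
print(results)
```

Only the first URL contains Items/, so only it reaches parse_job; the second URL matches no rule and is skipped.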

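CrawlSpider inherits from the Spider class. In addition to all of Spider's methods and attributes, it provides one very important attribute and one very important method.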

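1. rules: the crawl-rule attribute, a list containing one or more Rule objects. Each Rule defines a particular behavior for crawling the site; CrawlSpider reads every Rule in rules and processes it.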

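2. parse_start_url(): a method that can be overridden. It is called when the Request for a URL in start_urls receives its Response; it analyzes the Response and must return an Item object or a Request object.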
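The role of parse_start_url() as an overridable hook can be sketched without Scrapy. The toy classes below are assumptions for illustration only: BaseCrawler, TitleCrawler, handle_start_responses, and the dict-shaped responses are hypothetical and are not Scrapy objects. They only model the idea that the default hook produces nothing and a subclass overrides it to produce items from the start pages.

```python
class BaseCrawler:
    """Toy base class (hypothetical): the default hook returns no items."""

    def parse_start_url(self, response):
        # Default behavior: nothing is extracted from a start page.
        return []

    def handle_start_responses(self, responses):
        """Collect whatever the hook returns for each start-page response."""
        items = []
        for response in responses:
            items.extend(self.parse_start_url(response))
        return items


class TitleCrawler(BaseCrawler):
    """Overrides the hook to turn each start page into one item."""

    def parse_start_url(self, response):
        return [{"url": response["url"], "title": response["title"]}]


# Hypothetical start-page response, represented as a plain dict.
pages = [{"url": "https://example.com/", "title": "Home"}]
print(BaseCrawler().handle_start_responses(pages))
print(TitleCrawler().handle_start_responses(pages))
```

The base class yields nothing for the start page, while the subclass returns one item per start page because it overrides the hook.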

The most important part here is the definition of Rule. Its definition and parameters are shown below:

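def identity(x):
    # Default process_request hook: returns the request unchanged.
    return x


class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):
        self.link_extractor = link_extractor  # LinkExtractor that selects links from each response
        self.callback = callback  # method name (string) or callable used to parse matched pages
        self.cb_kwargs = cb_kwargs or {}  # extra keyword arguments passed to the callback
        self.process_links = process_links  # optional hook to filter or rewrite extracted links
        self.process_request = process_request  # optional hook applied to each generated request
        if follow is None:
            # Default: follow links only when no callback is given
            self.follow = False if callback else True
        else:
            self.follow = follow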
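The default for follow in the constructor above is easy to misread, so the sketch below isolates just that branch. resolve_follow is a hypothetical helper written for this illustration; it is not part of Scrapy and only copies the if/else logic shown in the Rule constructor.

```python
def resolve_follow(callback=None, follow=None):
    """Mirror the follow logic from the Rule constructor shown above."""
    if follow is None:
        # No explicit value: follow links only when there is no callback.
        return False if callback else True
    # An explicit value always wins.
    return follow


print(resolve_follow())                                   # True
print(resolve_follow(callback="parse_job"))               # False
print(resolve_follow(callback="parse_job", follow=True))  # True
```

In other words, a rule with a callback stops following links from the matched pages unless follow=True is passed explicitly, which is why the rule for parse_job earlier sets follow=True.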

Original article: https://www.cnblogs.com/zhoulixiansen/p/10029050.html
