The Scrapy Framework: CrawlSpider

Posted: 2019-03-02 23:47:47


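I. Introduction to CrawlSpider

Suppose you want a crawler to collect data from every page of the Qiushibaike ("糗百") site. How could you implement it?

Method 1: recursive crawling with Scrapy's Spider class (manually issue Request objects whose callback is the parse method).

Method 2: automatic crawling with CrawlSpider (more concise and efficient).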

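Overview

CrawlSpider is a subclass of Spider. Besides inheriting Spider's features, it adds more powerful capabilities of its own, the most notable being the link extractor (LinkExtractor). Spider is the base class of all spiders and is designed only to crawl the pages listed in start_urls. When you need to keep crawling the URLs extracted from pages that have already been fetched, CrawlSpider is the better fit.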

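II. Usage

1. Create a Scrapy project: scrapy startproject projectName

2. Create the spider file: scrapy genspider -t crawl spiderName www.xxx.com

   Compared with the usual command, this one adds "-t crawl", which means the generated spider is based on the CrawlSpider class rather than the Spider base class.

3. Look at the generated spider file

The spider file (.py):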

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# The template imports CrawlSpider instead of Spider, plus LinkExtractor (the link extractor) and Rule (the rule parser).

class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://dig.chouti.com/r/scoff/hot/1']

    # allow takes a regular expression; only links whose URL matches it are extracted.
    # Each Rule sends requests for the links its extractor finds and passes the responses to the callback for parsing.
    # follow controls whether the rules are applied again to the pages reached through those links
    # (that is, whether link extraction continues across the whole site).
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        # item['name'] = response.xpath('//div[@id="name"]').get()
        # item['description'] = response.xpath('//div[@id="description"]').get()
        return item
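To make the roles of allow and the link extractor more concrete, here is a rough standard-library model of the filtering step. It is only a sketch and not Scrapy's implementation: HrefCollector, extract_links, and the sample HTML are made up for illustration, and www.xxx.com is the placeholder domain used above. The one behavior it tries to mirror is that the allow pattern is searched for anywhere in the absolute URL, and that an empty pattern accepts every link.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url, allow=""):
    """Return the absolute URLs found in html that satisfy `allow`.

    An empty pattern accepts every link.
    """
    collector = HrefCollector()
    collector.feed(html)
    absolute = [urljoin(base_url, href) for href in collector.hrefs]
    if not allow:
        return absolute
    pattern = re.compile(allow)
    # search(), not match(): the pattern may occur anywhere in the URL.
    return [url for url in absolute if pattern.search(url)]


# Hypothetical page fragment, invented for this illustration.
SAMPLE_HTML = """
<a href="/Items/1">Item 1</a>
<a href="Items/2">Item 2</a>
<a href="/about">About</a>
"""

BASE_URL = "https://www.xxx.com/"
item_links = extract_links(SAMPLE_HTML, BASE_URL, allow=r"Items/")
all_links = extract_links(SAMPLE_HTML, BASE_URL)
print(item_links)
print(all_links)
```

In the real framework, each URL that survives this filter becomes a request, and the response is handed to the callback named in the Rule.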

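Example 1 (whole-site extraction):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://dig.chouti.com/r/scoff/hot/1']
    # Defining the link extractor separately keeps the rules readable.
    link = LinkExtractor(allow=r'/r/scoff/hot/\d+')
    rules = (
        Rule(link, callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        print(response)

# The pattern matches the pagination links because the start page URL is
# https://dig.chouti.com/r/scoff/hot/1, so the other pages differ only in the trailing number.
# Note: with follow=False, only links found on the start page are requested;
# use follow=True to keep extracting links from each page that is reached.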
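The effect of the follow flag is easier to see on a toy example. The sketch below walks a hypothetical in-memory "site" (the SITE dictionary and the crawl function are invented here, and no network access or Scrapy code is involved). It deliberately ignores details such as how Scrapy treats the start request and uses a plain visited set in place of duplicate filtering.

```python
import re
from collections import deque

# Hypothetical site: each page maps to the links that appear on it.
# Page 1 only shows links to pages 2 and 3; later pages reveal more.
SITE = {
    "/r/scoff/hot/1": ["/r/scoff/hot/2", "/r/scoff/hot/3", "/about"],
    "/r/scoff/hot/2": ["/r/scoff/hot/3"],
    "/r/scoff/hot/3": ["/r/scoff/hot/2", "/r/scoff/hot/4"],
    "/r/scoff/hot/4": ["/r/scoff/hot/3", "/r/scoff/hot/5"],
    "/r/scoff/hot/5": ["/r/scoff/hot/4"],
    "/about": [],
}

ALLOW = re.compile(r"/r/scoff/hot/\d+")


def crawl(start, follow):
    """Return the pages handed to the callback, in request order."""
    seen = {start}  # simple visited set standing in for duplicate filtering
    parsed = []
    # Each queue entry is (url, whether links are extracted from that page).
    queue = deque([(start, True)])
    while queue:
        url, extract = queue.popleft()
        if url != start:
            parsed.append(url)  # stands in for callback='parse_item'
        if not extract:
            continue
        for link in SITE.get(url, []):
            if ALLOW.search(link) and link not in seen:
                seen.add(link)
                queue.append((link, follow))
    return parsed


print(crawl("/r/scoff/hot/1", follow=False))
print(crawl("/r/scoff/hot/1", follow=True))
```

With follow=False the model only reaches the pages linked from the start page (2 and 3); with follow=True it keeps going and also reaches pages 4 and 5.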

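Example 2 (the first page has no page number in its URL):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    # Defining the link extractors separately keeps the rules readable.
    link = LinkExtractor(allow=r'/text/page/\d+/')
    link1 = LinkExtractor(allow=r'/text/')
    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)

# Note the start URL:
# https://www.qiushibaike.com/text/
# The first page has no page number in its URL, so it needs its own pattern.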
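One thing worth checking in this example is that the two patterns overlap: /text/ also occurs inside every /text/page/N/ URL. Scrapy's documentation states that when several rules match the same link, the first one in definition order is used. The snippet below is a small regex-only check of that overlap; first_matching_rule and the URL list are illustrative helpers, not part of Scrapy.

```python
import re

# The two allow patterns from the example, in rule order.
RULE_PATTERNS = [
    ("link", re.compile(r"/text/page/\d+/")),
    ("link1", re.compile(r"/text/")),
]


def first_matching_rule(url):
    """Return the name of the first rule whose pattern occurs in url."""
    for name, pattern in RULE_PATTERNS:
        if pattern.search(url):
            return name
    return None


urls = [
    "https://www.qiushibaike.com/text/",
    "https://www.qiushibaike.com/text/page/2/",
    "https://www.qiushibaike.com/pic/",
]
for url in urls:
    print(url, "->", first_matching_rule(url))
```

Here the unnumbered first page falls through to link1, numbered pages are claimed by link, and the /pic/ URL matches neither. Because both rules use the same callback and follow setting, the overlap is harmless in this case.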

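Example 3 (a loose pattern matches many similar URLs, so anchor the start or the end):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/pic/']

    # Defining the link extractors separately keeps the rules readable.
    # Remember to escape the ? with a backslash.
    link = LinkExtractor(allow=r'/pic/page/\d+\?s=')
    # Without the trailing $, this pattern for the first page would also match
    # many unrelated URLs that merely contain /pic/.
    link1 = LinkExtractor(allow=r'/pic/$')
    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)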
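Both details in this example (escaping the question mark and anchoring with $) can be verified with Python's re module alone. The URLs below follow the shape used in the example, but the value after s= is made up.

```python
import re

# Hypothetical URLs in the style of the example; the s= value is invented.
paged = "https://www.qiushibaike.com/pic/page/2?s=1234"
first = "https://www.qiushibaike.com/pic/"

escaped = re.compile(r"/pic/page/\d+\?s=")   # \? matches a literal question mark
unescaped = re.compile(r"/pic/page/\d+?s=")  # +? is a lazy quantifier, no literal ? is matched

anchored = re.compile(r"/pic/$")    # the URL must end with /pic/
unanchored = re.compile(r"/pic/")   # /pic/ may occur anywhere in the URL

print("escaped matches paged URL:   ", bool(escaped.search(paged)))
print("unescaped matches paged URL: ", bool(unescaped.search(paged)))
print("anchored matches first page: ", bool(anchored.search(first)))
print("anchored matches paged URL:  ", bool(anchored.search(paged)))
print("unanchored matches paged URL:", bool(unanchored.search(paged)))
```

The unescaped pattern fails on the paged URL because the digits would have to be followed directly by s=, while the unanchored /pic/ pattern accepts the paged URL as well, which is the kind of interference the $ is meant to rule out.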

Note: if allow is left empty, every link on the page is matched.


Original article: https://www.cnblogs.com/tjp40922/p/10463506.html
