Implementing full-site data crawling
Workflow (run in a terminal):
1. Create a project: scrapy startproject <project name>
2. cd into the project directory
3. Create the spider file: scrapy genspider -t crawl <spider name> www.xxx.com
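For the project used in this post, the concrete commands would look like the following (a sketch: the project name sunPro and the spider name sun are taken from the import path and spider shown below, they are not spelled out in the original steps):
scrapy startproject sunPro
cd sunPro
scrapy genspider -t crawl sun www.xxx.com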
LinkExtractor (link extractor)
Extracts links from a response according to the specified rule (allow=regex):
link = LinkExtractor(allow=r'type=4&page=\d+')  # extract pagination links
link_detail = LinkExtractor(allow=r'question/\d+/\d+\.shtml')  # extract detail-page links
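As a side note (not in the original post), a link extractor can also be used directly against a response object. A minimal sketch, assuming `response` is a Scrapy Response for one of the list pages:
from scrapy.linkextractors import LinkExtractor

link = LinkExtractor(allow=r'type=4&page=\d+')
# extract_links() returns scrapy.link.Link objects; .url holds the absolute URL
for lnk in link.extract_links(response):
    print(lnk.url)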
Rule (rule parser)
A rule parser sends a request for every link extracted by its link extractor and parses the response with the specified callback:
rules = (
    Rule(link, callback='parse_item', follow=False),
    Rule(link_detail, callback='parse_detail'),
)
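One detail worth calling out: the follow flag controls whether the link extractor is also applied to the pages this rule crawls. A sketch of the alternative setting (the original post keeps follow=False for the pagination rule):
rules = (
    # follow=True: links matched by `link` are also extracted from the pages
    # this rule crawls, so every page of the pagination is reached, not only
    # the page numbers linked from the start URL.
    Rule(link, callback='parse_item', follow=True),
    Rule(link_detail, callback='parse_detail'),
)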
Using CrawlSpider to implement deep crawling
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunPro.items import SunproItem_content, SunproItem

class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']

    # Instantiate the link extractor objects.
    # A link extractor extracts links according to the specified rule (allow=regex).
    link = LinkExtractor(allow=r'type=4&page=\d+')  # pagination links
    link_detail = LinkExtractor(allow=r'question/\d+/\d+\.shtml')  # detail-page links

    rules = (
        # Rule parsers: send a request for each extracted link and parse the
        # response with the specified callback.
        Rule(link, callback='parse_item', follow=False),
        Rule(link_detail, callback='parse_detail'),
    )

    # This method is called once per request generated by the pagination rule.
    def parse_item(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/@title').extract_first()
            status = tr.xpath('./td[3]/span/text()').extract_first()
            item = SunproItem()
            item['title'] = title
            item['status'] = status
            yield item

    def parse_detail(self, response):
        content = response.xpath('/html/body/div[9]/table[2]//tr[1]').extract()
        content = ''.join(content)
        item = SunproItem_content()
        item['content'] = content
        yield item
The items.py file:
import scrapy

class SunproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    status = scrapy.Field()

class SunproItem_content(scrapy.Item):
    content = scrapy.Field()
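The original post stops at the items, but since the spider yields two different item classes, a pipeline usually has to tell them apart. A minimal pipelines.py sketch (the SunproPipeline name and the print-based "storage" are assumptions, not from the post; the pipeline must also be enabled under ITEM_PIPELINES in settings.py):
class SunproPipeline:
    def process_item(self, item, spider):
        # Distinguish the two item classes yielded by SunSpider.
        if item.__class__.__name__ == 'SunproItem':
            print(item['title'], item['status'])
        else:
            print(item['content'])
        return item
The spider is then run with scrapy crawl sun.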
Original post: https://www.cnblogs.com/zhufanyu/p/12020532.html