Once again I'm scraping the latest software from the green-software download site, and once again with a different technique (the first two posts used regular expressions and XPath), heh.
It feels a bit like Kong Yiji counting the several ways to write the character 茴.
This time CrawlSpider and Rule work together to drive the crawl.
There is no longer any need to generate a long list of start_urls entries by hand; the rules discover the pages automatically. For contrast, a sketch of the old approach is below, followed by the core code.
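A hypothetical sketch of the earlier start_urls-based approach (the page range here is an assumed example, not from the original post):

```python
# Hypothetical sketch of the earlier approach: build every list-page URL
# up front instead of letting CrawlSpider rules discover them.
# The page range (1..9) is an assumed example.
start_urls = ['http://www.downg.com/new/0_%d.html' % page
              for page in range(1, 10)]
```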
```python
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule  # the old scrapy.contrib paths are deprecated
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request


class MySpider(CrawlSpider):
    name = "downg"
    allowed_domains = ["downg.com"]
    start_urls = ['http://www.downg.com/new/0_1.html']

    # Follow the pagination links found inside the "pages" div; every
    # matched list page is handed to parse_pages, and follow=True keeps
    # the rule running on those pages as well.
    rules = [
        Rule(LinkExtractor(allow=(r'/new/0_\d+\.html',),
                           restrict_xpaths=('//div[@class="pages"]',)),
             callback='parse_pages',
             follow=True),
    ]

    def parse_pages(self, response):
        # Collect the detail-page URL of every app listed on this page.
        urls_list = response.xpath('//span[@class="app-name"]/a/@href').extract()
        print(len(urls_list), urls_list)
        return [Request(url, callback=self.getDetail) for url in urls_list]

    def getDetail(self, response):
        print(response.url)
```
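With the spider saved in a normal Scrapy project it runs as usual with `scrapy crawl downg`: CrawlSpider fetches start_urls, applies the rules to every response, schedules the extracted pagination links, and calls parse_pages on each matching list page.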
Key points:
The allow pattern of LinkExtractor is the regex that matches the paginated list-page URLs; restrict_xpaths narrows extraction to the pagination div so unrelated links on the page are ignored, and follow=True keeps the rule applying to every page it reaches.
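The extractor can also be probed on its own; a minimal sketch, assuming `response` was fetched beforehand (for example with `scrapy shell` on one of the list pages):

```python
# Minimal sketch: run the LinkExtractor by itself; `response` is assumed
# to hold a fetched downg.com list page (e.g. obtained via `scrapy shell`).
from scrapy.linkextractors import LinkExtractor

extractor = LinkExtractor(allow=(r'/new/0_\d+\.html',),
                          restrict_xpaths=('//div[@class="pages"]',))
for link in extractor.extract_links(response):
    print(link.url)  # only pagination URLs matching the regex come back
```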
Original article: http://webscrapy.blog.51cto.com/8343567/1534966