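This example uses Scrapy to crawl http://quotes.toscrape.com/, a site whose content is very simple.
After creating the project with scrapy, you mainly need to modify two files: items.py and the spider file under the spiders directory, named quotes.py here.
items.py defines the container that holds the scraped data; an item is used the same way as a dictionary.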
import scrapy


class Myscrapy1Item(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
quotes.py
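The parse function in this file parses the responses returned for start_urls, extracts the data, and generates the further requests to be processed.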
# -*- coding: utf-8 -*-
import scrapy

from myscrapy1.items import Myscrapy1Item


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page sits in an element with class "quote"
        quotes = response.css('.quote')
        for quote in quotes:
            item = Myscrapy1Item()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

        # Follow the "next" link to fetch the remaining pages
        next_page = response.css('.pager .next a::attr("href")').extract_first()
        if next_page is not None:  # the last page has no "next" link
            url = response.urljoin(next_page)  # build an absolute URL
            # New requests must be constructed with scrapy.Request
            yield scrapy.Request(url=url, callback=self.parse)
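The step that turns the relative "next" link into an absolute URL can be tried without running a crawl. Scrapy's documentation describes response.urljoin as a wrapper over urllib.parse.urljoin, so the sketch below uses the standard library directly. The href value '/page/2/' is assumed for illustration. The sketch also shows that the string method str.join is a different operation and does not build a URL.

```python
from urllib.parse import urljoin

base = 'http://quotes.toscrape.com/'
next_href = '/page/2/'   # assumed example of a relative "next" link

# Resolve the relative link against the URL of the current page
absolute = urljoin(base, next_href)
print(absolute)   # http://quotes.toscrape.com/page/2/

# str.join inserts the base string between every character of its argument,
# so it produces a long garbled string rather than a URL.
wrong = base.join(next_href)
print(wrong == absolute)   # False
```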
Original article: https://www.cnblogs.com/regit/p/9400542.html