1. Create a new project named sun0769 with scrapy
scrapy startproject sun0769
2. Define the fields to scrape in items.py
import scrapy


class Sun0769Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    problem_type = scrapy.Field()
    title = scrapy.Field()
    number = scrapy.Field()
    content = scrapy.Field()
    Processing_status = scrapy.Field()
    url = scrapy.Field()
3. Quickly generate a CrawlSpider template
scrapy genspider -t crawl dongguan wz.sun0769.com
Note: the spider name used here (dongguan) must not be the same as the project name.
4. Open dongguan.py and write the code
# -*- coding: utf-8 -*-
# Import the scrapy module
import scrapy
# Import the link extractor, used to pull out links that match a pattern
from scrapy.linkextractors import LinkExtractor
# Import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
# Import the item class defined in items.py
from sun0769.items import Sun0769Item


class DongguanSpider(CrawlSpider):
    name = 'dongguan'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://d.wz.sun0769.com/index.php/question/huiyin?page=30']
    # Pagination links of the listing
    pagelink = LinkExtractor(allow=r"page=\d+")
    # Links to the individual question pages (the dot is escaped so it matches literally)
    pagelink2 = LinkExtractor(allow=r"/question/\d+/\d+\.shtml")

    rules = (
        Rule(pagelink, follow=True),
        Rule(pagelink2, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # print(response.url)
        item = Sun0769Item()
        # xpath() returns a list, so take the first match
        # item['problem_type'] = response.xpath('//a[@class="red14"]').extract()
        header = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0]
        item['title'] = header.split(" ")[-1].split(":")[-1]
        # item['title'] = header
        item['number'] = header.split(":")[1].split(" ")[0]
        # item['content'] = response.xpath().extract()
        # item['Processing_status'] = response.xpath('//div/span[@class="qgrn"]/text()').extract()[0]
        # Hand the item over to the pipeline
        yield item
5. Write the code in pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class TencentPipeline(object):
    # The class name must match the entry registered in ITEM_PIPELINES

    def open_spider(self, spider):
        # Python 3: open in text mode with an explicit encoding
        self.filename = open("dongguan.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
6. Configure the relevant options in settings.py
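The post does not show which settings were changed. As a minimal sketch, the one entry the pipeline from step 5 certainly needs is its registration in ITEM_PIPELINES. The dotted path below assumes the class is still called TencentPipeline and lives in sun0769/pipelines.py; the priority 300 is an arbitrary value, not something taken from the original.

```python
# settings.py (fragment) -- a sketch, not the author's actual file.

# Register the pipeline so Scrapy calls it for every yielded item.
# Key: dotted path to the pipeline class (assumed from step 5).
# Value: an integer priority; pipelines with lower numbers run first.
ITEM_PIPELINES = {
    "sun0769.pipelines.TencentPipeline": 300,
}
```

If the class is renamed (for example to match the project), the key here has to be changed as well, otherwise the pipeline will not be loaded.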
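Open questions:
1. How can content from different pages be combined into a single item?
2. Matching the content is still somewhat difficult (xpath, re)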
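For the second question, one possible way to make the matching less fragile is to replace the chained split() calls with a single regular expression. The sketch below is only an illustration: the helper parse_header, the pattern, and the assumed header layout ("label: title, whitespace, label: digits") are hypothetical and would need to be adjusted to the text the page really returns.

```python
import re

# Assumed (hypothetical) header layout: "<label>: <title>  <label>: <digits>".
# Both ASCII and fullwidth colons are accepted after each label.
HEADER_RE = re.compile(
    r"^\s*\S+?[::]\s*(?P<title>.+?)\s+\S+?[::]\s*(?P<number>\d+)\s*$"
)


def parse_header(text):
    """Return a dict with 'title' and 'number', or None if the text does not match."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return {"title": match.group("title"), "number": match.group("number")}


if __name__ == "__main__":
    # Made-up sample string, not real data from the site
    print(parse_header("Question: Road repair request  Number: 12345"))
```

Returning None on a failed match lets the spider skip or log a page whose layout differs, instead of raising IndexError the way an out-of-range split()[1] would.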
Original article: http://www.cnblogs.com/cuzz/p/7630314.html