Crawling Two-Level Web Pages

Date: 2017-10-06 00:24:42

1. Create a new Scrapy project named sun0769

scrapy startproject sun0769

2. Define the fields to scrape in items.py

import scrapy


class Sun0769Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    problem_type = scrapy.Field()
    title = scrapy.Field()
    number = scrapy.Field()
    content = scrapy.Field()
    Processing_status = scrapy.Field()
    url = scrapy.Field()

3. Generate a CrawlSpider template

scrapy genspider -t crawl dongguan wz.sun0769.com

Note: the spider name (dongguan here) must not be the same as the project name.

4. Open dongguan.py and write the spider

# -*- coding: utf-8 -*-
# Import the scrapy module
import scrapy
# LinkExtractor pulls out the links that match a given pattern
from scrapy.linkextractors import LinkExtractor
# CrawlSpider and Rule drive the rule-based crawl
from scrapy.spiders import CrawlSpider, Rule
# The item class defined in items.py
from sun0769.items import Sun0769Item


class DongguanSpider(CrawlSpider):
    name = 'dongguan'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://d.wz.sun0769.com/index.php/question/huiyin?page=30']
    # Links to further listing pages
    pagelink = LinkExtractor(allow=r"page=\d+")
    # Links to individual post pages (the dot before "shtml" is escaped)
    pagelink2 = LinkExtractor(allow=r"/question/\d+/\d+\.shtml")

    rules = (
        Rule(pagelink, follow=True),
        Rule(pagelink2, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # print(response.url)
        item = Sun0769Item()
        # xpath() returns a list of selectors; extract() turns it into a list of strings
        # item['problem_type'] = response.xpath('//a[@class="red14"]').extract()
        item['title'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0].split(" ")[-1].split(":")[-1]
        # item['title'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0]
        item['number'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0].split("：")[1].split("  ")[0]
        # item['content'] = response.xpath().extract()
        # item['Processing_status'] = response.xpath('//div/span[@class="qgrn"]/text()').extract()[0]
        # Hand the item on to the pipeline
        yield item
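
The chained split() calls in parse_item depend on the exact spacing and colon style of the heading text. As an alternative, a regular expression can pull both fields out in one pass. The sketch below is plain Python and assumes the heading looks like "提问：<title>  编号:<number>"; that layout, the sample string, and the helper name parse_heading are assumptions for illustration, not taken from the live page.

```python
import re

# Assumed heading layout: "提问：<title>  编号:<number>".
# Accepts both the full-width "：" and the ASCII ":" after each label,
# and any whitespace (including non-breaking spaces) between the parts.
HEADING_RE = re.compile(r"提问[：:]\s*(?P<title>.*?)\s*编号[：:]\s*(?P<number>\d+)")


def parse_heading(text):
    """Return (title, number) from a heading string, or (None, None) if it does not match."""
    match = HEADING_RE.search(text)
    if match is None:
        return None, None
    return match.group("title"), match.group("number")


# Hypothetical sample, not copied from the site
sample = "提问：路灯不亮\xa0\xa0编号:191166"
title, number = parse_heading(sample)
```

Inside parse_item, the string passed to parse_heading would be the first element of the extract() result for the strong[@class="tgray14"] node.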

5. Write the pipeline in pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class TencentPipeline(object):
    def open_spider(self, spider):
        # Python 3: open in text mode with an explicit encoding
        self.filename = open("dongguan.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
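
Because the pipeline writes one JSON object per line, dongguan.json is a JSON Lines file rather than a single JSON document, so calling json.load on the whole file would fail once it holds more than one item. A minimal sketch for reading it back, assuming the file is UTF-8 encoded; the helper name load_items is hypothetical:

```python
import json


def load_items(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    items = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items
```

After a crawl, load_items("dongguan.json") would return a list of dicts, one per scraped item.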

6. Configure the relevant settings in settings.py
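
The post does not list which settings were changed. At a minimum, the pipeline from step 5 has to be enabled before it receives any items. A minimal sketch, assuming the class keeps the name TencentPipeline and lives in sun0769/pipelines.py; the priority value 300 is an arbitrary choice:

```python
# settings.py (fragment)
# Assumption: the pipeline class from step 5 lives in sun0769/pipelines.py.
ITEM_PIPELINES = {
    # Values are priorities in the 0-1000 range; lower numbers run first.
    "sun0769.pipelines.TencentPipeline": 300,
}
```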

Open questions:

1. How to combine content from different pages into a single item

2. Content matching (XPath, re) is still somewhat difficult

Original article: http://www.cnblogs.com/cuzz/p/7630314.html