
Crawling Two-Level Web Pages

Posted: 2017-10-06 00:24:42


1. Create a new sun0769 project with Scrapy

scrapy startproject sun0769
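On Scrapy 1.x this command generates roughly the following layout (a sketch from memory, not shown in the original post):

sun0769/
    scrapy.cfg            # deploy configuration
    sun0769/              # the project's Python package
        __init__.py
        items.py          # item definitions (step 2)
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines (step 5)
        settings.py       # project settings (step 6)
        spiders/          # spiders live here (step 4)
            __init__.py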

2. Define the content to crawl in items.py

import scrapy


class Sun0769Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    problem_type = scrapy.Field()
    title = scrapy.Field()
    number = scrapy.Field()
    content = scrapy.Field()
    processing_status = scrapy.Field()
    url = scrapy.Field()
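Scrapy items behave like dicts, which is how both the spider and the pipeline below use them; a quick sanity check in a Python shell (illustrative only):

item = Sun0769Item()
item['title'] = u'example title'   # fields are set dict-style
print(dict(item))                  # {'title': u'example title'}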

3. Quickly generate a CrawlSpider template

scrapy genspider -t crawl dongguan wz.sun0769.com

Note: the spider name here must not be the same as the project name.
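The generated dongguan.py starts out as roughly the following skeleton (reproduced from memory of the Scrapy 1.x crawl template; details vary by version):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DongguanSpider(CrawlSpider):
    name = 'dongguan'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        return i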

4. Open dongguan.py and write the spider code

# -*- coding: utf-8 -*-
# import the scrapy module
import scrapy
# import the link-extractor class, used to pull out links that match a rule
from scrapy.linkextractors import LinkExtractor
# import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
# import the item class defined in items.py
from sun0769.items import Sun0769Item


class DongguanSpider(CrawlSpider):
    name = "dongguan"
    allowed_domains = ["wz.sun0769.com"]
    start_urls = ["http://d.wz.sun0769.com/index.php/question/huiyin?page=30"]
    # matches links to the other list pages
    pagelink = LinkExtractor(allow=r"page=\d+")
    # matches links to the detail pages
    pagelink2 = LinkExtractor(allow=r"/question/\d+/\d+\.shtml")

    rules = (
        # list pages are only followed, not parsed
        Rule(pagelink, follow=True),
        # every detail page is handed to parse_item
        Rule(pagelink2, callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # print response.url
        item = Sun0769Item()
        # xpath() returns a list
        # item['problem_type'] = response.xpath('//a[@class="red14"]').extract()
        # the header text looks like "编号:NNNN  提问:<title>"
        head = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0]
        item['title'] = head.split(" ")[-1].split(":")[-1]
        # the "编号:" separator is assumed from the header format above
        item['number'] = head.split("编号:")[1].split("  ")[0]
        # item['content'] = response.xpath(...).extract()
        # item['processing_status'] = response.xpath('//div/span[@class="qgrn"]/text()').extract()[0]
        # hand the item over to the pipeline
        yield item
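With the spider in place, it can be run from the project root directory:

scrapy crawl dongguan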

5. Write the pipeline code in pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

class Sun0769Pipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open("dongguan.json", "w")

    def process_item(self, item, spider):
        # serialize each item as one JSON line, keeping Chinese readable
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.file.close()
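For a simple dump like this, Scrapy's built-in feed export can do the same job without a custom pipeline:

scrapy crawl dongguan -o dongguan.json

A hand-written pipeline only becomes necessary for custom formatting or for writing to a database.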

 

6. Configure the relevant options in settings.py
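At a minimum, the pipeline from step 5 has to be registered in ITEM_PIPELINES (a minimal sketch, assuming the class name used above; ROBOTSTXT_OBEY and DOWNLOAD_DELAY are common optional additions, not from the original post):

ITEM_PIPELINES = {
    "sun0769.pipelines.Sun0769Pipeline": 300,
}

# optional: ignore robots.txt and throttle requests a little
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1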


 

Problems:

1. How to merge content from different pages into a single item (see the sketch below).

2. Content matching is still somewhat difficult (XPath, re).
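For problem 1, the usual Scrapy pattern (a minimal sketch with hypothetical XPaths and names, not from the original post) is to fill part of the item on the first page, then carry it to the detail page through request.meta:

# -*- coding: utf-8 -*-
import scrapy
from sun0769.items import Sun0769Item

class TwoLevelSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = "twolevel"
    start_urls = ["http://wz.sun0769.com/index.php/question/huiyin?page=30"]

    def parse(self, response):
        # fill what the list page knows, then follow the detail link
        for row in response.xpath('//a[@class="news14"]'):   # hypothetical XPath
            item = Sun0769Item()
            item['title'] = row.xpath('./text()').extract_first()
            detail_url = response.urljoin(row.xpath('./@href').extract_first())
            yield scrapy.Request(detail_url, meta={'item': item},
                                 callback=self.parse_detail)

    def parse_detail(self, response):
        # retrieve the half-filled item and complete it
        item = response.meta['item']
        item['content'] = response.xpath('//div[@class="contentext"]/text()').extract_first()  # hypothetical XPath
        yield item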

 



Original article: http://www.cnblogs.com/cuzz/p/7630314.html
