
Scrapy example: scraping Anjuke rental listings

Posted: 2018-10-22 10:20:06


This crawl targets the Anjuke site, collecting rental listings for Changning district, Shanghai; adapted from a WeChat public account article.

As before, the crawler is built with the Scrapy framework, in five steps:

  1. Analyze the page
  2. items.py
  3. spiders.py
  4. pipelines.py
  5. settings.py

  • Observe the page

    Rental listings for Changning district, Shanghai: https://sh.zu.anjuke.com/fangyuan/changning/
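Paging through the results only changes a /pN/ suffix on that URL — the pattern the crawl rules rely on later. A small sketch, with the pattern inferred from the site's pagination links:

```python
# Pagination pattern for the Changning listing pages (inferred from the
# crawl rules below: page N lives at fangyuan/changning/pN/).
BASE = "https://sh.zu.anjuke.com/fangyuan/changning/"

def page_url(n: int) -> str:
    """Return the URL for result page n (page 1 is the bare listing URL)."""
    return BASE if n == 1 else f"{BASE}p{n}/"

print(page_url(3))  # https://sh.zu.anjuke.com/fangyuan/changning/p3/
```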

  • items.py

    Define the fields that will hold the scraped data.

import scrapy


class AnjukespiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    price = scrapy.Field()
    rent_type = scrapy.Field()
    house_type = scrapy.Field()
    area = scrapy.Field()
    towards = scrapy.Field()
    floor = scrapy.Field()
    decoration = scrapy.Field()
    building_type = scrapy.Field()
    community = scrapy.Field()
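A scrapy.Item behaves like a dict whose keys are restricted to the declared fields; assigning to an undeclared key raises KeyError. A minimal stand-in (not Scrapy's real implementation) to illustrate that behaviour without needing Scrapy installed:

```python
# Minimal stand-in for scrapy.Item: dict-like, but only declared fields
# may be assigned. Purely illustrative -- the real class is scrapy.Item.
class MiniItem(dict):
    fields = {"price", "area"}  # declared via scrapy.Field() in the real class

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"MiniItem does not support field: {key!r}")
        super().__setitem__(key, value)

item = MiniItem()
item["price"] = 4500
item["area"] = 32
print(dict(item))  # {'price': 4500, 'area': 32}
```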
  • spider.py

    The spider itself: it tells Scrapy what to crawl and how to crawl it.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from anjukeSpider.items import AnjukespiderItem


# spider class
class anjuke(CrawlSpider):
    # spider name
    name = 'anjuke'
    # start URL
    start_urls = ['https://sh.zu.anjuke.com/fangyuan/changning/']
    # crawl rules
    rules = (
        # the listing pages have a "next page" button, so follow=True crawls them all
        Rule(LinkExtractor(allow=r'fangyuan/p\d+/'), follow=True),
        # detail pages also link "recommended" listings that are not necessarily
        # in Changning, so follow=False keeps the crawl from wandering off
        Rule(LinkExtractor(allow=r'https://sh.zu.anjuke.com/fangyuan/\d{10}'), follow=False, callback='parse_item'),
    )

    # callback: mostly XPath expressions, covered in the previous post
    def parse_item(self, response):
        item = AnjukespiderItem()
        # rent (yuan per month)
        item['price'] = int(response.xpath("//ul[@class='house-info-zufang cf']/li[1]/span[1]/em/text()").extract_first())
        # rental type (whole flat / shared)
        item['rent_type'] = response.xpath("//ul[@class='title-label cf']/li[1]/text()").extract_first()
        # layout
        item['house_type'] = response.xpath("//ul[@class='house-info-zufang cf']/li[2]/span[2]/text()").extract_first()
        # floor area, in square metres
        item['area'] = int(response.xpath("//ul[@class='house-info-zufang cf']/li[3]/span[2]/text()").extract_first().replace('平方米', ''))
        # orientation
        item['towards'] = response.xpath("//ul[@class='house-info-zufang cf']/li[4]/span[2]/text()").extract_first()
        # floor
        item['floor'] = response.xpath("//ul[@class='house-info-zufang cf']/li[5]/span[2]/text()").extract_first()
        # decoration
        item['decoration'] = response.xpath("//ul[@class='house-info-zufang cf']/li[6]/span[2]/text()").extract_first()
        # building type
        item['building_type'] = response.xpath("//ul[@class='house-info-zufang cf']/li[7]/span[2]/text()").extract_first()
        # residential community
        item['community'] = response.xpath("//ul[@class='house-info-zufang cf']/li[8]/a[1]/text()").extract_first()
        yield item
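Note the two int(...) conversions: if an XPath ever returns None on an unusual page, the callback raises and that item is lost. A defensive helper (hypothetical, not part of the original spider) would absorb that:

```python
import re

def to_int(text, default=None):
    """Extract the first integer from text like '4500' or '32平方米'.

    Returns `default` when text is None or contains no digits, so a
    malformed detail page degrades to a missing field instead of a crash.
    (Illustrative helper, not part of the original spider.)
    """
    if text is None:
        return default
    m = re.search(r"\d+", text)
    return int(m.group()) if m else default

print(to_int("32平方米"))  # 32
print(to_int(None, 0))     # 0
```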
  • pipelines.py

    Persist the scraped data; here it is saved as JSON only.

    This part is optional: you can skip the pipeline entirely and pass flags at run time instead:

        scrapy crawl anjuke -o anjuke.json -t json
        scrapy crawl <spider name> -o <output file> -t <format>
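On newer Scrapy releases (2.1 and later), the same export can also be declared once in settings.py through the FEEDS setting instead of command-line flags. A sketch, assuming Scrapy >= 2.1:

```python
# settings.py -- feed export without a custom pipeline (Scrapy >= 2.1)
FEEDS = {
    "zufang_shanghai.json": {
        "format": "json",
        "encoding": "utf8",
    },
}
```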

from scrapy.exporters import JsonItemExporter


class AnjukespiderPipeline(object):
    def __init__(self):
        # output file path
        self.file = open('zufang_shanghai.json', 'wb')
        self.exporter = JsonItemExporter(self.file, ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        print('write')
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        print('close')
        self.exporter.finish_exporting()
        self.file.close()
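What JsonItemExporter does can be sketched with the standard library alone: write items between start_exporting() and finish_exporting() as one JSON array. A rough stand-in, not Scrapy's actual implementation:

```python
# Stand-in for scrapy.exporters.JsonItemExporter, stdlib only: items are
# streamed into the file as a single JSON array. Illustrative, not the
# real implementation.
import io
import json

class MiniJsonExporter:
    def __init__(self, file, ensure_ascii=False):
        self.file = file
        self.ensure_ascii = ensure_ascii
        self.first = True

    def start_exporting(self):
        self.file.write(b"[")

    def export_item(self, item):
        prefix = b"" if self.first else b",\n"
        self.first = False
        data = json.dumps(item, ensure_ascii=self.ensure_ascii).encode("utf-8")
        self.file.write(prefix + data)

    def finish_exporting(self):
        self.file.write(b"]")

buf = io.BytesIO()
exporter = MiniJsonExporter(buf)
exporter.start_exporting()
exporter.export_item({"price": 4500, "community": "中山公寓"})
exporter.finish_exporting()
print(buf.getvalue().decode("utf-8"))  # [{"price": 4500, "community": "中山公寓"}]
```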
  • settings.py

    Edit the settings file to enable the pipeline.

    Also set a download delay, so requests don't come fast enough to get us blocked.

ITEM_PIPELINES = {
    'anjukeSpider.pipelines.AnjukespiderPipeline': 300,
}

DOWNLOAD_DELAY = 2
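DOWNLOAD_DELAY is a fixed pause. Scrapy also ships an AutoThrottle extension that adapts the delay to observed server latency; a sketch of the relevant settings (values illustrative, not from the original project):

```python
# settings.py -- adaptive throttling instead of a fixed delay
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2         # initial delay, seconds
AUTOTHROTTLE_MAX_DELAY = 10          # ceiling when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per site
```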

 

  • To run, open a terminal at the project root and enter:
    scrapy crawl <spider name>
  • PS F:\ScrapyProject\anjukeSpider\anjukeSpider> scrapy crawl anjuke

    Run complete.

    60 items were scraped across 61 requests, and the JSON file was generated at the configured path.

  • 2018-10-22 09:02:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 40861,
     'downloader/request_count': 61,
     'downloader/request_method_count/GET': 61,
     'downloader/response_bytes': 1925879,
     'downloader/response_count': 61,
     'downloader/response_status_count/200': 61,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 10, 22, 1, 2, 55, 245128),
     'item_scraped_count': 60,
     'log_count/DEBUG': 122,
     'log_count/INFO': 9,
     'request_depth_max': 1,
     'response_received_count': 61,
     'scheduler/dequeued': 61,
     'scheduler/dequeued/memory': 61,
     'scheduler/enqueued': 61,
     'scheduler/enqueued/memory': 61,
     'start_time': datetime.datetime(2018, 10, 22, 1, 0, 29, 555537)}
    2018-10-22 09:02:55 [scrapy.core.engine] INFO: Spider closed (finished)
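The stats are worth a sanity check: 61 requests at DOWNLOAD_DELAY = 2 should take at least roughly two minutes, which matches the start_time and finish_time reported above:

```python
# Elapsed time of the run, from the start_time/finish_time in the stats dump.
from datetime import datetime

start = datetime(2018, 10, 22, 1, 0, 29, 555537)
finish = datetime(2018, 10, 22, 1, 2, 55, 245128)
elapsed = (finish - start).total_seconds()
print(round(elapsed, 1))  # 145.7 -- consistent with 61 requests at a 2 s delay
```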

    That completes the spider. The raw scraped data isn't very readable on its own, though, so it still needs visualization (with the pyecharts module) — that will be covered in a separate post on pyecharts.

  • pyecharts official documentation: http://pyecharts.org/#/zh-cn/

 


Original post: https://www.cnblogs.com/toheart/p/9828328.html
