Python Crawler Framework Scrapy, Study Notes 4: The Second Scrapy Project

Posted: 2015-01-06 18:11:44

Tags: scrapy

1. Task 1: fetch the content of the following two URLs and write it to files

http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/

[Project screenshot]

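Unlike the previous project, this spider does not define a rules attribute; it defines a parse method instead. This method tells Scrapy what to do once it has fetched the content of the start URLs. For this first task, we simply write the content to a file.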

2. Task 2: practice XPath in the Scrapy shell

In the project's top-level directory, run:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

The shell exposes a local variable named response, which holds the content fetched when the shell started.

Practice some simple XPath expressions

The xpath method returns a list of selectors. You can call xpath again on any of those selectors to dig deeper.

Finally, press Ctrl+Z (on Windows; Ctrl+D is the usual equivalent on Linux and macOS) or type quit twice to exit the shell.

3. Task 3: select title, link, and desc from the response and print them to the console.

To do this, rewrite the spider:

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
                  "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"]


    def parse(self, response):
        # Each <li> under a <ul> is treated as one directory entry.
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print(title, link, desc)
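Because extract() returns a list of strings, and the text nodes of an li element may include fragments that are only whitespace, the printed values can look noisy. A small helper along these lines could tidy them up; the function name clean and the sample input are hypothetical and not part of the original project:

```python
def clean(values):
    """Strip each extracted fragment, drop empty ones, and join the rest."""
    return " ".join(v.strip() for v in values if v.strip())


# Hypothetical raw output of sel.xpath('text()').extract() for one <li>.
raw_desc = ["\r\n\t", " - By Some Author; a sample description. \r\n"]
print(clean(raw_desc))
```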

4. Task 4: write title, link, and desc to a file as JSON

Edit the project's items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Rewrite the spider:

__author__ = 'DB'

import scrapy
from project002.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
                  "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"]


    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            # Fill one item per <li>; each field holds a list of strings.
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

Run the spider again:

scrapy crawl dmoz -o items.json
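The JSON feed export writes the yielded items as a single JSON array, so the result can be read back with the standard json module. A minimal sketch, assuming the file is named items.json as in the command above; the helper names load_items and first_value are hypothetical:

```python
import json


def load_items(path):
    """Read the JSON array written by the feed export into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def first_value(item, field):
    """Return the first extracted value of a field, or None if there is none."""
    values = item.get(field) or []
    return values[0] if values else None
```

For example, looping over load_items("items.json") and printing first_value(item, "title") and first_value(item, "link") gives one title and link per line. Each field is a list because extract() returns a list of strings.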

Original post: http://dingbo.blog.51cto.com/8808323/1599837