Really, I just wanted to try scraping some images. Looking at the page first, there are two things to grab: the cover image and the download links. Pretty simple.
Item definition:
import scrapy


class TiantianmeijuItem(scrapy.Item):
    name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()
    episode = scrapy.Field()
    episode_url = scrapy.Field()
name stores the show's name.
image_urls and images are used by the image-downloading pipeline: one holds the image URLs to fetch, the other holds information about the stored images.
image_paths has no real function here; it just records the paths of successfully downloaded images.
episode and episode_url store the episode names and their corresponding download links.
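To make the field layout concrete, here is a sketch of what a filled-in item could look like after parsing; all the values below are made up for illustration, not real scraped data:

item = TiantianmeijuItem()
item['name'] = [u'Some Show Title']                    # illustrative value
item['image_urls'] = ['http://example.com/cover.jpg']
item['episode'] = [u'S01E01', u'S01E02']
item['episode_url'] = ['ed2k://example-link-1', 'ed2k://example-link-2']
# 'images' and 'image_paths' stay empty at this point; the image pipeline fills them in later.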
Spider:
import scrapy
from tiantianmeiju.items import TiantianmeijuItem
import sys

reload(sys)  # Python 2.5+ removes sys.setdefaultencoding after startup, so reload sys to get it back
sys.setdefaultencoding('utf-8')


class CacthUrlSpider(scrapy.Spider):
    name = 'meiju'
    allowed_domains = ['cn163.net']
    start_urls = ["http://cn163.net/archives/{id}/".format(id=id)
                  for id in ['16355', '13470', '18766', '18805']]

    def parse(self, response):
        item = TiantianmeijuItem()
        item['name'] = response.xpath('//*[@id="content"]/div[2]/div[2]/h2/text()').extract()
        item['image_urls'] = response.xpath('//*[@id="entry"]/div[2]/img/@src').extract()
        item['episode'] = response.xpath('//*[@id="entry"]/p[last()]/a/text()').extract()
        item['episode_url'] = response.xpath('//*[@id="entry"]/p[last()]/a/@href').extract()
        yield item
The page layout is fairly simple.
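Before running the whole spider, the XPaths can be sanity-checked in the Scrapy shell; a quick sketch using one of the start URLs from above:

# scrapy shell "http://cn163.net/archives/16355/"
# Inside the shell, the same selectors from parse() can be tried interactively:
response.xpath('//*[@id="content"]/div[2]/div[2]/h2/text()').extract()   # show name
response.xpath('//*[@id="entry"]/div[2]/img/@src').extract()             # cover image URL(s)
response.xpath('//*[@id="entry"]/p[last()]/a/text()').extract()          # episode names
response.xpath('//*[@id="entry"]/p[last()]/a/@href').extract()           # download links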
Pipelines: there are two pipelines here, one that writes the download links to a file and one that downloads the images.
import json
import os

from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
from settings import IMAGES_STORE


class TiantianmeijuPipeline(object):
    def process_item(self, item, spider):
        return item


class WriteToFilePipeline(object):
    """Write the episode names and download links into the show's folder."""

    def process_item(self, item, spider):
        item = dict(item)
        FolderName = item['name'][0].replace('/', '')  # show name doubles as the folder name
        downloadFile = 'download_urls.txt'
        with open(os.path.join(IMAGES_STORE, FolderName, downloadFile), 'w') as file:
            for name, url in zip(item['episode'], item['episode_url']):
                file.write('{name}: {url}\n'.format(name=name, url=url))
        return item


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Attach the item to each request so file_path() can read the show's name
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'item': item})

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        FolderName = item['name'][0].replace('/', '')
        image_guid = request.url.split('/')[-1]
        filename = u'{}/{}'.format(FolderName, image_guid)
        return filename
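For reference, the results argument that Scrapy passes to item_completed is a list of (success, info) tuples, one per requested image; the values sketched below are illustrative:

# results ~ [(True,  {'url': 'http://example.com/cover.jpg',
#                     'path': u'Some Show Title/cover.jpg',
#                     'checksum': '...'}),
#            (False, Failure(...))]   # a failed download comes back as (False, Failure)
# The list comprehension in item_completed keeps only the 'path' of the successful ones.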
get_media_requests and item_completed follow the usual ImagesPipeline pattern. The default image storage path, however, is
<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg,
and I wanted to replace full with a folder named after the show, so file_path is overridden as well.
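Here is a quick standalone sketch of that file_path logic, so the naming scheme can be checked without running a crawl; the show name and URL below are hypothetical:

# -*- coding: utf-8 -*-
# Mirrors the path-building logic in MyImagesPipeline.file_path
def build_image_path(show_name, image_url):
    folder_name = show_name.replace('/', '')   # strip '/' so the name is a valid directory
    image_guid = image_url.split('/')[-1]      # keep the original file name from the URL
    return u'{}/{}'.format(folder_name, image_guid)

print(build_image_path(u'Some Show Title', 'http://example.com/images/cover.jpg'))
# -> u'Some Show Title/cover.jpg', i.e. saved as <IMAGES_STORE>/Some Show Title/cover.jpg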
Enable the pipeline configuration in settings:
import os

ITEM_PIPELINES = {
    'tiantianmeiju.pipelines.WriteToFilePipeline': 2,
    'tiantianmeiju.pipelines.MyImagesPipeline': 1,
}
IMAGES_STORE = os.path.join(os.getcwd(), 'image')  # image storage path
IMAGES_EXPIRES = 90
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
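Note that lower numbers in ITEM_PIPELINES run first, so MyImagesPipeline (1) downloads the images before WriteToFilePipeline (2) writes the link file; that ordering matters, since saving the cover image is what creates the show's folder that the link file is then written into. With these settings the output lands under ./image; a sketch of the expected layout (the folder name is illustrative):

# image/                      <- IMAGES_STORE
#     Some Show Title/        <- one folder per show, named from item['name']
#         cover.jpg           <- saved by MyImagesPipeline.file_path
#         download_urls.txt   <- written by WriteToFilePipeline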
After the crawl finishes, this is the result:
This post is from the "运维笔记" (Ops Notes) blog; please keep this attribution: http://lihuipeng.blog.51cto.com/3064864/1713531