
Web Scraping Basics 5-2: Downloading Images with the Scrapy Framework

Posted: 2019-03-17 10:19:42


scrapy startproject bmw

cd bmw

scrapy genspider bmw5 autohome.com.cn

Method 1: without using ImagesPipeline

bmw5.py:

import scrapy
from bmw.items import BmwItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        # skip the first uibox, which is not an image category
        uiboxs = response.xpath('//div[@class="uibox"]')[1:]
        for uibox in uiboxs:
            category = uibox.xpath('.//div[@class="uibox-title"]/a/text()').get()
            urls = uibox.xpath('.//ul/li/a/img/@src').getall()
            # the src values are protocol-relative, so join them with the page URL
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = BmwItem(category=category, urls=urls)
            yield item
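The src attributes on the thumbnails are protocol-relative (they start with //), which is why parse runs each one through response.urljoin. The same joining can be reproduced with the standard library; the src value below is illustrative:

```python
from urllib.parse import urljoin

page_url = 'https://car.autohome.com.cn/pic/series/65.html'
src = '//car2.autoimg.cn/cardfs/product/example.jpg'  # hypothetical src value
# a network-path reference inherits the page's https scheme
full_url = urljoin(page_url, src)
print(full_url)  # https://car2.autoimg.cn/cardfs/product/example.jpg
```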

items.py:

import scrapy


class BmwItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    category = scrapy.Field()
    urls = scrapy.Field()

Relevant settings in settings.py:

ITEM_PIPELINES = {
    'bmw.pipelines.BmwPipeline': 300,
}

pipelines.py:

import os
from urllib import request


class BmwPipeline(object):
    def __init__(self):
        self.path = os.path.join(os.path.dirname(__file__), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['urls']
        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            # name the file by the segment after the last underscore in the URL
            image_name = url.split('_')[-1]
            request.urlretrieve(url, os.path.join(category_path, image_name))
        return item
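The pipeline derives each filename from the part of the URL after the last underscore. A quick check of that split with a made-up URL:

```python
# hypothetical image URL, shaped like the site's double-underscore names
url = 'https://car2.autoimg.cn/cardfs/product/autohomecar__example123.jpg'
image_name = url.split('_')[-1]  # everything after the last underscore
print(image_name)  # example123.jpg
```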

Method 2: saving images with ImagesPipeline

Steps:

1. Define an Item with two fields: image_urls and images.
image_urls stores the URLs of the images to download; it must be given a list.

2. When a download finishes, information about the downloaded file (storage path, download URL, image checksum, etc.) is stored in the item's images field.
3. In settings.py, set IMAGES_STORE to the directory where downloaded images are saved.
Also in settings.py, set IMAGES_URLS_FIELD to the name of the item field that holds the image URLs
(note: this is essential, otherwise the image folder stays empty).
4. Enable the pipeline: add scrapy.pipelines.images.ImagesPipeline: 1 to ITEM_PIPELINES.
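For reference, each entry that ImagesPipeline writes into the item's images field (step 2 above) is a dict holding the download URL, the storage path relative to IMAGES_STORE, and a checksum. The values below are illustrative, not real output:

```python
# illustrative shape of item['images'] after a successful download
downloaded = [
    {
        'url': 'https://car2.autoimg.cn/cardfs/product/example.jpg',  # original URL
        'path': 'full/0bcd3a4f...jpg',  # path relative to IMAGES_STORE
        'checksum': 'd41d8cd98f00b204e9800998ecf8427e',  # checksum of the file body
    },
]
image_paths = [entry['path'] for entry in downloaded]
print(image_paths)
```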

Rewritten pipelines.py:

import os
from scrapy.pipelines.images import ImagesPipeline
from bmw import settings


class BMWImagesPipeline(ImagesPipeline):  # inherit from ImagesPipeline
    # called before the download requests are sent; it is what issues them
    def get_media_requests(self, item, info):
        request_objects = super(BMWImagesPipeline, self).get_media_requests(item, info)
        for request_object in request_objects:
            # attach the item to each request so file_path can read its category
            request_object.item = item
        return request_objects

    # called when an image is about to be stored, to decide its storage path
    def file_path(self, request, response=None, info=None):
        path = super(BMWImagesPipeline, self).file_path(request, response, info)
        category = request.item.get('category')
        images_store = settings.IMAGES_STORE  # fetch IMAGES_STORE
        category_path = os.path.join(images_store, category)
        if not os.path.exists(category_path):  # create the category folder if missing
            os.mkdir(category_path)
        image_name = path.replace('full/', '')  # drop the default full/ subfolder
        image_path = os.path.join(category_path, image_name)
        return image_path
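The replace('full/', '') call works because ImagesPipeline's default file_path stores every image as full/&lt;SHA1 of the request URL&gt;.jpg. A sketch of that naming scheme, assuming the default behavior and a hypothetical URL:

```python
import hashlib

url = 'https://car2.autoimg.cn/cardfs/product/example.jpg'  # hypothetical image URL
# the default path is 'full/' plus the SHA1 hex digest of the URL plus '.jpg'
default_path = 'full/%s.jpg' % hashlib.sha1(url.encode('utf-8')).hexdigest()
image_name = default_path.replace('full/', '')  # strip the default subfolder
print(image_name)  # a 40-character hex digest followed by '.jpg'
```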

Rewritten settings.py:

import os

IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'imgs')
IMAGES_URLS_FIELD = 'urls'

ITEM_PIPELINES = {
    'bmw.pipelines.BMWImagesPipeline': 1,
}

To run Scrapy from PyCharm, create a start.py in the project folder:

from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'bmw5'])


Original post: https://www.cnblogs.com/min-R/p/10545408.html
