Crawling Beauty Photos with Python and Storing Them in Multi-Level Directories

Date: 2017-12-23 12:09:51


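Recently I needed to download the images from the site https://mm.meiji2.com/.

So I did a little research into web crawlers.

I'm writing up the results here, partly as a record for myself and partly to give later readers some direction.

Here is the crawl result (screenshot):

The whole study took 2-3 days: reading through the material, plus the other bits of knowledge you inevitably look up along the way.

Read in order, the material should give you a basic understanding of crawling techniques.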

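The rough steps:

Analyze the page and write the crawl rules

Download the images, following pagination if there is any

Crawl multiple pages and save the results to local disk in separate directories, with multi-level storage

Deal with anti-crawler measures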

Those are some of the materials I read while learning.
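For the last step, dealing with anti-crawler measures, a common starting point is simply to crawl more politely. The fragment below is a minimal sketch of a Scrapy settings.py. The setting names are standard Scrapy settings, but every value, including the User-Agent string, is an assumption chosen for illustration and is not taken from this project.

```python
# settings.py (sketch): polite-crawling options. All values are illustrative assumptions.

# Send a browser-like User-Agent instead of Scrapy's default one.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'

# Wait about one second between requests to the same site,
# and randomize the wait so the traffic looks less mechanical.
DOWNLOAD_DELAY = 1.0
RANDOMIZE_DOWNLOAD_DELAY = True

# Limit how many requests hit the same domain at once.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy slow down automatically when the server responds slowly.
AUTOTHROTTLE_ENABLED = True

# Do not keep cookies between requests, which makes sessions harder to track.
COOKIES_ENABLED = False
```

Whether these values are enough depends on the site. Slower, fewer requests are usually the first thing to try before anything more elaborate.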

Below is a spider I wrote myself. The crawl uses three levels of directories, and the last level is also paginated.

import scrapy

from znns.items import ZnnsItem


class NvshenSpider(scrapy.Spider):
    name = 'znns'
    allowed_domains = ['meiji2.com']  # keep the crawl on this site and its subdomains
    start_urls = ['https://mm.meiji2.com/']
    base_url = 'https://mm.meiji2.com/'

    # Not sent automatically; pass headers=self.headers to scrapy.Request to use them.
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    }

    # Loop over the ranking pages
    def parse(self, response):
        exp = u'//div[@class="pagesYY"]//a[text()="下一页"]/@href'  # "下一页" is the "next page" link
        _next = response.xpath(exp).extract_first()
        if _next:  # the last ranking page has no next link
            yield scrapy.Request(response.urljoin(_next), callback=self.parse)

        # Each link points to one model's profile page
        for p in response.xpath('//li[@class="rankli"]//div[@class="rankli_imgdiv"]//a/@href').extract():
            item_page = self.base_url + p.lstrip('/') + "album/"  # page listing all of her albums
            yield scrapy.Request(item_page, callback=self.parse_item)

    # Album list page of a single model
    def parse_item(self, response):
        # The model's name, used as the first-level folder
        name = response.xpath(
            '//div[@id="post"]//div[@id="map"]//div[@class="browse"]/a[2]/@title'
        ).extract_first(default='').strip()
        exp = '//li[@class="igalleryli"]//div[@class="igalleryli_div"]//a/@href'
        for p in response.xpath(exp).extract():  # iterate over all of her albums
            item_page = self.base_url + p.lstrip('/')  # gallery page of one album
            yield scrapy.Request(item_page, meta={'name': name}, callback=self.parse_item_details)

    # Gallery page: collect the image links
    def parse_item_details(self, response):
        item = ZnnsItem()  # a fresh item per page, so concurrent callbacks never share state
        item['name'] = response.meta['name']
        item['image_urls'] = response.xpath('//ul[@id="hgallery"]//img/@src').extract()  # image links
        item['albumname'] = response.xpath('//h1[@id="htilte"]/text()').extract_first(default='').strip()  # second-level folder
        yield item

        new_url = response.xpath(u'//div[@id="pages"]//a[text()="下一页"]/@href').extract_first()  # next page
        if new_url:  # skip when there is no next link
            new_url = self.base_url + new_url.lstrip('/')
            # Assumed continuation (the listing is cut off at this point): follow the next page with the same callback.
            yield scrapy.Request(new_url, meta={'name': response.meta['name']}, callback=self.parse_item_details)
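The spider only fills in name, albumname, and image_urls; the pipeline that turns those fields into folders is not shown in this post. One possible approach, sketched here under assumptions, is a small helper that builds the relative path name/albumname/filename, which a custom Scrapy ImagesPipeline subclass could return from its file_path method. The helper names clean_segment and build_image_path and the example URL are hypothetical and are not part of the original project.

```python
import posixpath
import re
from urllib.parse import urlparse


def clean_segment(text):
    """Make a scraped title safe to use as a single directory name."""
    # Replace characters that are not allowed in Windows or Unix file names.
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', text).strip()
    # Fall back to a fixed name so the path never contains an empty segment.
    return cleaned or 'unknown'


def build_image_path(name, albumname, image_url):
    """Return '<name>/<albumname>/<file name taken from the URL>'."""
    filename = posixpath.basename(urlparse(image_url).path)
    return posixpath.join(clean_segment(name), clean_segment(albumname), filename)


if __name__ == '__main__':
    # Hypothetical values; the URL is a placeholder, not a real image.
    print(build_image_path('Alice', 'Album: 2017/12',
                           'https://example.com/gallery/123/001.jpg'))
```

Keeping the path logic in a plain function makes it easy to check without running a crawl. The returned path is relative, so it would be joined to whatever storage root the pipeline is configured with.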


Original post: http://www.cnblogs.com/darklx/p/8092401.html
