
Crawling and Downloading Images with a Scraper

Date: 2020-04-21


import requests
import re
import os
import time

# Fetch the index page
web_page = 'https://www.vmgirls.com/'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/78.0.3904.108 Safari/537.36'
}
urls_response = requests.get(web_page, headers=headers)
urls_html = urls_response.text

# Parse the index page for links to the individual article pages
all_urls = re.findall(r'https://.*?/\d+\.html', urls_html)
urls = list(set(all_urls))  # deduplicate
# print(urls)

# Download the images from each article page
num_list = []
for url in urls:
    url_response = requests.get(url, headers=headers)
    html = url_response.text
    # The article title doubles as the download directory name
    dir_name = re.findall('<h1 class="post-title h3">(.*?)</h1>', html)[-1]
    # Escape the dot so the pattern matches ".jpeg" literally
    wget_urls = re.findall(r'https:.*?\.jpeg', html)
    print("\033[32;1m%s has %s pictures\033[0m" % (dir_name, len(wget_urls)))
    num_list.append(len(wget_urls))
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)
    for wget_url in wget_urls:
        time.sleep(1)  # throttle requests to be polite to the server
        file_name = wget_url.split('/')[-1]
        print(file_name)
        response = requests.get(wget_url, headers=headers)
        with open(dir_name + '/' + file_name, 'wb') as f:
            f.write(response.content)

total = sum(num_list)
print("\033[31;1mThere are %s pictures that need to be crawled\033[0m" % total)
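The script above works, but every network call is unguarded: one timeout or 404 aborts the whole run, and the os.path.exists/os.mkdir pair has a race if two runs overlap. Below is a minimal hardened sketch of the download step, not part of the original post; the helper name download_image, the shared requests.Session, and the retry/timeout values are illustrative assumptions.

import os
import time
import requests

def download_image(session, url, dest_dir, headers, retries=3, timeout=10):
    """Download one image, retrying on transient network errors.

    Sketch only: the retry count and backoff are assumed defaults.
    """
    os.makedirs(dest_dir, exist_ok=True)  # no exists/mkdir race
    file_name = url.split('/')[-1]
    path = os.path.join(dest_dir, file_name)
    for attempt in range(retries):
        try:
            response = session.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()  # fail loudly on 4xx/5xx
            with open(path, 'wb') as f:
                f.write(response.content)
            return path
        except requests.RequestException:
            time.sleep(2 ** attempt)  # simple exponential backoff
    return None

Called in place of the bare requests.get inside the inner loop, with session = requests.Session() created once before the loop, this also reuses TCP connections instead of opening a new one per image.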



Original source: https://blog.51cto.com/12629984/2488712
