爬虫爬取图片并下载

时间：2020-04-21 00:00:28 阅读：75 评论：0 收藏：0 [点我收藏+]


import requests
import re
import os
import time

"""获取主网页"""
web_page = ‘https://www.vmgirls.com/‘
headers = {
    ‘user-agent‘: ‘Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36‘
}
urls_response = requests.get(web_page,headers=headers)
urls_html = urls_response.text

"""解析主主网页获取下一层网页"""
all_urls = re.findall(‘https://.*?/\d*.html‘,urls_html)
urls = list(set(all_urls))
# print(urls)

"""下载下一页的网页图片"""
num_list = []
for url in urls:
    url_resopnse = requests.get(url,headers=headers)
    html=url_resopnse.text
    dir_name = re.findall(‘<h1 class="post-title h3">(.*?)</h1>‘,html)[-1]
    wget_urls = re.findall(‘https:.*?.jpeg‘,html)
    print("\033[32;1m %s upload %s pictures\033[0m" %(dir_name,len(wget_urls)))
    num = len(wget_urls)
    num_list.append(num)

    for wget_url in wget_urls:
        time.sleep(1)
        file_name = wget_url.split(‘/‘)[-1]
        print(file_name)
        dir_name = re.findall(‘<h1 class="post-title h3">(.*?)</h1>‘,html)[-1]
        if not os.path.exists(dir_name):
            os.mkdir(dir_name)
        response = requests.get(wget_url,headers=headers)
        with open(dir_name + ‘/‘ + file_name,‘wb‘) as f:
            f.write(response.content)

sum = 0
for i in num_list:
    sum += i
print("\033[31;1mThere are %s pictures that need to be crawled\033[0m" % (sum))

爬虫爬取图片并下载

标签：int tps url 爬虫 head pre htm not www

原文地址：https://blog.51cto.com/12629984/2488712

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行