A Simple Python Web Crawler

Posted: 2020-01-18 00:45:19

Tags: lxml color main rom ie 6 page header res with

A crawler for the mzitu image site, found online. It downloads only a single person's album.

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://www.mzitu.com'
}
# Starting URL: the first page of the album
start_url = 'https://www.mzitu.com/161470'
start_html = requests.get(start_url, headers=headers)    # returns a Response object
# print(start_html.text)    # .text is the decoded text of the body; for binary data such as images, use .content

soup = BeautifulSoup(start_html.content, 'lxml')

# The second-to-last <span> in the pagination bar holds the number of the last page
max_span = soup.find('div', class_='pagenavi').find_all('span')[-2].get_text()

for page in range(1, int(max_span) + 1):
    page_url = start_url + '/' + str(page)    # append the page number to the starting URL to get that page's address
    image_page = requests.get(page_url, headers=headers)
    # print(image_page.text)
    image_soup = BeautifulSoup(image_page.content, 'lxml')
    image_url = image_soup.find('div', class_='main-image').find('img')['src']   # value of the img tag's src attribute, e.g. <img src='lslsls'> returns lslsls
    name = str(image_url)      # make sure the URL is a str
    # print(name)
    img = requests.get(name, headers=headers)
    fpath = 'C:\\Users\\wztshine\\Desktop\\新建文件夹\\' + name[-7:]    # slice name: the last seven characters become the file name
    with open(fpath, 'wb') as f:
        print('output:', fpath)
        f.write(img.content)
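
The script above names each file with the last seven characters of the image URL (name[-7:]). That only works while every file name on the site is exactly seven characters long, and it breaks if the URL carries a query string. As a possible alternative, the sketch below derives the file name from the URL path using only the standard library. The helper names filename_from_url and target_path, the example.com URL, and the downloads directory are all hypothetical and do not come from the original post.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(image_url: str) -> str:
    """Return the last path segment of a URL, ignoring query string and fragment."""
    path = unquote(urlparse(image_url).path)
    name = PurePosixPath(path).name
    if not name:
        raise ValueError(f"no file name in URL: {image_url!r}")
    return name


def target_path(image_url: str, out_dir: Path) -> Path:
    """Join the output directory and the file name taken from the URL."""
    return out_dir / filename_from_url(image_url)


if __name__ == "__main__":
    # Hypothetical URL, used only to illustrate the behaviour.
    url = "https://example.com/2020/01/18a01.jpg?size=large"
    print(filename_from_url(url))
    print(target_path(url, Path("downloads")))
```

In the loop, fpath could then be built as target_path(image_url, some_directory). Using pathlib here also avoids hard-coding a Windows path with escaped backslashes.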

Original post: https://www.cnblogs.com/wztshine/p/12207785.html
