码迷,mamicode.com

A Simple Python Web Crawler

Date: 2020-01-18 00:45:19

A crawler for the mzitu.com image site, based on one found online. It downloads only a single photo album.

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://www.mzitu.com',
}
# Starting URL (first page of the album)
start_url = 'https://www.mzitu.com/161470'
start_html = requests.get(start_url, headers=headers)    # returns a Response object
# print(start_html.text)                                 # .text is the decoded string; for binary data such as images, use .content

soup = BeautifulSoup(start_html.content, 'lxml')

# The script takes the second-to-last <span> in the pagination bar as the last page number
max_span = soup.find('div', class_='pagenavi').find_all('span')[-2].get_text()

for page in range(1, int(max_span) + 1):
    page_url = start_url + '/' + str(page)    # append the page number to the starting URL to get that page's address
    image_page = requests.get(page_url, headers=headers)
    # print(image_page.text)
    image_soup = BeautifulSoup(image_page.content, 'lxml')
    image_url = image_soup.find('div', class_='main-image').find('img')['src']   # value of the img tag's src attribute, e.g. <img src='lslsls'> gives lslsls
    name = str(image_url)      # don't forget to convert it to str
    # print(name)
    img = requests.get(name, headers=headers)
    fpath = 'C:\\Users\\wztshine\\Desktop\\新建文件夹\\' + name[-7:]    # slice name: its last seven characters become the file name
    with open(fpath, 'wb') as f:
        print('output:', fpath)
        f.write(img.content)
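
The slice name[-7:] only gives a clean file name when every image name is exactly seven characters long. As a rough sketch of one alternative (not part of the original script), the file name can be taken from the last path component of the URL instead. The helper names filename_from_url, build_page_urls and target_path, as well as the example.com URL and the "downloads" folder, are hypothetical and used only for illustration.

```python
import posixpath
from pathlib import Path
from urllib.parse import urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path component of a URL, ignoring any query or fragment."""
    return posixpath.basename(urlsplit(url).path)


def build_page_urls(start_url: str, max_page: int) -> list:
    """Build per-page URLs the same way the script does: start_url + '/' + page."""
    return [f"{start_url}/{page}" for page in range(1, max_page + 1)]


def target_path(save_dir: str, url: str) -> Path:
    """Join a save directory (hypothetical) with the file name taken from the URL."""
    return Path(save_dir) / filename_from_url(url)


if __name__ == "__main__":
    # Illustrative URL only; real image URLs come from the <img src=...> attribute.
    example = "https://example.com/2020/01/18a01.jpg?v=1"
    print(filename_from_url(example))  # -> 18a01.jpg
    print(build_page_urls("https://www.mzitu.com/161470", 2))
    print(target_path("downloads", example))
```

Using pathlib to join the directory and file name also avoids hand-written backslash escapes in the Windows path.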

Original post: https://www.cnblogs.com/wztshine/p/12207785.html
