
Scraping images with Python

Posted: 2017-05-11 00:16:03 · Views: 240

Tags: mkdir, head, tool, swp, parse, url, http, tieba, res

 

The scraper uses the BeautifulSoup (bs4) library. When fetching the HTML it sends a custom User-Agent header, and it allows up to 2 download attempts: if a request fails with an HTTP status code in the 500-599 range (a server-side error), it retries once.
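The script below is written for Python 2 (urllib2). For Python 3, the same download-with-retry logic can be sketched with urllib.request; this is a minimal sketch, and the 'wswp' default User-Agent string is simply carried over from the original script:

```python
import urllib.request
import urllib.error


def download(url, user_agent='wswp', num_try=2):
    """Fetch url; retry up to num_try more times on 5xx server errors."""
    print('Downloading:', url)
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    try:
        html = urllib.request.urlopen(request, timeout=10).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        html = None
        # HTTPError carries a .code; a plain URLError (e.g. DNS failure) does not,
        # so retry only when the server answered with a 5xx status.
        if num_try > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_try - 1)
    return html
```

On an unreachable host the function prints the error and returns None instead of raising, mirroring the original script's behavior.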

If the page is ordinary HTML, parse it with the built-in parser:
soup = BeautifulSoup(html, "html.parser")
If the page is HTML5 (or badly broken markup that needs browser-grade repair), use html5lib:
soup = BeautifulSoup(html, "html5lib")
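A quick illustration of the two parser choices (assumes bs4 is installed; html5lib additionally requires `pip install html5lib`):

```python
from bs4 import BeautifulSoup

broken = "<ul><li>one<li>two</ul>"  # unclosed <li> tags

# html.parser: ships with Python, fast, reasonable error recovery
soup = BeautifulSoup(broken, "html.parser")
print(len(soup.find_all("li")))  # both <li> tags are recovered

# html5lib: slower, but repairs markup exactly the way a browser would
# soup = BeautifulSoup(broken, "html5lib")
```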

#-*- coding:utf-8 -*-
# Python 2 script
import os
import urllib2
from bs4 import BeautifulSoup


def download(url, user_agent='wswp', num_try=2):
    print 'Downloading:', url
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_try > 0:
            # retry only on 5xx server errors
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num_try - 1)
    return html


def download_picture(url, path, name):
    # create the target directory if it does not exist yet
    if not os.path.isdir(path):
        os.mkdir(path)
    f = open(path + '/' + name + '.jpg', 'wb')
    f.write(download(url))
    f.close()


def bs_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    # tieba post images carry the class BDE_Image
    results = soup.find_all(name='img', attrs={'class': 'BDE_Image'})
    tt = 0
    for each in results:
        src = each.get('src')
        print src
        download_picture(src, './picture', str(tt))
        tt = tt + 1


url = 'https://tieba.baidu.com/p/4693368072'
html = download(url)
bs_scraper(html)
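The image-selection logic can be checked offline against a tiny sample. The BDE_Image class comes from the scraper above; the live tieba markup may have changed since 2017, and the sample below is made up for illustration:

```python
from bs4 import BeautifulSoup

sample = '''<div>
<img class="BDE_Image" src="http://example.com/a.jpg">
<img class="avatar" src="http://example.com/b.jpg">
</div>'''

soup = BeautifulSoup(sample, "html.parser")
srcs = [img.get("src") for img in soup.find_all("img", attrs={"class": "BDE_Image"})]
print(srcs)  # only the post image matches, not the avatar
```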

 


Original article: http://www.cnblogs.com/chenyang920/p/6838804.html
