Scraping Images from 半次元 (bcy.net) with Python [Part 1]

Posted: 2017-09-17 17:31:43

Tags: char, until, open, enter, function modules, ext, login, note

Modules used: requests, BeautifulSoup4, lxml (BeautifulSoup uses it as its parser; reportedly it is much faster), and re (regular expressions; P.S. only the compile function is used).

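The approach:

Create an Img folder, parse the page title out of the HTML and use it as the name of a subfolder (created inside Img), and inspect the page with the Firefox extension Firebug (this part is done by hand, not in code).

Next, the functions used.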

re:

re.compile("%s" % (pattern))    just fill in the text to match
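
As a minimal sketch of how a compiled pattern is used: the pattern text is borrowed from the class name that appears later in this post, and the sample strings are made up for illustration.

```python
import re

# Compile the pattern once, then reuse it for several searches.
pattern = re.compile("%s" % "fz16")

# Hypothetical class strings, for illustration only.
samples = ["fz16 l-left mr5 blue1", "fz14 l-right"]

# search() returns a match object when the pattern occurs anywhere in the string.
matched = [s for s in samples if pattern.search(s)]
print(matched)
```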

BeautifulSoup:

BeautifulSoup()

find_all("a", attrs={" ": re.compile("")})    fill in the attribute name and the pattern to match, for example soup.find_all("a", attrs={"class": re.compile("fz16")})
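
A small sketch of find_all with a regular expression on the class attribute. The markup is hypothetical, and "html.parser" is used here only so the example runs without lxml installed; "lxml" can be passed instead.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
html = """
<a class="fz16 l-left mr5 blue1" href="/coser/detail/1/2">first</a>
<a class="fz14" href="/other">second</a>
<a class="fz16 red" href="/coser/detail/3/4">third</a>
"""

# "html.parser" ships with Python; "lxml" works too if it is installed.
soup = BeautifulSoup(html, "html.parser")

# Keep only the <a> tags whose class attribute matches the regular expression.
links = soup.find_all("a", attrs={"class": re.compile("fz16")})
hrefs = [a["href"] for a in links]
print(hrefs)

# find() takes the same arguments but returns only the first match (or None).
first = soup.find("a", attrs={"class": re.compile("fz16")})
print(first["href"])
```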

os:

os.path.exists("")    pass a directory or a file path

os.makedirs("")    pass a directory path
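
The two calls are usually combined into a "create it if it is missing" helper. The sketch below assumes a hypothetical helper name, ensure_dir, and uses a temporary directory in place of the real working directory.

```python
import os
import tempfile


def ensure_dir(path):
    """Create the directory (and any missing parents) if it is not there yet."""
    if not os.path.exists(path):
        os.makedirs(path)
    return path


# A temporary directory stands in for the working directory in this sketch.
base = tempfile.mkdtemp()
target = ensure_dir(os.path.join(base, "Img", "some_title"))  # hypothetical title
print(os.path.isdir(target))
```

On Python 3.2 and later, os.makedirs(path, exist_ok=True) does the same thing in one call.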

requests:

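requests.get(url)    works for both https and http. I could not get the built-in urllib to fetch https pages; if any expert reads this, please enlighten me. The pile of answers I found on Baidu did not help.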
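
For what it is worth, urllib.request can normally fetch https URLs as long as Python was built with SSL support. One common reason a request still fails is that the site rejects urllib's default User-Agent; whether that was the cause here is only a guess. A minimal sketch under that assumption (build_request and fetch are hypothetical helper names, and the User-Agent value is a placeholder):

```python
from urllib import request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a Request that carries a browser-like User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch the URL (http or https) and return the raw response bytes."""
    with request.urlopen(build_request(url), timeout=10) as resp:
        return resp.read()


# Building the request does not touch the network; calling fetch(...) does.
req = build_request("https://bcy.net/")
print(req.full_url, req.get_header("User-agent"))
```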

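request from urllib:    note that this is request, not requests; don't mix them up

request.urlretrieve(url, filename, ..)    takes up to three arguments here: the first is the target link, the second is where to store the file, including the file name (the directory name has to be sanitized first), and the third is a progress callback. Look up urlretrieve for the details.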
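
A sketch of urlretrieve with its third argument, the progress callback (reporthook). To keep the example runnable offline, a local file served through a file:// URL stands in for a remote image; show_progress is a hypothetical callback name.

```python
import os
import tempfile
from pathlib import Path
from urllib import request


def show_progress(block_num, block_size, total_size):
    """reporthook: called once before the first block and once after each block."""
    if total_size > 0:
        done = min(block_num * block_size, total_size)
        print("%d / %d bytes" % (done, total_size))


# A local file served through a file:// URL stands in for a remote image here.
work_dir = tempfile.mkdtemp()
source = Path(work_dir) / "source.jpg"
source.write_bytes(b"not really a jpeg" * 100)

target = os.path.join(work_dir, "1.jpg")
saved_path, headers = request.urlretrieve(source.as_uri(), target, show_progress)
print(saved_path)
```

For a real download, the first argument would be the image link taken from the page.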

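I originally wanted to grab every coser straight from the home page and then download through the sub-links, but the target site renders its pages dynamically. People say that needs WebKit, so I did not look into it. A programmer should take responsibility for their code and their users, but I have school tomorrow and really cannot grind any longer.

Below is my code. It has plenty of shortcomings and the initialization is not written well. I have been grinding at this for more than a day and cannot go on, so I will finish this blog post and go play games for a while.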

Getting the href attribute of a link:

hrefs = soup.find_all("a",attrs = {"class":re.compile("fz16 l-left mr5 blue1")})

href = hrefs[0]["href"]

find returns only the first match.

After that, build the full link by string concatenation and pass it to urlretrieve.
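
Plain string concatenation works when every href is site-relative. As an alternative sketch, the standard library's urllib.parse.urljoin also copes with hrefs that are already absolute. The base URL and hrefs below are illustrative values.

```python
from urllib.parse import urljoin

base = "https://bcy.net/coser/"

# A site-relative href is resolved against the scheme and host of the base.
full = urljoin(base, "/coser/detail/13612/338282")
print(full)

# An href that is already absolute is returned unchanged.
same = urljoin(base, "https://bcy.net/coser/detail/13612/338282")
print(same)
```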

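Let me stress once more: be sure to sanitize the file name. Otherwise you will end up like me; this should have been working last night, and I only got it done today.

Title = Title.replace(" ","")    replaces the spaces

Title = Title.strip()    removes whitespace, including newlines, from both ends (Title.rstrip() handles only the right-hand end). P.S. When I used it, whether Windows or PyCharm was to blame, I could never get rid of the tab character, so I ended up slicing the string instead. P.P.S. Adjust this to your own situation.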
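
As an alternative to slicing, here is a hedged sketch that strips every kind of whitespace with a regular expression and also replaces characters Windows does not allow in file names. clean_title is a hypothetical helper, not the function used in the demo further down.

```python
import re


def clean_title(title):
    """Turn a page title into a string that is safe to use as a folder name."""
    # Drop every whitespace character: spaces, tabs, newlines, full-width spaces.
    title = re.sub(r"\s+", "", title)
    # Replace characters that Windows does not allow in file names.
    title = re.sub(r'[\\/:*?"<>|]', "_", title)
    # Fall back to a fixed name if nothing is left.
    return title or "untitled"


print(clean_title("\t\n  My Title: part 1 \r\n"))
```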

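Getting the text of the title looks like this:

Title = soup.find("h1", attrs={"class": "js-post-title"}).text

Here Title is the page title, which still needs cleaning up. Below is my demo. The initialization is not done well; I will improve it next weekend and post a Baidu Pan link to the files.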

# Image fetcher for the coser site; limited to \u\..


from bs4 import BeautifulSoup
import requests
from urllib import request
import re
import os
from random import randint


def Make_file():
    # Create the error log on first run.
    if not os.path.exists("Daily_information.txt"):
        with open("Daily_information.txt", "w") as f:
            f.write("GET\n")


def Check_File():
    # Make sure the Img folder exists.
    if not os.path.isdir("Img"):
        os.makedirs("Img")


def Url_Write(url):                                                         # URL log
    if not os.path.exists("Url_text.txt"):
        with open("Url_text.txt", "w") as f:
            f.write("\n%s\n" % url)
    else:
        with open("Url_text.txt", "a") as f:
            f.write("%s\n" % url)


def Url_geting(url='http://www.baidu.com', pat=None):               # fetch the page and return a BeautifulSoup object (None on failure)
    html = None
    try:
        buf = requests.get(url=url, params=pat)               # fetch the page; pat holds dummy query parameters
        html = BeautifulSoup(buf.text, "lxml")                # parse with BeautifulSoup
    except Exception as e:                                    # log the error instead of crashing
        with open("Daily_information.txt", "a") as f:
            f.write("%s:%s\n" % (url, e))
    return html

def Title_Get(html):
    Big_Title = html.find_all("h1",attrs = {"class":re.compile("js-post-title")})
    Title = Big_Title[0].text
    return Title

def Title_file_create(Title):                                                               # create the folder named after Title
    True_way = "%s"%Title
    os.makedirs(True_way)


def Title_Dispose(Title):                       # clean up Title
    Title = Title[1:]
    Title = Title.split(":")
    Title = Title[-1]
    Title = Title.replace(" ","")
    return Title


def Img_Link_get(html):                                 # find the image links (probably only works on 半次元)
    Img_link = []
    Img_Face = html.find_all("img", attrs={"class": re.compile("detail_std")})
    for i in Img_Face:
        Img_link.append(i["src"])
    return Img_link

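# Given the login user and the link, fetch the html, parse it for the src attribute of each img, then get the Title and clean it for use as the folder name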

def Get_information(url = "https://bcy.net/coser/detail/13612/338282",pat = {"Test": "@1"}):
    html = Url_geting(url, pat=pat)
    The_link = Img_Link_get(html=html)
    Title = Title_Get(html)
    Title = Title_Dispose(Title)
    attrs = [url,html,The_link,Title]
    return attrs

def Get_Download(Img_links,path):                           # note to self: validate arguments next time; this bug came from unhandled spaces in the Title that was passed in
    if not os.path.exists(path):
        os.makedirs(path)
    for step, link in enumerate(Img_links, 1):
        request.urlretrieve(link, os.path.join(path, "%d.jpg" % step))


pat = [{"门前大桥下":"游过一只鸭"},{"我爱北京天安门":"天安门上太阳升"},{"爱像":"一阵风"},{"吹完他就走":"~~"},{"辣妹儿":"法克儿"}]



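if __name__ == "__main__":
    Check_File()
    Make_file()
    print("Only 半次元 coser page links are supported; type q and press Enter to quit")
    print("Enter a link:")
    while True:
        url = input().strip()
        if url.lower() == "q":                      # quit, as the prompt promises
            break
        print("Downloading....")
        params = pat[randint(0, len(pat) - 1)]      # pick one dummy parameter set without overwriting the list
        attrs = Get_information(url, params)
        path = os.path.join("Img", attrs[3])
        Get_Download(attrs[2], path=path)
        Url_Write(url=attrs[0])
        print("Done... enter another link to continue, or q + Enter to quit")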

The girls are lovely; shame that I have still been single all these years.

Life goes on, until we die.

Original article: http://www.cnblogs.com/the-moon-so-beautiful/p/7536096.html