Scraping Images from a Basic Web Page with Python

Date: 2018-04-08 10:49:11

Summary of basic Python web scraping

1. How scraping works

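A scraper behaves much like a browser client: it sends a request to the website's server, usually identified by a URL (the web address). The server then responds with an HTML page, or sometimes with other kinds of data such as JSON or images; this is the page content. Our job is to parse that response, pick out the parts we want, and write them to local disk in the required form.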

2. Basic scraper workflow

1. Fetch the page response

Two approaches are commonly used:

html = requests.get(url)
return html.text

or

html = urllib.request.urlopen(url)
return html.read()

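The first approach, requests.get, returns a Response object that holds everything the server sent back, including the response headers and the status code. Printing html directly shows only <Response [200]>. To get at the body there are two attributes: content returns the raw bytes, and text returns a Unicode string decoded with the encoding that requests infers for the response. For a beginner-level scraper like this one either works, although text can come out garbled when the inferred encoding is wrong (I am still working out the details of encoding myself), so here we simply return .text.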
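The difference between content and text comes down to bytes versus decoded text. The sketch below uses only built-in types and makes no request; the sample string and the two codecs are assumptions chosen for illustration. It mimics what happens when a response body is decoded with the right or the wrong encoding.

```python
# Illustrative only: `raw` stands in for the bytes a server might send
# (what requests exposes as `.content`); no request is actually made.
raw = "站酷 ZCOOL".encode("utf-8")

# Decoding with the matching codec recovers the original text,
# which is what `.text` gives when the encoding is inferred correctly.
good = raw.decode("utf-8")

# Decoding with a wrong codec does not raise here, but yields mojibake.
bad = raw.decode("latin-1")

print(type(raw).__name__, len(raw))   # bytes object, length counted in bytes
print(good)
print(bad == good)
```

With requests itself, assigning the correct codec to the response's encoding attribute before reading text is the usual way to override a wrong guess.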
For the second approach, to quote a description found online:

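urlopen opens a URL; the url argument can be either a URL string or a Request object, and the call returns an http.client.HTTPResponse object. That object offers, roughly, the methods read(), readinto(), getheader(), getheaders() and fileno(), plus the attributes msg, version, status, reason, debuglevel and closed. In practice read() usually needs to be followed by decode(). One big advantage here is that the returned page content has not been decoded yet, so after read() you can choose the matching codec through the argument passed to decode().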
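The read-then-decode pattern can be tried without a network connection. This is a minimal sketch under stated assumptions: the sample page is hypothetical, and it is packed into a data: URL, for which urlopen returns a file-like response rather than an http.client.HTTPResponse. The two steps shown (read() for bytes, then an explicit decode()) are the same ones used for an http URL.

```python
import base64
import urllib.request

# Hypothetical sample page, packed into a data: URL so no network is needed.
page = "<html><body><p>你好, scraper</p></body></html>"
encoded = base64.b64encode(page.encode("utf-8")).decode("ascii")
url = "data:text/html;charset=utf-8;base64," + encoded

with urllib.request.urlopen(url) as resp:
    raw = resp.read()                              # undecoded bytes
    charset = resp.headers.get_content_charset()   # charset declared in Content-Type, if any

# Decode explicitly, falling back to UTF-8 when no charset is declared.
text = raw.decode(charset or "utf-8")

print(type(raw).__name__, charset)
print(text)
```

Reading the charset from the response headers, when one is declared, avoids hard-coding a codec that may not match the page.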

2. Parse the page content

Regular expressions are a good option, but I am not very fluent with them. Fortunately a powerful third-party library, BeautifulSoup, helps a great deal.

soup = BeautifulSoup(html, 'html.parser')
urls = soup.find_all('div', attrs={'class': 'bets-name'})
print(urls[0])

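BeautifulSoup offers many methods. First create a soup instance, here with the built-in html.parser parser (lxml and others also work). Then pass arguments that describe the target tag in order to locate it. Note that find_all returns a list-like ResultSet of Tag objects rather than a single tag, so the result has to be indexed or iterated.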
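To make the return type of find_all concrete, here is a small sketch that parses an inline string instead of a downloaded page. The HTML fragment is hypothetical; only the class name is borrowed from the snippet above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; the class name mirrors the snippet above.
html = """
<div class="bets-name"><a href="/a">First</a></div>
<div class="other">Ignored</div>
<div class="bets-name"><a href="/b">Second</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all("div", attrs={"class": "bets-name"})

# find_all returns a list-like ResultSet, even for zero or one match.
print(len(matches))
for div in matches:
    link = div.find("a")              # find returns a single Tag, or None
    print(link.get_text(), link["href"])

# Collect the visible text of every matching div.
names = [div.get_text(strip=True) for div in matches]
```

Because find returns None when nothing matches, real pages usually need a check before indexing into the result.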

3. Download the information to local disk

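Text can be written to a file directly. For an image you need to request the image URL again and write the content attribute of that response (the raw bytes) to a file opened in binary mode.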
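The two write modes can be sketched with placeholder data. Everything below is an assumption for illustration: page_text plays the role of a response's text, image_bytes the role of its content, and the output goes to a temporary folder.

```python
import os
import tempfile

# Stand-ins for scraped data: `page_text` plays the role of response.text,
# `image_bytes` the role of response.content (here just the 8-byte PNG signature).
page_text = "标题: 城市"
image_bytes = b"\x89PNG\r\n\x1a\n"

out_dir = tempfile.mkdtemp()   # temporary folder used only for this demo

# Text: open in text mode and state the encoding explicitly.
text_path = os.path.join(out_dir, "page.txt")
with open(text_path, "w", encoding="utf-8") as f:
    f.write(page_text)

# Images: open in binary mode ("wb") and write the bytes unchanged.
image_path = os.path.join(out_dir, "1.png")
with open(image_path, "wb") as f:
    f.write(image_bytes)

print(os.path.getsize(image_path))
```

Opening an image file in text mode would fail or corrupt the data, which is why the binary flag matters.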

3. Scraping images from ZCOOL (站酷)

PyCharm is used as the development tool here.

# coding: utf-8
# date: 2018/04/04
# target: pictures on ZCOOL

import os

from bs4 import BeautifulSoup
import requests


def get_html(url):
    # Fetch the page and return its body as decoded text.
    html = requests.get(url)
    return html.text


def Download(html, filepath):
    soup = BeautifulSoup(html, 'html.parser')
    # Each picture sits inside a <div class="imgItem maskWraper">.
    urls = soup.find_all('div', class_="imgItem maskWraper")
    count = 1

    try:
        for url in urls:
            img = url.find('img')
            print(img)
            img_url = img['data-original']    # image address stored in data-original
            req = requests.get(img_url)
            # Write the image in binary mode.
            with open(os.path.join(filepath, str(count) + '.jpg'), 'wb') as f:
                f.write(req.content)
            count += 1
            if count == 11:      # stop after ten images
                break
    except Exception as e:
        print(e)


def main():
    url = "http://www.hellorf.com/image/search/%E5%9F%8E%E5%B8%82/?utm_source=zcool_popular"  # target URL
    filepath = "D://桌面/Python/study_one/Spider_practice/Spider_File/icon"                    # folder where images are saved
    html = get_html(url)
    Download(html, filepath)


if __name__ == "__main__":
    main()
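The script above wraps the whole loop in a single try, so one container without an img tag or without a data-original attribute ends the run early. A possible refinement, sketched here offline, is to separate URL extraction from downloading and skip entries that do not fit. The function name extract_image_urls, the fallback to src, and the sample markup are all hypothetical and not part of the original script.

```python
from bs4 import BeautifulSoup


def extract_image_urls(html, limit=10):
    """Return up to `limit` image URLs found in the listing markup."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for item in soup.find_all("div", class_="imgItem maskWraper"):
        img = item.find("img")
        if img is None:
            continue                      # container without an <img>: skip it
        # Falling back to src when data-original is missing is an assumption.
        src = img.get("data-original") or img.get("src")
        if src:
            found.append(src)
        if len(found) >= limit:
            break
    return found


# Hypothetical markup shaped like what the script above expects.
sample = """
<div class="imgItem maskWraper"><img data-original="http://example.com/1.jpg"></div>
<div class="imgItem maskWraper"><span>no image here</span></div>
<div class="imgItem maskWraper"><img src="http://example.com/3.jpg"></div>
"""
print(extract_image_urls(sample))
```

Keeping the parsing step free of network calls also makes it easy to check against saved HTML before any image is requested.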

Original article: https://www.cnblogs.com/authetic/p/8743366.html
