Basic background knowledge for Python web scraping

Date: 2018-08-23 20:09:11

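1. On using global variables:

A global variable cannot simply be picked up and modified inside a function. Reading it works without any declaration, but when a function assigns a new value to it, the variable must first be declared global inside that function, for example: global test, for a global variable named test. Details: https://blog.csdn.net/my2010sam/article/details/17735159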
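
To make the rule concrete, here is a minimal sketch. The function names are made up for illustration; only the variable name test comes from the text above.

```python
test = 0  # module-level (global) variable

def read_only():
    # Reading a global needs no declaration.
    return test + 1

def rebind_without_global():
    # Assigning to the name makes it local to this function, so the
    # read on the right-hand side fails before any value exists.
    try:
        test += 1
    except UnboundLocalError:
        return 'UnboundLocalError'
    return 'ok'

def increment():
    global test  # declare that assignments target the module-level name
    test += 1

result_read = read_only()
result_fail = rebind_without_global()
increment()
print(result_read, result_fail, test)
```

Mutating a global object in place (for example calling append on a global list) does not rebind the name, so it works without the declaration.
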
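2. On finding the original image in a web page:

The images shown on a web page are usually compressed thumbnails, but what we want to scrape is the full-resolution original. In that case, right-click and view the page source, then look there: the link to the original image can usually be found. For example, on Baidu Images the original image link sits under a field named objURL, and a Ctrl+F search finds it right away. Other sites are probably similar, so it is mostly a matter of looking carefully.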
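
The search can also be done in code. The sketch below pulls every objURL value out of a page source with a regular expression. The sample string is invented for illustration (including the thumbURL and width fields and the example.com addresses); a real page would be fetched first.

```python
import re

# Non-greedy match of whatever follows "objURL":" up to the next quote.
OBJ_URL_PATTERN = re.compile(r'"objURL":"(.*?)"')

def extract_original_urls(page_source):
    """Return all objURL values found in the page source, in order."""
    return OBJ_URL_PATTERN.findall(page_source)

# Hypothetical fragment of page source, for demonstration only.
sample_source = (
    '{"thumbURL":"https://example.com/thumb/1.jpg",'
    '"objURL":"https://example.com/full/1.jpg","width":1920},'
    '{"thumbURL":"https://example.com/thumb/2.jpg",'
    '"objURL":"https://example.com/full/2.jpg","width":1280}'
)

originals = extract_original_urls(sample_source)
print(originals)
```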

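3. On the link to the next page:

Sometimes the page URL is very long; Baidu Images URLs, for instance, are long and messy. The usual approach of spotting a pattern in the URL and building the next page's link from parameters is clearly impractical here. So we need another method: right-click to open the page source, then search the source for the text shown on the page, such as "下一页" (next page). Taking Baidu Images as the example again: first switch to paged mode in the upper-right corner, then search the page source. A screenshot follows: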

[Image: screenshot of the page source with the "下一页" (next page) link located]
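
As a rough sketch of the same idea in code, the snippet below scans HTML for the first link whose text contains "下一页" and resolves it against the page URL. It uses only the standard library. The class and function names, the sample markup, and the query parameters are assumptions made for illustration, not Baidu's actual page structure.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> whose text contains a marker."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.current_href = None  # href of the <a> currently open, if any
        self.next_href = None     # href of the matching link, once found

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.current_href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'a':
            self.current_href = None

    def handle_data(self, data):
        if self.next_href is None and self.current_href and self.marker in data:
            self.next_href = self.current_href

def find_next_url(html, base_url, marker='下一页'):
    """Return the absolute URL of the next-page link, or None if absent."""
    finder = NextLinkFinder(marker)
    finder.feed(html)
    if finder.next_href is None:
        return None
    return urljoin(base_url, finder.next_href)

# Hypothetical pager markup, for demonstration only.
sample_html = (
    '<div id="page"><strong><span class="pc">1</span></strong>'
    '<a href="/search/flip?pn=20"><span class="pc">2</span></a>'
    '<a href="/search/flip?pn=40"><span class="pc">3</span></a>'
    '<a href="/search/flip?pn=20"><span class="pc">下一页</span></a></div>'
)

next_url = find_next_url(sample_html, 'https://image.baidu.com/search/flip?word=cat')
print(next_url)
```

Returning None when no link is found lets the caller stop paging cleanly on the last page.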

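4. Finally, the most important point: decoding Chinese text in web pages:

After a successful request with the get function of the requests library, we want to save the page source. But however we save it, the Chinese characters in the source come out garbled. The fix is to run this statement before reading the text and saving it: r.encoding = r.apparent_encoding. Here r.apparent_encoding is the encoding that requests detects from the page content, so the statement sets the response's encoding to the one the page actually uses (paraphrasing an explanation found online). Then write the file with: with open('file.txt', 'w', encoding='utf-8') as f: ... and the saved file will no longer contain garbled Chinese.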
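
The sketch below shows both halves of the recipe. fetch_text applies r.encoding = r.apparent_encoding to a live response; it is not called here, so the example runs offline. The rest demonstrates with plain bytes why a wrong encoding garbles Chinese text. The function names and the GBK sample are assumptions for illustration; ISO-8859-1 stands in for the fallback that requests uses when a text response declares no charset.

```python
import os
import tempfile

def fetch_text(url):
    """Fetch a page and decode it with the encoding detected from its body."""
    import requests  # third-party; imported lazily so the offline demo runs without it
    r = requests.get(url, timeout=10)
    r.encoding = r.apparent_encoding  # must be set before reading r.text
    return r.text

def save_text(text, path):
    """Write text to disk as UTF-8, whatever the page's original encoding was."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

# Offline demonstration: bytes of a GBK-encoded page.
raw = '中文网页'.encode('gbk')
wrong = raw.decode('iso-8859-1')  # wrong guess: garbled characters
right = raw.decode('gbk')         # correct encoding: readable text

path = os.path.join(tempfile.mkdtemp(), 'file.txt')
save_text(right, path)
with open(path, encoding='utf-8') as f:
    restored = f.read()
print(repr(wrong), right, restored)
```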

Here is the complete code:

import os
import re
import time
from hashlib import md5
from urllib.parse import quote

import requests
from fake_useragent import UserAgent

url_list = []   # URLs of the result pages, in crawl order
page_num = 1    # number of the page currently being fetched
ua = UserAgent()
headers = {
    'User-Agent': ua.random   # a randomly chosen browser User-Agent string
}
def get_one_page(url):
    """Fetch one result page and return its HTML, or None on failure."""
    global page_num   # reassigned below, so the declaration is required
    try:
        r = requests.get(url=url, headers=headers, timeout=10)
    except requests.RequestException:
        return None
    if r.status_code != 200:
        return None
    print("page %s status_code = %s" % (page_num, r.status_code))
    page_num = page_num + 1
    return r.text

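def get_image_list(html):
    """Return the original-image URLs on a page and queue the next page's URL."""
    if not html:
        return []
    # Original (uncompressed) image links are stored under objURL
    pattern_1 = re.compile('objURL":"(.*?)",', re.S)
    image_list = re.findall(pattern_1, html)

    # Next-page link: the <a> wrapping the span marked data="right",
    # located after the current page number inside <strong>
    pattern_2 = re.compile('<strong><span class="pc"(.*?)<a href="(.*?)"><span class="pc" data="right"', re.S)
    list_2 = re.findall(pattern_2, html)
    if list_2:
        next_url = 'https://image.baidu.com' + list_2[0][1]
        # append() mutates the list in place, so no global declaration is needed
        url_list.append(next_url)
    return image_list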

def save_image(image_list):
    """Download every image into picture/, named by the MD5 of its content."""
    if not os.path.exists('picture'):
        os.mkdir('picture')
    for item in image_list:
        try:
            response = requests.get(url=item, headers=headers, timeout=10)
        except requests.RequestException:
            # Skip only this image and keep downloading the rest
            print("fail to download: " + item)
            continue
        file_path = '{0}/{1}.{2}'.format('picture', md5(response.content).hexdigest(), 'jpg')
        if not os.path.exists(file_path):
            with open(file_path, 'wb') as f:
                f.write(response.content)
            print("success download: " + file_path)
        else:
            print("already down: " + file_path)
        time.sleep(2)   # pause between requests

if __name__ == '__main__':
    keyword = input("Keyword to crawl: ")            # what to search for
    page = int(input("Number of pages to crawl: "))  # how many result pages

    keyword = quote(keyword)   # percent-encode the keyword for use in the URL
    url = 'https://image.baidu.com/search/flip?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1535006333854_R&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&ctd=1535006333855%5E00_1903X943&word=' + keyword
    url_list.append(url)
    for each in range(page):
        if each >= len(url_list):
            break   # no next-page link was found on the previous page
        print(url_list[each])
        html = get_one_page(url_list[each])
        if html is None:
            break   # the request failed, so stop paging
        image_list = get_image_list(html)
        save_image(image_list)


Original article: https://www.cnblogs.com/myxdashuaige/p/9525887.html
