A Small Image-Scraping Crawler

Date: 2016-12-18 23:10:13

Tags: www imp header gen size item print img ssid

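A friend needed to recognize the digits that appear in images. I happen to be looking into this area, so I put together a quick demo as a warm-up.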

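The starting point is an existing image gallery. Its image URLs follow a fairly regular pattern, so building them is easy. When fetching the images, however, every request kept redirecting to an authentication page. The cookie was probably missing; adding it to the request headers fixed the problem.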
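One way to notice this kind of redirect early is to inspect the response before parsing it. The following is only a minimal sketch: the function name, the `login_marker` substring, and the example URLs are hypothetical, since the real authentication page URL is not given here. It relies on the `history` and `url` attributes that a `requests.Response` provides, and uses stand-in objects so it runs offline.

```python
from types import SimpleNamespace


def redirected_to_login(response, login_marker="login"):
    """Guess whether a response ended up on an authentication page.

    `response` is any object with `.history` (the redirect hops) and
    `.url` (the final URL), such as a requests.Response.
    `login_marker` is an assumed substring of the login page URL.
    """
    followed_redirect = len(response.history) > 0
    return followed_redirect and login_marker in response.url.lower()


# Offline demonstration with stand-in objects instead of real responses.
ok = SimpleNamespace(history=[], url="http://example.com/photo/123")
bounced = SimpleNamespace(
    history=["302"], url="http://example.com/login?next=/photo/123"
)
print(redirected_to_login(ok))       # False
print(redirected_to_login(bounced))  # True
```

If the check returns True for the response returned by `requests.get`, the cookie is likely missing or expired, and it is better to stop than to save the login page as an image.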

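(A reminder: if you run into obstacles while scraping images, there is usually a way around them. See "Why Can't So Many Websites Be Crawled? Six Common Ways for Crawlers to Get Past Blocking", http://www.cnblogs.com/junrong624/p/5533655.html

and "Four Basic Anti-Crawler Strategies", http://www.cnblogs.com/junrong624/p/5508934.html)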

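Since this is only a test, I fetched a few hundred images at random. One more note: do not send requests too frequently, because that puts heavy load on the server. Be even more careful with multithreading; if you really need it, scrape late at night when traffic is low.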

Here is the code:

import os
import random

import requests
from bs4 import BeautifulSoup

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
# The gallery pages need the session cookie; without it the request is
# redirected to the authentication page.
headers = {"User-Agent": USER_AGENT,
           "Cookie": "PHPSESSID=6f37dmpn8m63gadp94d70amn15; fixlogin=1; Hm_lvt_8b50e88d65e01e75f6e24de31d23b934=1481994476; Hm_lpvt_8b50e88d65e01e75f6e24de31d23b934=1482055955"}
# The image files themselves are requested without the cookie.
baidu_headers = {"User-Agent": USER_AGENT}
photo_url = "URL"  # replace with the target URL

os.chdir(r"D:\huipao")  # raw string, so the backslash is not treated as an escape
for count in range(1, 201):
    # Pick a random photo page and locate the image shown on it.
    photo_id = random.randint(1, 100000)
    img_html = requests.get(photo_url + str(photo_id), headers=headers)
    img_soup = BeautifulSoup(img_html.text, "lxml")
    item = img_soup.find("div", class_="item active")
    img_tag = item.find("img") if item is not None else None
    if img_tag is None:
        continue  # no image on this page (for example, an id that does not exist)
    img_url = img_tag["src"]
    print(img_url)
    img = requests.get(img_url, headers=baidu_headers)
    # Write in binary mode, replacing any file left over from a previous run.
    with open(str(count) + ".jpg", "wb") as f:
        f.write(img.content)
print("end")
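The advice about request frequency can be applied by pausing between downloads. The following is a minimal sketch rather than part of the script itself: `polite_delay` and its default values are hypothetical, and suitable delays depend on the target server.

```python
import random
import time


def polite_delay(base_seconds=1.0, jitter_seconds=0.5, sleep=time.sleep):
    """Pause for base_seconds plus a random extra of up to jitter_seconds.

    The `sleep` parameter exists only so the pause can be replaced in tests.
    Returns the number of seconds that were requested.
    """
    delay = base_seconds + random.uniform(0, jitter_seconds)
    sleep(delay)
    return delay


# Record the requested pauses instead of actually sleeping.
requested = []
for _ in range(3):
    polite_delay(base_seconds=1.0, jitter_seconds=0.5, sleep=requested.append)
print(len(requested))  # 3
```

Calling `polite_delay()` once per loop iteration, after each image is saved, spreads the requests out; the random jitter keeps them from arriving at a fixed rhythm.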

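The plan from here:
1: Store the numbers recognized in each image; a single image may yield several.
2: How should these numbers be looked up? Sort them and search through them in order for now; hashing and the like can wait.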
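The two steps above could be sketched as follows. This is only one possible layout, under several assumptions: the recognition step is not shown, the `recognized` mapping and both function names are hypothetical, and numbers are stored as strings so that any leading zeros survive. Because the list is sorted, the sketch swaps the plain sequential scan for a binary search using the standard `bisect` module.

```python
from bisect import bisect_left


def build_index(recognized):
    """Flatten {image_name: [numbers]} into a sorted list of (number, image_name)."""
    pairs = [
        (number, name)
        for name, numbers in recognized.items()
        for number in numbers
    ]
    pairs.sort()
    return pairs


def find_images(index, number):
    """Return every image name whose recognized numbers include `number`."""
    # (number,) sorts before every (number, name) pair, so bisect_left
    # lands on the first entry for that number, if there is one.
    start = bisect_left(index, (number,))
    matches = []
    for found, name in index[start:]:
        if found != number:
            break
        matches.append(name)
    return matches


# Made-up recognition results; one image may yield several numbers or none.
recognized = {"1.jpg": ["1024", "77"], "2.jpg": ["77"], "3.jpg": []}
index = build_index(recognized)
print(find_images(index, "77"))    # ['1.jpg', '2.jpg']
print(find_images(index, "9999"))  # []
```

A dictionary keyed by number would give the hash-based lookup mentioned in step 2; the sorted list is kept here because it matches the sort-first plan.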
keep on moving


Original source: http://www.cnblogs.com/marszhw/p/6195484.html
