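Generally speaking, any reasonably well-run website puts anti-crawling restrictions in place, and these restrictions take several main forms.
This post focuses on getting around the second of them (limits tied to the client IP) by crawling from multiple IP addresses.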
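Multi-IP crawling itself comes in the following forms:
1. ADSL dial-up
I normally do ADSL dialing on Windows and have not tried it on other platforms. On Windows, the Python code I usually use is: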
    # -*- coding: utf-8 -*-
    import subprocess
    import time

    # "name" is the name of the Windows dial-up connection entry.
    g_adsl_account = {"name": "宽带连接", "username": "xxxx", "password": "xxxx"}


    class Adsl(object):
        # =============================
        # __init__ : load the default ADSL account
        # =============================
        def __init__(self):
            self.name = g_adsl_account["name"]
            self.username = g_adsl_account["username"]
            self.password = g_adsl_account["password"]

        # =============================
        # set_adsl : change the ADSL settings
        # =============================
        def set_adsl(self, account):
            self.name = account["name"]
            self.username = account["username"]
            self.password = account["password"]

        # =============================
        # connect : dial the broadband connection
        # =============================
        def connect(self):
            # Passing a list avoids quoting problems with names that contain spaces.
            subprocess.call(["rasdial", self.name, self.username, self.password])
            time.sleep(5)

        # =============================
        # disconnect : hang up the broadband connection
        # =============================
        def disconnect(self):
            subprocess.call(["rasdial", self.name, "/disconnect"])
            time.sleep(5)

        # =============================
        # reconnect : hang up and dial again
        # =============================
        def reconnect(self):
            self.disconnect()
            self.connect()
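Redialing does not guarantee a different address, so it can help to check the public IP before and after. The following is a minimal sketch of that idea, not part of the original code: `redial_until_ip_changes`, `get_ip`, and `settle_seconds` are hypothetical names, and both the IP lookup and the redial action are passed in as callables so the loop can be exercised without a real connection.

```python
import time
from typing import Callable


def redial_until_ip_changes(
    get_ip: Callable[[], str],
    reconnect: Callable[[], None],
    max_attempts: int = 3,
    settle_seconds: float = 0.0,
) -> str:
    """Redial until the public IP differs from the one seen at the start.

    Returns the new IP, or raises RuntimeError after max_attempts redials.
    """
    old_ip = get_ip()
    for _ in range(max_attempts):
        reconnect()
        time.sleep(settle_seconds)  # give the link time to come back up
        new_ip = get_ip()
        if new_ip != old_ip:
            return new_ip
    raise RuntimeError("IP did not change after %d redials" % max_attempts)
```

With the class above, `reconnect` could be `Adsl().reconnect`, and `get_ip` could be a small function that requests `http://httpbin.org/ip` and reads the `origin` field of the JSON reply.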
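2. Redialing through the router
If the machine sits on a LAN behind a router and calling the Windows rasdial command directly cannot dial, you can log in to the router programmatically and tell it to redial, which changes the IP. This is a workaround that reaches the goal indirectly. The example below logs in to a Xiaomi router: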
    # -*- coding: utf-8 -*-
    import datetime
    import hashlib
    import random
    import re
    import time

    import requests


    class Adsl(object):
        def __init__(self):
            self.host = '192.168.31.1'  # router address, no trailing slash
            self.username = 'admin'
            self.password = 'huangxin250'

        def connect(self):
            base = 'http://' + self.host + '/cgi-bin/luci'
            # The home page embeds the key and device ID used in the login hash.
            home_request = requests.get(base + '/web/home', timeout=15)
            key = re.findall(r"key: '(.*)',", home_request.text)[0]
            mac = re.findall(r"deviceId = '(.*)';", home_request.text)[0]

            nonce = "0_" + mac + "_" + str(int(time.time())) + "_" + str(random.randint(1000, 10000))
            # password = SHA1(nonce + SHA1(plain_password + key))
            hexpwd1 = hashlib.sha1((self.password + key).encode('utf-8')).hexdigest()
            hexpwd2 = hashlib.sha1((nonce + hexpwd1).encode('utf-8')).hexdigest()
            data = {
                "logtype": 2,
                "nonce": nonce,
                "password": hexpwd2,
                "username": self.username
            }
            response = requests.post(url=base + '/api/xqsystem/login', data=data, timeout=15)
            token = response.json()['token']

            # Stop and restart PPPoE so the router obtains a new IP.
            api = base + '/;stok=' + token + '/api/xqnetwork/'
            requests.get(api + 'pppoe_stop', timeout=15)
            # time.sleep(1)
            requests.get(api + 'pppoe_start', timeout=15)

            nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M')
            print(nowtime + ', congratulations, the IP is changed !')
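This gets the router to change the IP for us. The drawback is just as obvious: it is not as general as the first method. Essentially every router model needs its own code, so this is custom work each time.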
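The router-specific part of the Xiaomi example is mostly the login hash: a SHA-1 of the password plus the page key, hashed again together with a nonce. As a rough sketch, that step can be pulled out into two small functions so it can be checked without a router on hand. The names `make_nonce` and `login_password` are hypothetical, and the scheme is simply the one the example above uses; whether other firmware versions use the same scheme is not something the source confirms.

```python
import hashlib
import random
import time


def make_nonce(mac: str) -> str:
    """Build a nonce in the form used above: 0_<deviceId>_<unix time>_<random>."""
    return "0_%s_%d_%d" % (mac, int(time.time()), random.randint(1000, 10000))


def login_password(password: str, key: str, nonce: str) -> str:
    """Return SHA1(nonce + SHA1(password + key)) as a hex string."""
    inner = hashlib.sha1((password + key).encode("utf-8")).hexdigest()
    return hashlib.sha1((nonce + inner).encode("utf-8")).hexdigest()
```

Keeping the hash separate from the HTTP calls means that supporting another router model mostly comes down to swapping these two functions and the endpoint paths.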
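3. Proxy IPs
Proxy IPs are the most common way to crawl from multiple IPs. Point the HTTP client at a proxy address and its requests go out through that proxy's IP. The drawback is that crawl speed depends directly on the speed of the proxy. Good proxies are also fairly expensive, while free ones are generally slow.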
Below is code for crawling through a proxy IP with requests and with selenium.
requests:
    # -*- coding: utf-8 -*-
    import requests

    s = requests.session()
    # Keys are URL schemes; add an 'https' entry to proxy HTTPS URLs as well.
    proxie = {
        'http': 'http://122.193.14.102:80'
    }
    url = 'xxx'  # placeholder for the target URL
    # verify=False skips TLS certificate verification.
    response = s.get(url, verify=False, proxies=proxie, timeout=20)
    print(response.text)
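The requests snippet uses a single hard-coded proxy. Since proxy quality varies so much, one common extension is to rotate through a small pool and move on when a proxy fails. The sketch below illustrates that under stated assumptions: `fetch_with_rotation` and its `fetch` callback are hypothetical names, the same proxy address is assumed to serve both HTTP and HTTPS, and the network call is injected so the rotation logic can be tested offline.

```python
import itertools
from typing import Callable, Dict, Iterable, Optional


def fetch_with_rotation(
    url: str,
    proxies: Iterable[str],
    fetch: Callable[[str, Dict[str, str]], str],
    max_attempts: int = 5,
) -> str:
    """Try proxies in round-robin order until one fetch succeeds."""
    proxy_list = list(proxies)
    if not proxy_list:
        raise ValueError("at least one proxy is required")
    pool = itertools.cycle(proxy_list)
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        proxy = next(pool)
        try:
            # Assumption: the same proxy handles both schemes.
            return fetch(url, {"http": proxy, "https": proxy})
        except OSError as exc:
            # requests' exceptions derive from IOError (OSError), as do
            # the built-in connection errors.
            last_error = exc
    raise RuntimeError("all %d attempts failed" % max_attempts) from last_error
```

With requests, `fetch` could be something like `lambda url, p: requests.get(url, proxies=p, timeout=20).text`.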
selenium:
    # Requires Selenium 3.x: PhantomJS support was removed in Selenium 4.
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    from selenium.webdriver.common.proxy import Proxy, ProxyType

    proxy = Proxy(
        {
            'proxyType': ProxyType.MANUAL,
            'httpProxy': 'ip:port'  # placeholder for the proxy address
        }
    )
    desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
    proxy.add_to_capabilities(desired_capabilities)
    driver = webdriver.PhantomJS(
        executable_path="/path/of/phantomjs",
        desired_capabilities=desired_capabilities
    )
    driver.get('http://httpbin.org/ip')
    print(driver.page_source)
    driver.quit()
Original article: http://www.cnblogs.com/jxhd1/p/7798621.html