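While scraping, you will almost inevitably run into banned IPs, 403 errors, and the like. These are anti-scraping measures that a site applies once it detects that you are a crawler. Here is my own summary of how to avoid them.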
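import time  # standard-library time module
time.sleep(3)  # pause for 3 seconds between requests
# With Selenium, wait for an element instead; wait1 is assumed to be a WebDriverWait instance, e.g. WebDriverWait(driver, 10)
wait1.until(lambda driver: driver.find_element_by_xpath("//div[@id='link-report']/span"))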
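A fixed three-second pause is itself a very regular pattern, so one common variation is to add a random extra delay to each pause. The following is a minimal sketch in Python 3, not something from the original post: the name `polite_sleep`, the `jitter` parameter, and the default values are assumptions chosen for illustration.

```python
import random
import time


def polite_sleep(base=3.0, jitter=2.0, sleep=time.sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds.

    Returns the delay that was used. The `sleep` argument is injectable so the
    function can be exercised without actually waiting.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay
```

Calling `polite_sleep()` between two page fetches then waits somewhere between 3 and 5 seconds instead of exactly 3.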
import urllib2
req = urllib2.Request(url)  # url is the address of the page to fetch
# The only extra step is the next line: send a browser-like User-Agent header
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
response = urllib2.urlopen(req)
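In Python 3, `urllib2` was split into `urllib.request` and `urllib.error`, so the same header trick looks slightly different there. A minimal sketch, assuming Python 3; the helper name `build_request` is hypothetical, and the User-Agent string is the one used above.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.ip181.com/")
# urllib stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))
# To actually send it: response = urllib.request.urlopen(req, timeout=10)
```

Building the request is kept separate from sending it, so the header can be inspected without touching the network.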
# -*- coding: utf-8 -*-
import urllib2
url = "http://www.ip181.com/"
proxy_support = urllib2.ProxyHandler({'http': '121.40.108.76'})
# The argument is a dict {'type': 'proxy ip:port'}; no port is given here, so append the real proxy's port
opener = urllib2.build_opener(proxy_support)
# Customize the opener
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')]
# addheaders sets the disguised User-Agent on every request made through this opener
urllib2.install_opener(opener)
response = urllib2.urlopen(url)
print response.read().decode('gbk')
# -*- coding: utf-8 -*-
import urllib2
import random
ip_list = ['119.6.136.122', '114.106.77.14']
# Keep a list of proxy IPs and let random.choice pick one of them for each run
url = "http://www.ip181.com/"
proxy_support = urllib2.ProxyHandler({'http': random.choice(ip_list)})
# The argument is a dict {'type': 'proxy ip:port'}; no ports are given here, so append each real proxy's port
opener = urllib2.build_opener(proxy_support)
# Customize the opener
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')]
# addheaders sets the disguised User-Agent on every request made through this opener
urllib2.install_opener(opener)
response = urllib2.urlopen(url)
print response.read().decode('gbk')
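Free proxies fail often, so picking one at random is usually combined with trying another when a request fails. Below is a minimal Python 3 sketch of that idea, offered as one possible approach rather than the original author's method. The proxy addresses are hypothetical documentation-only values, and `make_opener`, `fetch_with_rotation`, and the `open_with` hook are names invented for this example.

```python
import random
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
)

# Hypothetical proxies in "ip:port" form; replace with working ones.
PROXY_POOL = ["203.0.113.10:8080", "203.0.113.11:3128"]


def make_opener(proxy):
    """Build an opener that routes HTTP through `proxy` and sends USER_AGENT."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener


def _default_open(opener, url):
    """Fetch `url` through `opener` and return the raw response body."""
    return opener.open(url, timeout=10).read()


def fetch_with_rotation(url, pool, attempts=3, open_with=_default_open):
    """Try up to `attempts` randomly chosen proxies; return the first body."""
    last_error = None
    for _ in range(attempts):
        proxy = random.choice(pool)
        try:
            return open_with(make_opener(proxy), url)
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            last_error = exc
    raise RuntimeError("all proxy attempts failed") from last_error
```

With real proxies in `PROXY_POOL`, `fetch_with_rotation("http://www.ip181.com/", PROXY_POOL).decode("gbk")` would mirror the script above. The `open_with` hook exists only so the retry logic can be exercised without network access.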
from selenium import webdriver
#from selenium.webdriver.remote.webelement import WebElement
url = 'http://pythonscraping.com/pages/itsatrap.html'
driver = webdriver.PhantomJS(executable_path="phantomjs.exe")
driver.get(url)
# Links that are present in the page but not displayed are honeypot traps
links = driver.find_elements_by_tag_name("a")
for link in links:
    if not link.is_displayed():
        print "the link " + link.get_attribute("href") + " is a trap"
# Hidden form fields must be submitted with their preset values
fields = driver.find_elements_by_tag_name("input")
for field in fields:
    if not field.is_displayed():
        print "do not change value of " + field.get_attribute("name")
Output:
the link http://pythonscraping.com/dontgohere is a trap
do not change value of phone
do not change value of email
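When no browser is available, a rough static check can catch the simplest traps. The sketch below uses only the Python 3 standard library and is a weaker substitute for `is_displayed()`, not an equivalent: it only notices `type="hidden"` inputs and inline `display:none` or `visibility:hidden` styles, and it misses anything hidden through a stylesheet or script. The class name and the sample HTML are made up for illustration and are not the markup of the page above.

```python
from html.parser import HTMLParser


class HiddenElementFinder(HTMLParser):
    """Collect links and inputs that are hidden via inline attributes only."""

    def __init__(self):
        super().__init__()
        self.trap_links = []
        self.hidden_fields = []

    @staticmethod
    def _inline_hidden(attrs):
        # Normalize the inline style so "display: none" and "display:none" match.
        style = (attrs.get("style") or "").replace(" ", "").lower()
        return "display:none" in style or "visibility:hidden" in style

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self._inline_hidden(attrs):
            self.trap_links.append(attrs.get("href"))
        elif tag == "input":
            is_hidden_type = (attrs.get("type") or "").lower() == "hidden"
            if is_hidden_type or self._inline_hidden(attrs):
                self.hidden_fields.append(attrs.get("name"))


# Hypothetical sample page, written for this example.
SAMPLE = """
<a href="/dontgohere" style="display: none;">Go here!</a>
<a href="/safe">Click me</a>
<form>
  <input type="hidden" name="phone" value="doNotModify">
  <input type="text" name="email" style="visibility:hidden">
  <input type="text" name="firstName">
</form>
"""

finder = HiddenElementFinder()
finder.feed(SAMPLE)
print(finder.trap_links)     # ['/dontgohere']
print(finder.hidden_fields)  # ['phone', 'email']
```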
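Distributed crawling is aimed at larger crawler systems. The implementation steps are as follows (a supplement based on what I saw on Zhihu):
1. A basic HTTP fetching tool, such as Scrapy.
2. A way to avoid fetching the same page twice, such as a Bloom filter.
3. A distributed queue that every machine in the cluster can share efficiently.
4. Integration of the distributed queue with Scrapy.
5. Post-processing: page content extraction (python-goose) and storage (MongoDB).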
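To make the duplicate-avoidance step more concrete, here is a minimal in-memory Bloom filter sketch in Python 3. It is an illustration under stated assumptions, not a production component: the class and function names, the bit-array size, and the number of hashes are all chosen for this example, and a real distributed crawler would keep this state in a store shared by the cluster.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            data = ("%d:%s" % (i, item)).encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def unseen_urls(urls, seen=None):
    """Return the URLs not yet recorded in `seen`, recording them as it goes."""
    seen = seen if seen is not None else BloomFilter()
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

The trade-off is that a Bloom filter may occasionally report an unseen URL as already seen, so a few pages can be skipped, in exchange for memory use that does not grow with the number of URLs.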
Original article: http://www.cnblogs.com/tian-sun/p/7404439.html