
Anti-Crawler Countermeasures: Reusable Methods

Date: 2017-10-14 18:37:08


Today I'm summarizing the common ways a crawler can get around anti-crawling measures, so that later I can just copy and reuse them ^o^


1. Set the User-Agent (chosen at random)

  •    With the Scrapy framework

        (1) Configure settings.py

             

    USER_AGNET_LIST = [
       "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0; DigExt) ",
       "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0; TUCOWS) ",
       "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0; .NET CLR 1.1.4322) ",
       "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 ) ",
       "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; by TSG) ",
       "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; .NET CLR 1.0.3705) ",
       "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; .NET CLR 1.1.4322) ",
       "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; en) Opera 8.0 ",
       "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) ",
       "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
       "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
       "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
       "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
       "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
       "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
       "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
       "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
       "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
       "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
       "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
       "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
       "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
       "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
       "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
       "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
       "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
       "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
       "Mozilla/2.02E (Win95; U)",
       "Mozilla/3.01Gold (Win95; I)",
       "Mozilla/4.8 [en] (Windows NT 5.1; U)",
       "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
       "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
       "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
       "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
       "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
       "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
       "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
       "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
       "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
       "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
       "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
       "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
       "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
       "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
       "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
       "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
       "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
       "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
       "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
       "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
       "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
       "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
       "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
       "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
       "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
       "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
       "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
       "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
       "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
       "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
       "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
       "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
       "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",

    ]

      (2) Configure middlewares.py

       

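    import random

    # Assumes the project package is named Douban, as in DOWNLOADER_MIDDLEWARES
    from Douban.settings import USER_AGNET_LIST


    class RandomUserAgentMilleware(object):
        def process_request(self, request, spider):
            # Pick a random User-Agent string from the list
            ua = random.choice(USER_AGNET_LIST)
            # print(ua)
            # Set it as the User-Agent header of the outgoing request
            request.headers['User-Agent'] = ua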

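process_request() must do exactly one of the following: return None, return a Response object, return a Request object, or raise IgnoreRequest.

If it returns None, Scrapy continues processing the request and runs the corresponding methods of the other middlewares, until the appropriate download handler is called and the request is performed (its response is downloaded).

If it returns a Response object, Scrapy does not call any other process_request() or process_exception() methods, nor the download function; it returns that response. The process_response() methods of the installed middlewares are still called on every response.

If it returns a Request object, Scrapy stops calling process_request() methods and reschedules the returned request. Once the newly returned request has been performed, the appropriate middleware chain is called on the downloaded response.

If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middlewares are called. If none of them handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).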
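The four outcomes described above can be illustrated with a toy dispatcher. This is a simplified model for illustration only, not Scrapy's actual implementation: the Request, Response, and IgnoreRequest classes and the run_chain function are hypothetical stand-ins defined right here, and errback handling is left out.

```python
class Request:
    def __init__(self, url):
        self.url = url


class Response:
    def __init__(self, url, body=""):
        self.url = url
        self.body = body


class IgnoreRequest(Exception):
    pass


def run_chain(middlewares, request, download):
    """Call each process_request-style function in order and act on its result."""
    try:
        for process_request in middlewares:
            result = process_request(request)
            if result is None:
                continue  # keep going down the chain
            if isinstance(result, Response):
                return ("response", result)  # the download is skipped
            if isinstance(result, Request):
                return ("reschedule", result)  # stop and queue the new request
            raise TypeError("process_request returned an unexpected value")
    except IgnoreRequest:
        return ("ignored", None)
    # Every middleware returned None, so the request is actually downloaded
    return ("response", download(request))


def fake_download(request):
    return Response(request.url, "downloaded")


def pass_through(request):
    return None


def serve_cached(request):
    return Response(request.url, "cached")


kind, resp = run_chain([pass_through, serve_cached], Request("http://example.com/"), fake_download)
print(kind, resp.body)
```

In this model serve_cached short-circuits the chain, so fake_download is never called.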

Parameters:
  • request (Request object) – the request being processed
  • spider (Spider object) – the spider this request belongs to
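Outside Scrapy, the same random User-Agent idea can be sketched with only the standard library. The short USER_AGENTS pool, the build_request helper, and the example.com URL below are assumptions made for illustration; in practice the full list from settings.py would be reused.

```python
import random
import urllib.request

# Hypothetical short pool; two entries copied from the list above
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
]


def build_request(url, user_agents=USER_AGENTS, rng=random):
    """Return a urllib Request carrying a randomly chosen User-Agent."""
    ua = rng.choice(user_agents)
    return urllib.request.Request(url, headers={"User-Agent": ua})


req = build_request("http://example.com/")
# urllib stores the header name as "User-agent"
print(req.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen() would then send it with the chosen header.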

2. Set up IP proxies

 Where to get proxies: http://www.goubanjia.com/ and http://www.kuaidaili.com/

 (1) Configure settings.py

             

    PROXY_LIST = [
        {"ip_port": "121.41.8.23:16816", "user_passwd": "morganna_mode_g:ggc22qxp"}
    ]
    # This proxy may no longer work; swap in your own IP, username, and password.
    # This is the proxy pool; it holds a single entry here just for demonstration.

   (2) Configure middlewares.py

      

    import base64
    import random

    # Assumes the project package is named Douban, as in DOWNLOADER_MIDDLEWARES
    from Douban.settings import PROXY_LIST


    class RandomIPProxyMiddleware(object):
        def process_request(self, request, spider):
            # Pick a random proxy from the pool
            proxy = random.choice(PROXY_LIST)
            # Authenticated proxies carry a username and password; free ones do not
            if 'user_passwd' in proxy:
                # Base64-encode the "user:password" string
                b64_user_pwd = base64.b64encode(proxy['user_passwd'].encode())
                # Send the credentials to the proxy
                request.headers['Proxy-Authorization'] = 'Basic ' + b64_user_pwd.decode()
            # Route the request through the proxy (same for both kinds)
            request.meta['proxy'] = 'http://' + proxy['ip_port']
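The credential handling in this middleware can be checked in isolation. The sketch below pulls it into a hypothetical helper, proxy_settings, that needs no Scrapy; the 127.0.0.1 address and the user:pass credentials are made-up sample values.

```python
import base64


def proxy_settings(proxy):
    """Return (proxy_url, auth_header_or_None) for one PROXY_LIST-style entry."""
    proxy_url = "http://" + proxy["ip_port"]
    creds = proxy.get("user_passwd")
    if not creds:
        return proxy_url, None  # free proxy, no Proxy-Authorization header
    token = base64.b64encode(creds.encode("utf-8")).decode("ascii")
    return proxy_url, "Basic " + token


print(proxy_settings({"ip_port": "127.0.0.1:8080", "user_passwd": "user:pass"}))
# ('http://127.0.0.1:8080', 'Basic dXNlcjpwYXNz')
```

The first value corresponds to request.meta['proxy'] and the second to the Proxy-Authorization header.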

 Finally, don't forget to enable the middlewares in settings.py:

  

    DOWNLOADER_MIDDLEWARES = {
        # 'Douban.middlewares.MyCustomDownloaderMiddleware': 543,
        'Douban.middlewares.RandomUserAgentMilleware': 543,
        'Douban.middlewares.RandomIPProxyMiddleware': 544,
    }
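The numbers are priorities: Scrapy calls process_request() on the middlewares in increasing order, and a value of None disables an entry. The sketch below only illustrates that ordering rule. The request_order function is hypothetical, and it ignores the built-in middlewares that Scrapy merges in from DOWNLOADER_MIDDLEWARES_BASE.

```python
DOWNLOADER_MIDDLEWARES = {
    "Douban.middlewares.RandomUserAgentMilleware": 543,
    "Douban.middlewares.RandomIPProxyMiddleware": 544,
}


def request_order(middlewares):
    """Names sorted by ascending priority, skipping entries disabled with None."""
    enabled = {name: prio for name, prio in middlewares.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


for name in request_order(DOWNLOADER_MIDDLEWARES):
    print(name)
```

With the values above, the User-Agent middleware runs before the proxy middleware on each request.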

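3. Simulate a login with a POST request to fetch data

  Reference: http://www.cnblogs.com/adc8868/p/7256078.html

  Approach: log in manually, inspect the form data that the browser sends, then replicate that form data structure in your own request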

   

    from scrapy import Request, Selector

    # The two methods below belong to the spider class.
    # Base_Spider is a helper class that is not shown in this post.

    def start_requests(self):   # use start_requests() instead of start_urls
        spider = Base_Spider('zhixing', ['Host', 'Origin', 'Referer'])
        posturl = 'http://zhixing.bjtu.edu.cn/member.php?mod=logging&action=login&loginsubmit=yes&infloat=yes&lssubmit=yes&inajax=1'
        postdata = {
            'username': '***',
            'password': '*****',
            'quickforward': 'yes',
            'handlekey': 'ls'
        }
        cookies = spider.login(posturl, postdata)
        url = 'http://zhixing.bjtu.edu.cn/thread-1047622-1-1.html'  # some other page
        # Fetch the page using the cookies obtained at login
        return [Request(url, cookies=cookies, callback=self.parse_page, headers=spider.headers)]

    def parse_page(self, response):
        sel = Selector(response)
        r = sel.xpath('//td[@id="postmessage_10415551"]/text()').extract_first()  # extract the data
        print(r)
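Since Base_Spider.login() is not shown, here is a minimal standard-library sketch of what such a login step might look like: encode the form fields, POST them, and keep the returned cookies for later requests. The build_login_request and make_session helpers, the example.com endpoint, and the field values are all assumptions for illustration, not the helper's real code.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(post_url, form):
    """Encode the form fields and return a POST request carrying them."""
    data = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(post_url, data=data, method="POST")


def make_session():
    """Return an opener that stores cookies, plus the jar that holds them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Hypothetical endpoint and field values
login = build_login_request(
    "http://example.com/login",
    {"username": "alice", "password": "secret", "handlekey": "ls"},
)
opener, jar = make_session()
# opener.open(login) would submit the form; later opener.open(url) calls
# would send back whatever cookies the site set at login.
```

The field names have to match the form data observed in the browser, which is why the manual login step comes first.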

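One more note:

    Some sites use Referer-based hotlink protection to identify crawlers. In that case we need to set the request headers ourselves.

    For example: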

    

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html, application/xhtml+xml, application/xml',
  'Accept-Language': 'zh-CN,zh;q=0.8',
  'Host': 'ip84.com',
  'Referer': 'http://ip84.com/',
  'X-XHR-Referer': 'http://ip84.com/'
}
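A fixed Host and Referer in DEFAULT_REQUEST_HEADERS applies to every request, which fits a crawl of a single site. For several sites, one option is to derive both headers from each page URL. The referer_headers helper and the /gn path below are hypothetical, shown only as a sketch.

```python
from urllib.parse import urlsplit


def referer_headers(page_url):
    """Build Host and Referer headers that point back at the page's own site."""
    parts = urlsplit(page_url)
    origin = f"{parts.scheme}://{parts.netloc}/"
    return {"Host": parts.netloc, "Referer": origin}


print(referer_headers("http://ip84.com/gn"))
# {'Host': 'ip84.com', 'Referer': 'http://ip84.com/'}
```

In Scrapy, a dict like this can be passed as the headers argument of a Request.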


         



Original article: http://www.cnblogs.com/syketw23/p/7667629.html
