Collecting Baidu URLs with Python

Date: 2017-08-25 15:59:05

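Modules used: threading (for multithreading), requests, and BeautifulSoup.

What it does: collects URLs from Baidu search results, with the keyword and the number of threads set on the command line.

The code is as follows: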

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date    : 2017-08-25 12:47:59
# @Author : arong
# @Link    :
# @Version : $Id$

import queue
import sys
import threading

import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}


class BaiduSpider(threading.Thread):
    """Worker thread that takes result-page URLs from a shared queue."""

    def __init__(self, url_queue):
        threading.Thread.__init__(self)
        self._queue = url_queue

    def run(self):
        while True:
            try:
                # get_nowait() cannot block forever if another thread
                # empties the queue between a check and the get
                url = self._queue.get_nowait()
            except queue.Empty:
                break
            try:
                self.spider(url)
            except Exception as e:
                print(e)

    def spider(self, url):
        r = requests.get(url=url, headers=headers, timeout=8)
        soup = bs(r.content, 'lxml')
        # result links are the <a> tags with class "c-showurl"
        results = soup.find_all(name='a', attrs={'class': 'c-showurl'})
        for link in results:
            # request the link; r_get_url.url is the final URL after redirects
            r_get_url = requests.get(url=link['href'], headers=headers, timeout=8)
            if r_get_url.status_code == 200:
                # 'scheme://host/path'.split('/')[2] is the host
                url_tmp = r_get_url.url.split('/')
                print(url_tmp[2])


def main(keyword, thread_count):
    url_queue = queue.Queue()
    # queue the first five result pages: pn = 0, 10, 20, 30, 40
    for i in range(0, 50, 10):
        url_queue.put('https://www.baidu.com/s?wd=%s&pn=%s' % (keyword, str(i)))
    threads = []
    thread_count = int(thread_count)
    for i in range(thread_count):
        threads.append(BaiduSpider(url_queue))
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == '__main__':
    if len(sys.argv) != 3:
        print('usage: %s keyword thread_count' % sys.argv[0])
        sys.exit(1)
    else:
        main(sys.argv[1], sys.argv[2])
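
The two string operations in the script (building the paged search URLs and cutting the host out of the final URL) can be tried on their own without touching the network. Below is a minimal sketch using only the standard library. The helper names build_search_urls and host_of and the parameters pages and page_size are made up for this illustration and are not part of the script above; percent-encoding the keyword with quote() is also an addition, on the assumption that keywords may contain spaces or characters such as "&".

```python
from urllib.parse import quote, urlsplit


def build_search_urls(keyword, pages=5, page_size=10):
    """Return one Baidu search URL per result page (hypothetical helper).

    The keyword is percent-encoded so that spaces or '&' cannot break
    the query string; pn is the result offset (0, 10, 20, ...).
    """
    encoded = quote(keyword)
    return [
        "https://www.baidu.com/s?wd=%s&pn=%d" % (encoded, offset)
        for offset in range(0, pages * page_size, page_size)
    ]


def host_of(url):
    """Return the host part of a URL, like url.split('/')[2] in the script."""
    return urlsplit(url).netloc


if __name__ == "__main__":
    for search_url in build_search_urls("python threading", pages=2):
        print(search_url)
    print(host_of("http://www.cnblogs.com/arongmh/p/7428151.html"))
```

With the defaults, build_search_urls produces the same five offsets as range(0, 50, 10) in main().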

It still feels a bit slow. I'll leave optimization for later, once I've learned a bit more, haha.
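
One possible direction, offered only as a guess at where the time goes: each worker follows the result links of its page one after another, so the slowest link holds up the rest of that page. The sketch below resolves a list of links concurrently with a thread pool from the standard library and removes duplicate hosts. The names resolve_hosts and fetch_final_url are hypothetical, and the fetching function is passed in as a parameter so the sketch can run with a stand-in instead of real requests.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def resolve_hosts(links, fetch_final_url, max_workers=8):
    """Resolve links concurrently and return the sorted, de-duplicated hosts.

    fetch_final_url is any callable that takes a link and returns the
    final URL after redirects (hypothetical; supplied by the caller).
    """
    def host_or_none(link):
        try:
            return urlsplit(fetch_final_url(link)).netloc
        except Exception:
            # a failed link is skipped instead of stopping the whole batch
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        hosts = pool.map(host_or_none, links)
        return sorted({host for host in hosts if host})


if __name__ == "__main__":
    # stand-in fetcher so the example runs offline
    fake_targets = {
        "link-1": "http://www.example.com/a",
        "link-2": "https://example.org/",
        "link-3": "http://www.example.com/b",
    }
    print(resolve_hosts(list(fake_targets), fake_targets.__getitem__))
```

In the real script the fetcher could be something like lambda link: requests.get(link, headers=headers, timeout=8).url, which matches what spider() already does for each link. Whether this actually helps would need measuring.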

Original post: http://www.cnblogs.com/arongmh/p/7428151.html
