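A process is a running instance of a program: running one script file starts one process, and using multiple processes lets several programs run at the same time.
Using multiple processes can improve crawler efficiency.
Analogy: factory ==> workshop ==> worker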
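To make the idea of separate processes concrete, here is a minimal sketch that starts a few child processes and has each one report its process ID. The helper names `report_pid` and `collect_child_pids` are made up for this illustration and are not from the original post.

```python
import os
from multiprocessing import Process, Queue


def report_pid(queue):
    # Runs inside a child process and sends that process's ID to the parent.
    queue.put(os.getpid())


def collect_child_pids(count):
    # Start `count` child processes and gather the PID each one reports.
    queue = Queue()
    workers = [Process(target=report_pid, args=(queue,)) for _ in range(count)]
    for worker in workers:
        worker.start()
    # Read the results before joining so the children can flush the queue.
    pids = [queue.get() for _ in workers]
    for worker in workers:
        worker.join()
    return pids


if __name__ == "__main__":
    print("parent PID:", os.getpid())
    print("child PIDs:", collect_child_pids(3))
```

Each child PID should differ from the parent PID, which shows that every child is a separate running program.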
    from multiprocessing import Pool
    pool = Pool(processes=4)
    pool.map(func, iterable)
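As a small, network-free illustration of the `Pool.map` pattern, the sketch below maps a trivial function over a range of numbers. The functions `square` and `parallel_squares` are placeholders invented for this example; in a crawler, the mapped function would fetch and parse one URL.

```python
from multiprocessing import Pool


def square(n):
    # Stand-in for real work. It is defined at module level so that
    # worker processes can look it up by name.
    return n * n


def parallel_squares(numbers, processes=4):
    # Pool.map distributes the items across the worker processes and
    # returns the results as a list in the same order as the input.
    with Pool(processes=processes) as pool:
        return pool.map(square, numbers)


if __name__ == "__main__":
    print(parallel_squares(range(10)))
```

The `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the main module (Windows, for example); without it, each worker would try to create its own pool.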
    import re
    import time
    from multiprocessing import Pool

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
    }

    def re_scraper(url):
        # Fetch one page and extract the fields with regular expressions.
        res = requests.get(url, headers=headers)
        names = re.findall(r'<h2>(.*?)</h2>', res.text, re.S)
        contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
        # Note: laughs and comments use the same pattern, as in the source,
        # so both lists contain the same matches.
        laughs = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
        comments = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
        infos = list()
        for name, content, laugh, comment in zip(names, contents, laughs, comments):
            info = {
                'name': name,
                'content': content,
                'laugh': laugh,
                'comment': comment
            }
            infos.append(info)
        return infos

    if __name__ == "__main__":
        urls = ['https://www.qiushibaike.com/8hr/page/{}/'.format(str(i)) for i in range(1, 36)]

        # Serial baseline: fetch the pages one after another.
        start_1 = time.time()
        for url in urls:
            re_scraper(url)
        end_1 = time.time()
        print('Serial crawler time:', end_1 - start_1)

        # Same work spread across 2 worker processes.
        start_2 = time.time()
        pool = Pool(processes=2)
        pool.map(re_scraper, urls)
        end_2 = time.time()
        print('2-process crawler time:', end_2 - start_2)

        # Same work spread across 4 worker processes.
        start_3 = time.time()
        pool = Pool(processes=4)
        pool.map(re_scraper, urls)
        end_3 = time.time()
        print('4-process crawler time:', end_3 - start_3)

Output:

    [Running] python "f:\WWW\test_py\compare_test.py"
    Serial crawler time: 14.95523715019226
    2-process crawler time: 11.39123272895813
    4-process crawler time: 4.0303635597229

    [Done] exited with code=0 in 32.827 seconds
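To put the timings in perspective, the following sketch computes the speedup relative to the serial run and the per-process efficiency. The numbers are the ones reported above; the helper names `speedup` and `efficiency` are introduced here for illustration only. These figures come from a single run over the network, so they should be read as rough indications rather than stable measurements.

```python
def speedup(serial_seconds, parallel_seconds):
    # How many times faster the parallel run was than the serial run.
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, processes):
    # Speedup divided by the number of processes (1.0 would be ideal scaling).
    return speedup(serial_seconds, parallel_seconds) / processes


# Timings in seconds, copied from the run shown above.
serial = 14.95523715019226
timings = {2: 11.39123272895813, 4: 4.0303635597229}

for processes, seconds in timings.items():
    print(
        processes,
        "processes: speedup",
        round(speedup(serial, seconds), 2),
        "efficiency",
        round(efficiency(serial, seconds, processes), 2),
    )
```

In this run, 4 processes gave roughly a 3.7x speedup, while 2 processes gave only about 1.3x, which suggests that network variability played a noticeable part in the measurements.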
Original article: https://www.cnblogs.com/xuxaut-558/p/10166642.html