Multiprocess Crawler

Date: 2018-12-24 10:29:16


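Introduction to multiprocessing

A process is a running instance of a program, for example one script file being executed. Multiprocessing means running several such processes at the same time.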

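Why learn multiprocessing

To improve crawler efficiency: several pages can be fetched and parsed at once instead of one after another.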

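Difference between multiprocessing and multithreading

Factory ==> workshop ==> worker

If the computer is the factory, each process is a workshop and each thread is a worker inside a workshop. Workers in the same workshop share its space (memory), while separate workshops are isolated from one another.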

How to use multiprocessing

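from multiprocessing import Pool
pool = Pool(processes=4)     # create a pool of 4 worker processes
pool.map(func, iterable)     # call func on every item of iterable, spread across the workers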
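In the snippet above, `func` and `iterable` are placeholders. A minimal runnable version might look like the following, where `square` and `parallel_squares` are hypothetical names chosen for illustration. The work function is defined at module top level so that worker processes can locate it, and the pool is created under the `if __name__ == "__main__":` guard, which is required on platforms that start workers by re-importing the script (Windows, for example).

```python
from multiprocessing import Pool


def square(n):
    """Work function executed inside a worker process."""
    return n * n


def parallel_squares(numbers, workers=4):
    # The with-block shuts the pool down once map() has returned.
    with Pool(processes=workers) as pool:
        # map() blocks until all items are done and keeps the input order.
        return pool.map(square, numbers)


if __name__ == "__main__":
    print(parallel_squares(range(5)))  # [0, 1, 4, 9, 16]
```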
 

Performance comparison

URL to crawl: https://www.qiushibaike.com/8hr/page/1/

import re
import time
from multiprocessing import Pool

import requests

# Send a browser-like User-Agent with every request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
}

def re_scraper(url):
    """Fetch one listing page and extract its posts with regular expressions."""
    res = requests.get(url, headers=headers)
    names = re.findall(r'<h2>(.*?)</h2>', res.text, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    # Note: the next two patterns are identical, so both lists hold the same
    # matches; telling laughs from comments needs more of the surrounding markup.
    laughs = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
    comments = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
    infos = []
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        info = {
            'name': name,
            'content': content,
            'laugh': laugh,
            'comment': comment
        }
        infos.append(info)
    return infos

if __name__ == "__main__":
    urls = ['https://www.qiushibaike.com/8hr/page/{}/'.format(i) for i in range(1, 36)]

    # Baseline: fetch the 35 pages one after another.
    start_1 = time.time()
    for url in urls:
        re_scraper(url)
    end_1 = time.time()
    print('Serial crawl time:', end_1 - start_1)

    # Same pages, spread over 2 worker processes.
    start_2 = time.time()
    with Pool(processes=2) as pool:
        pool.map(re_scraper, urls)
    end_2 = time.time()
    print('2-process crawl time:', end_2 - start_2)

    # Same pages, spread over 4 worker processes.
    start_3 = time.time()
    with Pool(processes=4) as pool:
        pool.map(re_scraper, urls)
    end_3 = time.time()
    print('4-process crawl time:', end_3 - start_3)

Output:

[Running] python "f:\WWW\test_py\compare_test.py"
Serial crawl time: 14.95523715019226
2-process crawl time: 11.39123272895813
4-process crawl time: 4.0303635597229

[Done] exited with code=0 in 32.827 seconds
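To put the three timings in proportion, the speedup over the serial run can be computed directly. The short sketch below uses only the numbers reported above; the helper `speedup` is a name introduced here for illustration. These figures come from a single run against a live site, so they should be read as indicative rather than as a stable benchmark.

```python
# Timings copied from the run above, in seconds.
serial = 14.95523715019226
two_proc = 11.39123272895813
four_proc = 4.0303635597229


def speedup(baseline, measured):
    """How many times faster the measured run is than the baseline."""
    return baseline / measured


print(f"2 processes: {speedup(serial, two_proc):.2f}x")   # 2 processes: 1.31x
print(f"4 processes: {speedup(serial, four_proc):.2f}x")  # 4 processes: 3.71x
```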


Original post: https://www.cnblogs.com/xuxaut-558/p/10166642.html
