Threads and Processes: Multithreading in Python

Date: 2018-03-03 18:17:35

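First, these two concepts need to be told apart.

Process: a single run of a program with an independent function over some set of data. It is, first, an entity in its own right and, second, a "program in execution".

Thread: a unit of execution inside a process; one process can contain multiple threads. The thread is the basic unit of CPU scheduling.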
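To make the relationship concrete, here is a minimal sketch that starts three threads and has each one record the process it runs in. The names `records`, `report` and `worker-N` are made up for this illustration.

```python
import os
import threading

records = []                      # shared list, visible to every thread
records_lock = threading.Lock()

def report():
    # Each thread records the process it runs in and its own name.
    with records_lock:
        records.append((os.getpid(), threading.current_thread().name))

threads = [threading.Thread(target=report, name=f"worker-{i}") for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for pid, name in records:
    print(f"process {pid} ran thread {name}")
```

Every line printed shows the same process id with a different thread name: the threads are separate units of execution living inside one process.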

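A process's memory space is shared by all of its threads. Because of this, while one thread is modifying shared data, any other thread that needs the same data must wait; a "lock" keeps several threads from changing it at the same time.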
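A small sketch of that idea, assuming a shared counter that four threads increment. `counter`, `counter_lock` and `add_many` are illustrative names, not part of the crawler further down.

```python
import threading

counter = 0                       # shared by every thread in the process
counter_lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        # Only one thread at a time may run the read-modify-write below.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

`counter += 1` is a read, an add and a write, so without the lock two threads could interleave those steps and lose an update. Holding the lock around it makes the final value deterministic.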

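How is data protected when different threads access it at the same time? This is where Python's own "lock" comes in: the GIL (Global Interpreter Lock) of CPython, the standard interpreter. To make use of multicore systems, Python has to support multithreading, and the interpreter has to be both safe and efficient while doing so. Multithreaded programming has well-known pitfalls: the interpreter must keep different threads from corrupting its internally shared data, while still giving user threads as much computing resource as possible. The GIL is the compromise: only one thread executes Python bytecode at any given moment. It removes those problems, but at the cost of Python's ability to run threads in parallel on several cores.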
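The effect of the GIL can be observed with a rough timing experiment. The sketch below is only an illustration: `count_down` and `N` are made-up names, and the numbers depend on the machine and the interpreter build, so no particular output is claimed.

```python
import threading
import time

def count_down(n):
    # Pure-Python loop: CPU-bound, holds the GIL while it runs.
    while n > 0:
        n -= 1

N = 2_000_000

# Run the work twice, one call after the other.
start = time.perf_counter()
count_down(N)
count_down(N)
sequential = time.perf_counter() - start

# Run the same two calls in two threads.
start = time.perf_counter()
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.3f}s, two threads: {threaded:.3f}s")
```

On a standard CPython build with the GIL, the two-thread run is usually no faster than the sequential one, because the threads take turns instead of running on two cores at once.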

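Python multithreading suits I/O-bound work with many blocking operations, because a thread releases the GIL while it waits on I/O.

Python multiprocessing suits CPU-bound work with heavy parallel computation, because each process has its own interpreter and its own GIL.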
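The I/O-bound case can be sketched without touching the network. Here `fake_download` is a hypothetical stand-in for a real request, with `time.sleep` playing the role of waiting on a socket.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(page):
    # Sleeping releases the GIL, much like waiting for a network response.
    time.sleep(0.2)
    return f"page-{page}"

pages = list(range(1, 6))

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() returns results in the order of the inputs.
    results = list(pool.map(fake_download, pages))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Five waits of 0.2 s would take about one second back to back. With five threads they overlap, so the elapsed time should be close to a single wait.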

Although Python's multithreading can look like it is of little use, it can still improve efficiency in a web crawler.

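import requests
import threading    # thread library
from queue import Queue
from lxml import etree
import json
import time


class ThreadCrawl(threading.Thread):
    def __init__(self, threadName, pageQueue, dataQueue):
        # Call the parent class initializer
        threading.Thread.__init__(self)
        # Equivalent: super(ThreadCrawl, self).__init__()
        self.threadName = threadName
        self.pageQueue = pageQueue
        self.dataQueue = dataQueue
        self.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"}

    def run(self):
        # This method is abridged in the original post; only its last lines
        # survive there. The loop and the queue read are inferred from the
        # grasp_exit flag and main(); the HTTP request is still left out, so
        # content stays undefined until that step is filled in.
        while not grasp_exit:
            try:
                # Non-blocking get: raises queue.Empty once no pages remain
                page = self.pageQueue.get(False)
                # (omitted) request the page with requests and self.headers,
                # storing the response text in content
                self.dataQueue.put(content)
            except Exception:
                pass
        print("Finished " + self.threadName)
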
class ThreadParse(threading.Thread):
    def __init__(self, threadName, dataQueue, filename, lock):
        super(ThreadParse, self).__init__()
        self.threadName = threadName
        self.dataQueue = dataQueue
        self.filename = filename    # an open file object, despite the name
        self.lock = lock

    def run(self):
        # Body omitted in the original post
        pass

    def parse(self, html):
        # (omitted in the original post) extract items from html
        # Write under the lock so output from different threads does not mix.
        # json.dumps returns a str, which a text-mode file accepts directly.
        with self.lock:
            self.filename.write(json.dumps(items, ensure_ascii=False) + "\n")


grasp_exit = False
parse_exit = False


def main():
    global grasp_exit, parse_exit

    # Page-number queue
    pageQueue = Queue(20)
    # Put in the numbers 1 to 20; the queue is first in, first out
    for i in range(1, 21):
        pageQueue.put(i)

    # Queue of crawl results; no argument means no size limit
    dataQueue = Queue()

    # utf-8 so that text written with ensure_ascii=False is stored correctly
    filename = open("lagou.json", "a", encoding="utf-8")

    # Create the lock
    lock = threading.Lock()

    # Crawl threads
    graspList = ["crawl thread 1", "crawl thread 2", "crawl thread 3"]
    # Keep track of the started threads
    threadcrawl = []
    for threadName in graspList:
        thread = ThreadCrawl(threadName, pageQueue, dataQueue)
        thread.start()
        threadcrawl.append(thread)

    # Parse threads
    parseList = ["parse thread 1", "parse thread 2", "parse thread 3"]
    # Keep track of the started threads
    threadparse = []
    for threadName in parseList:
        thread = ThreadParse(threadName, dataQueue, filename, lock)
        thread.start()
        threadparse.append(thread)

    # Wait until every page number has been taken; sleep instead of spinning
    while not pageQueue.empty():
        time.sleep(0.1)

    grasp_exit = True
    print("pageQueue is empty")

    for thread in threadcrawl:
        thread.join()

    # Wait until every fetched page has been taken by a parse thread
    while not dataQueue.empty():
        time.sleep(0.1)

    parse_exit = True

    for thread in threadparse:
        thread.join()
    with lock:
        filename.close()


if __name__ == "__main__":
    main()

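The code above is a multithreaded crawler written with Lagou (拉勾网) as the example. The listing is incomplete; see my GitHub for the full code. The result looks like this:

[Image: output of the crawler run]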
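Since the listing is abridged, the overall pattern (crawl threads feeding parse threads through queues) may be easier to follow in a self-contained sketch. This is not the author's code. `fake_fetch` is a hypothetical stand-in for the real HTTP request, results go to a list instead of `lagou.json`, and shutdown uses a sentinel value (`STOP`) on each queue rather than the global exit flags and polling loops of the listing.

```python
import json
import threading
from queue import Queue

STOP = None  # sentinel telling a worker to exit (an assumption of this sketch)

def fake_fetch(page):
    # Hypothetical stand-in for an HTTP request; no network involved.
    return f"<html>page {page}</html>"

def crawl(page_queue, data_queue):
    while True:
        page = page_queue.get()          # blocks until an item is available
        if page is STOP:
            break
        data_queue.put(fake_fetch(page))

def parse(data_queue, results, lock):
    while True:
        html = data_queue.get()
        if html is STOP:
            break
        item = {"length": len(html), "html": html}
        with lock:                       # one writer at a time
            results.append(json.dumps(item, ensure_ascii=False))

def run_pipeline(pages, n_crawlers=3, n_parsers=3):
    page_queue, data_queue = Queue(), Queue()
    results, lock = [], threading.Lock()
    crawlers = [threading.Thread(target=crawl, args=(page_queue, data_queue))
                for _ in range(n_crawlers)]
    parsers = [threading.Thread(target=parse, args=(data_queue, results, lock))
               for _ in range(n_parsers)]
    for t in crawlers + parsers:
        t.start()
    for page in pages:
        page_queue.put(page)
    for _ in crawlers:
        page_queue.put(STOP)             # one sentinel per crawl thread
    for t in crawlers:
        t.join()                         # every page is now in data_queue
    for _ in parsers:
        data_queue.put(STOP)             # one sentinel per parse thread
    for t in parsers:
        t.join()
    return results

lines = run_pipeline(range(1, 21))
print(len(lines))  # 20
```

Blocking `get()` calls plus sentinels let each worker sleep until there is work and exit cleanly, so the main thread does not have to poll `empty()` in a loop.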

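The speedup that multithreading can deliver is limited; later on I will use an asynchronous networking framework such as scrapy to make the crawler more efficient.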


Original article: https://www.cnblogs.com/CCColby/p/8493927.html
