
Crawling with a thread pool

Posted: 2018-01-17 22:00:54


import re
import time
import hashlib
from concurrent.futures import ThreadPoolExecutor

import requests  # pip3 install requests

pool = ThreadPoolExecutor(50)   # at most 50 concurrent tasks
movie_path = r'C:\mp4'          # directory where downloaded files are saved

def get_page(url):
    """Download a page and return its text; return None on any failure."""
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    except Exception:
        pass

def parse_index(index_page):
    # Callback: receives the Future returned by pool.submit(get_page, ...)
    index_page = index_page.result()
    if index_page is None:  # the download failed
        return
    urls = re.findall(r'class="items".*?href="(.*?)"', index_page, re.S)
    for detail_url in urls:
        if not detail_url.startswith('http'):
            detail_url = 'http://www.xiaohuar.com' + detail_url
        pool.submit(get_page, detail_url).add_done_callback(parse_detail)

def parse_detail(detail_page):
    detail_page = detail_page.result()
    if detail_page is None:
        return
    l = re.findall(r'id="media".*?src="(.*?)"', detail_page, re.S)
    if l:
        movie_url = l[0]
        if movie_url.endswith('mp4'):
            pool.submit(get_movie, movie_url)

def get_movie(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # Name the file with an MD5 of the timestamp plus the URL
            # so concurrent downloads never collide.
            m = hashlib.md5()
            m.update(str(time.time()).encode('utf-8'))
            m.update(url.encode('utf-8'))
            filepath = r'%s\%s.mp4' % (movie_path, m.hexdigest())
            with open(filepath, 'wb') as f:
                f.write(response.content)
                print('%s downloaded successfully' % url)
    except Exception:
        pass

def main():
    base_url = 'http://www.xiaohuar.com/list-3-{page_num}.html'
    for i in range(5):
        url = base_url.format(page_num=i)
        pool.submit(get_page, url).add_done_callback(parse_index)

if __name__ == '__main__':
    main()
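The core pattern in the crawler is chaining pipeline stages: each stage is submitted to the pool, and the next stage is attached with `add_done_callback`, which hands the callback a `Future` whose `.result()` yields the previous stage's return value. Here is a minimal, self-contained sketch of that pattern; `compute_square` and `collect` are hypothetical stand-ins for `get_page` and `parse_index`:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(2)
results = []

def compute_square(n):
    # First stage: stands in for get_page (fetch something, return a value)
    return n * n

def collect(future):
    # Second stage: a callback receives a Future, not the raw value;
    # call .result() to unwrap it, just like parse_index does.
    results.append(future.result())

futures = [pool.submit(compute_square, i) for i in range(5)]
for f in futures:
    f.add_done_callback(collect)

pool.shutdown(wait=True)  # block until all tasks (and their callbacks) finish
print(sorted(results))    # → [0, 1, 4, 9, 16]
```

Note that `list.append` is safe to call from callbacks running in worker threads, and `shutdown(wait=True)` ensures every task has completed before the results are read.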

 


Original article: https://www.cnblogs.com/ldq1996/p/8306015.html
