[Multithreading] Querying Amazon Book Rankings

Date: 2018-05-08 16:36:47

Tags: web scraping, data, done, cts, amazon, beautiful, concurrency, %s, query

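Version: Python 3.6

Libraries: atexit, re, threading, time, urllib3, bs4

Amazon has anti-scraping measures, so the request headers must contain at least one field. This example sends a User-Agent, but the request still fails fairly often and has to be retried.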
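Since the request has to be repeated when it fails, a small retry wrapper can help. The following is only a minimal sketch: `fetch_with_retry`, its `attempts` and `delay` parameters, and the `flaky_fetch` stand-in are hypothetical names that are not part of the original script, and the demo deliberately avoids any network access.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=1.0):
    """Call fetch() until it returns without raising, at most `attempts` times."""
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # e.g. IndexError when the rank is missing from the page
            last_error = exc
            if attempt < attempts:
                time.sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error


# Hypothetical stand-in for a flaky request: fails twice, then succeeds.
calls = {'count': 0}


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise IndexError('rank not found in page')
    return '674,874'


rank = fetch_with_retry(flaky_fetch, attempts=5, delay=0.01)
print(rank, 'after', calls['count'], 'calls')
```

With the script in this post, the same wrapper could presumably be used as `fetch_with_retry(lambda: httpget(isbn))`, because `httpget` raises `IndexError` when the returned page contains no ranking.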

# -*- coding: utf-8 -*-
# created by Zhang Q.L. on 2018/5/7 0007
from atexit import register
from re import compile
from threading import Thread
from time import ctime
import urllib3
import bs4

# Minimal User-Agent header that is actually sent with each request
header = {
    'User-Agent': 'AppleWebKit/537.36 (KHTML, like Gecko)'
}
# Full browser User-Agent, kept as a sample (not used below)
headerSample = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
# Raw string so that \d is passed to the regex engine unchanged
REGEX = compile(r'#([\d,]+) in Books')
url = 'https://item.jd.com/7081550.html'   # not used below
urltest = 'https://www.amazon.com//dp/'
urltest2 = 'https://www.amazon.com//dp/0132269937'
ISBNs = {
    '0132269937': 'Core Python Programming',
    '0132356139': 'Python Web Development with Django',
    '0137143419': 'Python Fundamentals',
}

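def httpget(isbn):
    http = urllib3.PoolManager()   # create a PoolManager instance first
    urllib3.disable_warnings()     # suppress HTTPS certificate warnings
    # page = http.request('GET', '%s' % urltest2, headers=header)   # send the GET request
    page = http.request('GET', '%s%s' % (urltest, isbn), headers=header)   # send the GET request
    print(page.status)        # status code returned by the server
    # print(page.data)          # response body (raw bytes of the HTML page)
    # print(page.data.decode())  # decode with the default 'utf-8' encoding
    res = bs4.BeautifulSoup(page.data, 'lxml')  # parse the page with the lxml parser
    res = str(res)
    # print(res)
    # Raises IndexError if the page contains no ranking (e.g. the request was blocked)
    return REGEX.findall(res)[0]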

def _showRanking(isbn):
    print('- %r ranked %s' % (ISBNs[isbn], httpget(isbn)))


def _main():
    print('At', ctime(), 'on Amazon...')
    # Start one thread per ISBN so the requests run concurrently
    for isbn in ISBNs:
        Thread(target=_showRanking, args=(isbn,)).start()

# Runs at interpreter exit, after all non-daemon threads have finished
@register
def _atexit():
    print('all DONE at:', ctime())

if __name__ == '__main__':
    _main()
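The ranking is pulled out of the page text with the regular expression `#([\d,]+) in Books`. As a small offline illustration of that step, the sketch below applies the same pattern to a made-up line of text and returns `None` instead of raising when nothing matches. `parse_rank` and the `sample` string are hypothetical and do not come from a real Amazon page.

```python
import re

# Same pattern as REGEX in the script, written as a raw string
RANK_RE = re.compile(r'#([\d,]+) in Books')


def parse_rank(text):
    """Return the rank as an int, or None if the text contains no ranking."""
    match = RANK_RE.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))  # '674,874' -> 674874


# Hypothetical sample text, shaped like the line the script looks for
sample = 'Best Sellers Rank: #674,874 in Books (See Top 100 in Books)'
print(parse_rank(sample))          # a page that contains a ranking
print(parse_rank('Robot Check'))   # a page without a ranking
```

Returning `None` is just one possible design choice; the original script indexes `findall(...)[0]` directly, so a missing ranking surfaces as an `IndexError` in the worker thread.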

Output:

D:\装机软件\python3.6\python3.exe C:/Users/Administrator/PycharmProjects/Python核心编程/多线程编程/amazon-nothread.py
At Tue May  8 15:10:44 2018 on Amazon...
200
200
200
- 'Python Fundamentals' ranked 4,517,952
- 'Python Web Development with Django' ranked 1,243,459
- 'Core Python Programming' ranked 674,874
all DONE at: Tue May  8 15:10:50 2018

Process finished with exit code 0

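Compared with the version of the program that does not use threads, there are two main differences:

1. Because the requests are handled concurrently, their network waits overlap and the total running time is shorter.

2. With threads, the results are printed in the order in which the requests complete, whereas the single-threaded version prints them in the order in which the ISBNs are iterated, that is, the order of the dictionary keys.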
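Both differences can be reproduced without touching the network. The sketch below is an assumption-laden illustration, not the original program: `fake_lookup` replaces the HTTP request with `time.sleep`, and the names and delays in `DELAYS` are invented so that the slowest task comes first in the dictionary.

```python
import threading
import time

# Hypothetical tasks and simulated response times in seconds
DELAYS = {'A': 0.3, 'B': 0.1, 'C': 0.2}


def fake_lookup(name, finished, lock):
    time.sleep(DELAYS[name])   # stands in for waiting on the server
    with lock:
        finished.append(name)  # record the completion order


def run_sequential():
    finished, lock = [], threading.Lock()
    start = time.perf_counter()
    for name in DELAYS:
        fake_lookup(name, finished, lock)
    return finished, time.perf_counter() - start


def run_threaded():
    finished, lock = [], threading.Lock()
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_lookup, args=(name, finished, lock))
               for name in DELAYS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()               # wait for every worker before stopping the clock
    return finished, time.perf_counter() - start


seq_order, seq_time = run_sequential()
thr_order, thr_time = run_threaded()
print('sequential:', seq_order, round(seq_time, 2))
print('threaded:  ', thr_order, round(thr_time, 2))
```

The sequential run takes roughly the sum of the delays and reports the tasks in dictionary order, while the threaded run takes roughly as long as the slowest task and reports them as they finish.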

Original article: https://www.cnblogs.com/auqarius/p/9008408.html