Tags: crawler, data, done, cts, amazon, beautiful, concurrency, %s, query
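Version: Python 3.6
Libraries: atexit, re, threading, time, urllib3, bs4
Amazon has anti-crawler measures, so the request headers must carry at least one field; this example sets a User-Agent. Even so, requests still fail fairly often and have to be retried.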
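The repeated attempts mentioned above can be automated with a small retry loop. What follows is only a sketch: `fetch_with_retry`, its parameters, and the delay values are hypothetical names chosen for illustration and are not part of the original script, and the stand-in `flaky` function simulates an unreliable request instead of contacting Amazon.

```python
import time


def fetch_with_retry(fetch, attempts=5, delay=0.01, backoff=2.0):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable that raises on failure
    (hypothetical helper, not part of the original script).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # real code should catch narrower exception types
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)   # wait before the next try
                delay *= backoff    # wait longer after each failure
    raise RuntimeError('all %d attempts failed' % attempts) from last_error


# Stand-in for a request that fails twice before succeeding.
calls = {'count': 0}


def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise IndexError('rank not found in page')
    return '674,874'


result = fetch_with_retry(flaky)
print(result, 'after', calls['count'], 'calls')
```

In the script below, a call such as `fetch_with_retry(lambda: httpget(isbn))` would presumably fit, since `httpget` raises `IndexError` when the rank is missing from the returned page. Whether a short backoff is enough to get past Amazon's checks is not something this sketch can promise.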
# _*_coding:utf-8_*_
# created by Zhang Q.L.on 2018/5/7 0007
from atexit import register
from re import compile
from threading import Thread
from time import ctime
import urllib3
import bs4

header = {
    'User-Agent': 'AppleWebKit/537.36 (KHTML, like Gecko)'
}
headerSample = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
REGEX = compile(r'#([\d,]+) in Books')
url = 'https://item.jd.com/7081550.html'
urltest = 'https://www.amazon.com//dp/'
urltest2 = 'https://www.amazon.com//dp/0132269937'
ISBNs = {
    '0132269937': 'Core Python Programming',
    '0132356139': 'Python Web Development with Django',
    '0137143419': 'Python Fundamentals',
}

def httpget(isbn):
    http = urllib3.PoolManager()  # create a PoolManager instance first
    urllib3.disable_warnings()  # suppress warnings about unverified HTTPS certificates
    # page = http.request('GET', '%s' % urltest2, headers=header)  # send a GET request
    page = http.request('GET', '%s%s' % (urltest, isbn), headers=header)  # send a GET request
    print(page.status)  # status code returned by the server
    # print(page.data)  # data returned by the server (raw bytes of the page)
    # print(page.data.decode())  # decode with the default 'utf-8' encoding
    res = bs4.BeautifulSoup(page.data, 'lxml')  # parse the page with the lxml parser
    res = str(res)
    # print(res)
    return REGEX.findall(res)[0]  # raises IndexError if the page contains no rank

def _showRanking(isbn):
    print('- %r ranked %s' % (ISBNs[isbn], httpget(isbn)))

def _main():
    print('At', ctime(), 'on Amazon...')
    for isbn in ISBNs:
        # one thread per ISBN, so the requests run concurrently
        Thread(target=_showRanking, args=(isbn,)).start()

@register
def _atexit():
    print('all DONE at:', ctime())

if __name__ == '__main__':
    _main()
Output:
D:\装机软件\python3.6\python3.exe C:/Users/Administrator/PycharmProjects/Python核心编程/多线程编程/amazon-nothread.py
At Tue May 8 15:10:44 2018 on Amazon...
200
200
200
- 'Python Fundamentals' ranked 4,517,952
- 'Python Web Development with Django' ranked 1,243,459
- 'Core Python Programming' ranked 674,874
all DONE at: Tue May 8 15:10:50 2018
Process finished with exit code 0
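Compared with the version that does not use threads, there are two main differences:
1. Because the requests are handled concurrently, the total running time is shorter;
2. With threads, the results are printed in the order in which the requests complete, whereas the single-threaded version prints them in the order the ISBNs dictionary is iterated, which is determined by its keys.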
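Both differences can be reproduced without touching the network. The sketch below is an illustration under assumptions: `fake_lookup` and the `DELAYS` table are hypothetical stand-ins, with made-up sleep times in place of real response times, and unlike the script above it joins the threads explicitly so that the elapsed time can be measured.

```python
import threading
import time

# Simulated response times in seconds (made-up values, not measurements).
DELAYS = {
    '0132269937': 0.4,
    '0132356139': 0.2,
    '0137143419': 0.05,
}

finished = []              # ISBNs in the order their "requests" complete
lock = threading.Lock()    # protects the shared list


def fake_lookup(isbn):
    time.sleep(DELAYS[isbn])   # stands in for the network round trip
    with lock:
        finished.append(isbn)


start = time.time()
threads = [threading.Thread(target=fake_lookup, args=(isbn,)) for isbn in DELAYS]
for t in threads:
    t.start()
for t in threads:
    t.join()                   # wait for every thread before reporting
elapsed = time.time() - start

print('dictionary order:', list(DELAYS))
print('completion order:', finished)
print('elapsed: %.2fs (sum of delays: %.2fs)' % (elapsed, sum(DELAYS.values())))
```

The threads sleep in parallel, so the elapsed time should be close to the longest single delay and not the sum of all three, and the completion order follows the delays, not the dictionary.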
Original article: https://www.cnblogs.com/auqarius/p/9008408.html