功能描述
- 目标:获取淘宝搜索页面的信息,提取其中名称和价格
- 理解:淘宝搜索接口 翻页的处理
- 技术路线:requests+re
搜索"书包",浏览器起始页链接:
https://s.taobao.com/search?q=书包&commend=all&ssid=s5e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856taobaoitem.1&ie=utf8&initiative_id=tbindexz_20170306
第二页:
https://s.taobao.com/search?q=书包&commend=all&ssid=s5e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856taobaoitem.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=4&ntoffset=4&p4ppushleft=1%2C48&s=44
每一页有44个商品。
程序的结构
- 步骤1 提交商品搜索请求,循环获取页面
- 步骤2 对于每个页面,提取商品名称和价格信息
- 步骤3 将信息输出在屏幕上
1 import requests 2 import re 3 def getHTMLText(url): 4 try: 5 r = requests.get(url, timeout = 30) 6 r.raise_for_status() 7 r.encoding = r.apparent_encoding 8 return r.text 9 except: 10 return "" 11 def parsePage(goodsList, html): 12 try: 13 priseList = re.findall(r‘\"view_price\"\:\"[\d\.]*\"‘,html) 14 titleList = re.findall(r‘\"raw_title\"\:\".*?\"‘, html) 15 for i in range(len(priseList)): 16 prise = eval(priseList[i].split(‘:‘)[1]) 17 title = eval(titleList[i].split(‘:‘)[1]) 18 goodsList.append([prise, title]) 19 except: 20 print("") 21 def printGoodsList(goodsList): 22 tplt = "{:4}\t{:8}\t{:16}" 23 print(tplt.format("序号", "价格", "商品名称")) 24 count = 0 25 for g in goodsList: 26 count = count + 1 27 print(tplt.format(count, g[0], g[1])) 28 def main(): 29 goods = ‘书包‘ 30 depth = 3 31 stard_url = ‘https://s.taobao.com/search?q=‘ + goods 32 infoList = [] 33 for i in range(depth): 34 try: 35 url = stard_url + ‘&s‘ + str(44 * i) 36 html = getHTMLText(url) 37 parsePage(infoList, html) 38 except: 39 continue 40 printGoodsList(infoList) 41 main()
输出(部分):