码迷,mamicode.com
首页 > 编程语言 > 详细

Python 爬虫知识点 - 淘宝商品检索结果抓包分析(续二)

时间:2016-12-22 00:14:30      阅读:834      评论:0      收藏:0      [点我收藏+]

标签:rom   try   iat   push   nbsp   offset   查询   ajax   safari   

一、URL分析

  通过对“Python机器学习”结果抓包分析,有两个无规律的参数:_ksTS和callback。通过构建如下URL可以获得目标关键词的检索结果,如下所示:

https://s.taobao.com/search?data-key=s&data-value=44&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=0

https://s.taobao.com/search?data-key=s&data-value=88&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=44

https://s.taobao.com/search?data-key=s&data-value=132&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=88

https://s.taobao.com/search?data-key=s&data-value=176&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=132


https://s.taobao.com/search?data-key=s&data-value=220&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=176


https://s.taobao.com/search?data-key=s&data-value=264&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=220


https://s.taobao.com/search?data-key=s&data-value=308&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=264


https://s.taobao.com/search?data-key=s&data-value=352&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=308

 

二、关键字分析

 

1、q查询关键词

2、data-value显示记录数

3、s上一页记录数

4、s与data-value的差值即当页显示数量

 

三、Python抓取数据

 

#__author__ = ‘Joker‘
# -*- coding:utf-8 -*-

import re
import urllib.request
keyWord1 = "Python机器学习"
keyWord2 = urllib.request.quote(keyWord1)
headers = ("User-Agent","MMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.1708.400 QQBrowser/9.5.9635.400")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)
for j in range(1,25):
try:
curPage = 44
prePage = 0
url = "https://s.taobao.com/search?data-key=s&data-value=" + str(
curPage) + "&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=" + keyWord2 + "&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=" + str(
prePage)
data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
patTitle = ‘"title":"(.*?)","raw_title"‘
titles = re.compile(patTitle).findall(data)
patRawTitle = ‘"raw_title":"(.*?)"‘
rawTitles = re.compile(patRawTitle).findall(data)
patImage = ‘"pic_url":"//(.*?)","‘
rawImages = re.compile(patImage).findall(data)
patPrice = ‘"view_price":"(.*?)","‘
rawPrices = re.compile(patPrice).findall(data)
patNick = ‘"nick"(.*?)","‘
rawNicks = re.compile(patNick).findall(data)
for i in range(0,len(titles)):
print("-------------------")
print("第" + str(j+1) + "页,第" + str(i+1) + "本" )
#print(titles[i])
print(rawTitles[i])
print(rawImages[i])
print(rawPrices[i])
print(rawNicks[i])
print("-------------------")
prePage = 44 * j
curPage = 44 + prePage
except urllib.error.URLError as e:
if hasattr(e,"code"):
print(e.code)
if hasattr(e,"reason"):
print(e.reason)
except Exception as e:
print(e)

 

Python 爬虫知识点 - 淘宝商品检索结果抓包分析(续二)

标签:rom   try   iat   push   nbsp   offset   查询   ajax   safari   

原文地址:http://www.cnblogs.com/defineconst/p/6209274.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!