[Crawler] Scraping Tmall pages with Go

Date: 2019-04-14 18:00:21


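A recent requirement at work was to scrape product information from Tmall. The overall flow is as follows:

First, modify the code of the backend ad exchange platform so that it parses a URL out of the creatives uploaded by Alibaba. The URL has the following format: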

https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D

The URL is clearly percent-encoded, so the first thing to do is decode it. An online decoder is available at:

http://tool.chinaz.com/Tools/urlencode.aspx

After decoding, we get:

https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content={"items":[{"images":["https://asearch.alicdn.com/bao/uploaded//i4/22356367/TB2PMQinN6I8KJjy0FgXXXXzVXa_!!0-saturn_solar.jpg"],"itemid":"7664169349","shorttitle":"乒乓球拍 无线专属"}]}

All we need from it is "itemid":"7664169349".

Visiting https://detail.tmall.com/item.htm?id=7664169349 then opens the following page:

[Image]

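This is the page we need to scrape. The ad exchange platform puts the parsed itemid into nsq; the crawler builds the page URL from it, scrapes the key information, and sends that information to Kafka. Hive and ES then consume it from Kafka and serve queries on it.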
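As a rough illustration of the crawler's entry point, the sketch below decodes one queue message and builds the page URL from it. It assumes the message body is JSON with an item_id field, matching the Msg struct in the full code further down; the queueMsg type and the itemURL helper are hypothetical names, and the nsq consumer itself is left out.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// queueMsg is a hypothetical, trimmed-down message that carries only the item ID.
type queueMsg struct {
    ItemID int64 `json:"item_id"`
}

// itemURL decodes a JSON message body and returns the Tmall detail page URL.
func itemURL(body []byte) (string, error) {
    var m queueMsg
    if err := json.Unmarshal(body, &m); err != nil {
        return "", err
    }
    if m.ItemID <= 0 {
        return "", fmt.Errorf("message has no valid item_id: %s", body)
    }
    return fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", m.ItemID), nil
}

func main() {
    u, err := itemURL([]byte(`{"item_id": 7664169349}`))
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println(u) // prints https://detail.tmall.com/item.htm?id=7664169349
}
```

Rejecting a missing or non-positive ID early keeps the crawler from requesting a page that cannot exist.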

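Step 1

The first step is to extract the ItemId. The ad exchange platform gives us the URL to parse; the code then decodes the URL and extracts the ItemId value. The project is written in Golang, so the example uses Golang. Python would be even simpler, and the principle is the same.

For URL parsing, see: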

https://gobyexample.com/url-parsing

For JSON serialization and deserialization, see:

https://www.cnblogs.com/liang1101/p/6741262.html

Here is my code:

package main

import (
    "encoding/json"
    "fmt"
    "net/url"
    "strconv"
)
// Struct fields must start with an uppercase letter so encoding/json can set them.
type item struct {
    Images     []string `json:"images"`
    ItemId     string   `json:"itemid"`
    ShortTitle string   `json:"shorttitle"`
}

func main() {
    var urlstring string = "https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D"
    // Parse the still-encoded URL; Query() percent-decodes each parameter exactly once.
    // Unescaping the whole URL before parsing can break on values containing & or %.
    parse, err := url.Parse(urlstring)
    if err != nil {
        fmt.Println("err is", err)
        return
    }
    fmt.Println(parse.RawQuery)
    content := parse.Query().Get("content")
    fmt.Printf("%T, %v\n", content, content)
    m := make(map[string][]item)
    if err := json.Unmarshal([]byte(content), &m); err != nil {
        fmt.Println("err is", err)
        return
    }
    fmt.Println("m:", m)
    if len(m["items"]) == 0 {
        fmt.Println("no items in content")
        return
    }
    itemValue := m["items"][0]
    fmt.Println(itemValue.ItemId)
    // Convert to int64
    i, err := strconv.ParseInt(itemValue.ItemId, 10, 64)
    if err != nil {
        fmt.Println("err is", err)
        return
    }
    fmt.Printf("%T, %v\n", i, i)
}

Output:

[Image]

This gives us the ItemId value we need.
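For use inside a service, the same steps can be wrapped in a function that returns errors instead of printing them. The following is only a sketch: extractItemID and the content type are hypothetical names, and the URL in main is a shortened stand-in with the same structure as the sample URL above.

```go
package main

import (
    "encoding/json"
    "errors"
    "fmt"
    "net/url"
    "strconv"
)

// content mirrors only the part of the "content" JSON that is needed here.
type content struct {
    Items []struct {
        ItemID string `json:"itemid"`
    } `json:"items"`
}

// extractItemID returns the first item ID found in the creative URL.
func extractItemID(rawURL string) (int64, error) {
    u, err := url.Parse(rawURL)
    if err != nil {
        return 0, err
    }
    // Query() percent-decodes the parameter value.
    raw := u.Query().Get("content")
    if raw == "" {
        return 0, errors.New("no content parameter")
    }
    var c content
    if err := json.Unmarshal([]byte(raw), &c); err != nil {
        return 0, err
    }
    if len(c.Items) == 0 {
        return 0, errors.New("content has no items")
    }
    return strconv.ParseInt(c.Items[0].ItemID, 10, 64)
}

func main() {
    // Shortened stand-in URL (assumption): same layout, fewer fields.
    raw := "https://handycam.alicdn.com/slideshow/26/x.mp4?content=%7B%22items%22%3A%5B%7B%22itemid%22%3A%227664169349%22%7D%5D%7D"
    id, err := extractItemID(raw)
    fmt.Println(id, err) // prints 7664169349 <nil>
}
```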

Step 2

The second step is to build the URL and scrape the page content.

How do you fetch a web page with Golang? Here is a simple demo.

package main
import (
    "net/http"
    "io/ioutil"
    "fmt"
)
func main(){
    var website string = "http://www.future.org.cn"
    if resp,err := http.Get(website); err == nil{
        defer resp.Body.Close()
        if body, err := ioutil.ReadAll(resp.Body); err == nil {
            fmt.Println("HTML content:", string(body));
        }else{
            fmt.Println("Cannot read from connected http server:", err);
        }
    }else{
        fmt.Println("Cannot connect the server:", err);
    }
}

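However, once the page is fetched, a problem shows up: the Chinese text is garbled, because the page is not served as UTF-8.

For a fix, see:

https://gocn.vip/article/364

Install iconv-go:

go get github.com/djimenez/iconv-go

The content can then be transcoded after it is fetched, for example: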

func convFromGbk(s string) string {
    gbkConvert, _ := iconv.NewConverter("gbk", "utf-8")
    res, _ := gbkConvert.ConvertString(s)
    return res
}

Alternatively, the Reader itself can be converted:

    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
    rsp, err := j.client.Do(req)
    if err != nil {
        return nil, err
    }
    // Transcode the body from gb2312 to UTF-8
    utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
    //if body, err := ioutil.ReadAll(utfBody); err == nil {
    //    fmt.Println("HTML content:", string(body))
    //}
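Whether transcoding is needed depends on the charset the server declares. As a small sketch using only the standard library, the hypothetical helper declaredCharset below reads it from a Content-Type header value. Pages that declare their charset only in a meta tag are not covered, so this is a first check rather than a complete detector.

```go
package main

import (
    "fmt"
    "mime"
    "strings"
)

// declaredCharset returns the lower-cased charset parameter of a
// Content-Type header value, or "" if none is declared or the value is invalid.
func declaredCharset(contentType string) string {
    _, params, err := mime.ParseMediaType(contentType)
    if err != nil {
        return ""
    }
    return strings.ToLower(params["charset"])
}

func main() {
    fmt.Println(declaredCharset("text/html; charset=GBK")) // prints gbk
    fmt.Println(declaredCharset("text/html") == "")        // prints true
}
```

With a real response, the input would be rsp.Header.Get("Content-Type"), and the result could decide whether to wrap the body in a converting Reader.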

The fetched page then needs to be parsed; XPath is used here.

For how to use XPath, see:

http://www.w3school.com.cn/xpath/xpath_axes.asp

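It is simple and quick to pick up.

What comes back is HTML, so all you have to do is pull out the parts you want; in other words, parse the HTML.

There is one more difficulty: the static page we fetch contains most of the information but not the price, because the price is loaded dynamically.

Let's analyze the requests the page makes:

[Image]

Open the price request, copy its URL, and open it in a browser: access is denied with a 403. Don't worry; adding the following parameter to the request headers is enough.

[Image]

In code: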

referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
req.Header.Set("Referer", referer)
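To make the effect of the header concrete, here is a self-contained sketch. A local httptest server imitates an endpoint that rejects requests without a matching Referer; that rule is an assumption for illustration, not a description of how the real price endpoint decides. statusWithReferer and newRefererCheckedServer are hypothetical names.

```go
package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "strings"
)

// statusWithReferer sends a GET request and returns the response status code.
// An empty referer leaves the header unset.
func statusWithReferer(target, referer string) (int, error) {
    req, err := http.NewRequest("GET", target, nil)
    if err != nil {
        return 0, err
    }
    if referer != "" {
        req.Header.Set("Referer", referer)
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()
    return resp.StatusCode, nil
}

// newRefererCheckedServer is a stand-in that answers 403 unless the Referer
// points at detail.tmall.com (assumed rule, for illustration only).
func newRefererCheckedServer() *httptest.Server {
    return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if !strings.HasPrefix(r.Referer(), "https://detail.tmall.com/") {
            http.Error(w, "forbidden", http.StatusForbidden)
            return
        }
        fmt.Fprint(w, "{}")
    }))
}

func main() {
    srv := newRefererCheckedServer()
    defer srv.Close()

    without, _ := statusWithReferer(srv.URL, "")
    with, _ := statusWithReferer(srv.URL, "https://detail.tmall.com/item.htm?id=7664169349")
    fmt.Println(without, with) // prints 403 200
}
```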

The response turns out to be JSON:

[Image]

After formatting it (formatter: http://tool.oschina.net/codeformat/json):

{
    "defaultModel": {
        "bannerDO": {
            "success": true
        }, 
        "deliveryDO": {
            "areaId": 110100, 
            "deliveryAddress": "浙江金华", 
            "deliverySkuMap": {
                "6310159781": [
                    {
                        "arrivalNextDay": false, 
                        "arrivalThisDay": false, 
                        "forceMocked": false, 
                        "postage": "快递: 0.00 ", 
                        "postageFree": false, 
                        "skuDeliveryAddress": "浙江金华", 
                        "type": 0
                    }
                ], 
                "default": [
                    {
                        "arrivalNextDay": false, 
                        "arrivalThisDay": false, 
                        "forceMocked": false, 
                        "postage": "快递: 0.00 ", 
                        "postageFree": false, 
                        "skuDeliveryAddress": "浙江金华", 
                        "type": 0
                    }
                ], 
                "6310159797": [
                    {
                        "arrivalNextDay": false, 
                        "arrivalThisDay": false, 
                        "forceMocked": false, 
                        "postage": "快递: 0.00 ", 
                        "postageFree": false, 
                        "skuDeliveryAddress": "浙江金华", 
                        "type": 0
                    }
                ], 
                "3280089025135": [
                    {
                        "arrivalNextDay": false, 
                        "arrivalThisDay": false, 
                        "forceMocked": false, 
                        "postage": "快递: 0.00 ", 
                        "postageFree": false, 
                        "skuDeliveryAddress": "浙江金华", 
                        "type": 0
                    }
                ], 
                "3280089025136": [
                    {
                        "arrivalNextDay": false, 
                        "arrivalThisDay": false, 
                        "forceMocked": false, 
                        "postage": "快递: 0.00 ", 
                        "postageFree": false, 
                        "skuDeliveryAddress": "浙江金华", 
                        "type": 0
                    }
                ]
            }, 
            "destination": "北京市", 
            "success": true
        }, 
        "detailPageTipsDO": {
            "crowdType": 0, 
            "hasCoupon": true, 
            "hideIcons": false, 
            "jhs99": false, 
            "minicartSurprise": 0, 
            "onlyShowOnePrice": false, 
            "priceDisplayType": 4, 
            "primaryPicIcons": [ ], 
            "prime": false, 
            "showCuntaoIcon": false, 
            "showDou11Style": false, 
            "showDou11SugPromPrice": false, 
            "showDou12CornerIcon": false, 
            "showDuo11Stage": 0, 
            "showJuIcon": false, 
            "showMaskedDou11SugPrice": false, 
            "success": true, 
            "trueDuo11Prom": false
        }, 
        "doubleEleven2014": {
            "doubleElevenItem": false, 
            "halfOffItem": false, 
            "showAtmosphere": false, 
            "showRightRecommendedArea": false, 
            "step": 0, 
            "success": true
        }, 
        "extendedData": { }, 
        "extras": { }, 
        "gatewayDO": {
            "changeLocationGateway": {
                "queryDelivery": true, 
                "queryProm": false
            }, 
            "success": true, 
            "trade": {
                "addToBuyNow": { }, 
                "addToCart": { }
            }
        }, 
        "inventoryDO": {
            "hidden": false, 
            "icTotalQuantity": 225, 
            "skuQuantity": {
                "3280089025136": {
                    "quantity": 71, 
                    "totalQuantity": 71, 
                    "type": 1
                }, 
                "6310159781": {
                    "quantity": 33, 
                    "totalQuantity": 33, 
                    "type": 1
                }, 
                "6310159797": {
                    "quantity": 44, 
                    "totalQuantity": 44, 
                    "type": 1
                }, 
                "3280089025135": {
                    "quantity": 77, 
                    "totalQuantity": 77, 
                    "type": 1
                }
            }, 
            "success": true, 
            "totalQuantity": 225, 
            "type": 1
        }, 
        "itemPriceResultDO": {
            "areaId": 110100, 
            "duo11Item": false, 
            "duo11Stage": 0, 
            "extraPromShowRealPrice": false, 
            "halfOffItem": false, 
            "hasDPromotion": false, 
            "hasMobileProm": false, 
            "hasTmallappProm": false, 
            "hiddenNonBuyPrice": false, 
            "hideMeal": false, 
            "priceInfo": {
                "6310159781": {
                    "areaSold": true, 
                    "onlyShowOnePrice": false, 
                    "price": "178.00", 
                    "promotionList": [
                        {
                            "amountPromLimit": 0, 
                            "amountRestriction": "", 
                            "basePriceType": "IcPrice", 
                            "canBuyCouponNum": 0, 
                            "endTime": 1561651200000, 
                            "extraPromTextType": 0, 
                            "extraPromType": 0, 
                            "limitProm": false, 
                            "postageFree": false, 
                            "price": "75.00", 
                            "promType": "normal", 
                            "start": false, 
                            "startTime": 1546267717000, 
                            "status": 2, 
                            "tfCartSupport": false, 
                            "tmallCartSupport": false, 
                            "type": "火爆促销", 
                            "unLogBrandMember": false, 
                            "unLogShopVip": false, 
                            "unLogTbvip": false
                        }
                    ], 
                    "sortOrder": 0
                }, 
                "6310159797": {
                    "areaSold": true, 
                    "onlyShowOnePrice": false, 
                    "price": "178.00", 
                    "promotionList": [
                        {
                            "amountPromLimit": 0, 
                            "amountRestriction": "", 
                            "basePriceType": "IcPrice", 
                            "canBuyCouponNum": 0, 
                            "endTime": 1561651200000, 
                            "extraPromTextType": 0, 
                            "extraPromType": 0, 
                            "limitProm": false, 
                            "postageFree": false, 
                            "price": "75.00", 
                            "promType": "normal", 
                            "start": false, 
                            "startTime": 1546267717000, 
                            "status": 2, 
                            "tfCartSupport": false, 
                            "tmallCartSupport": false, 
                            "type": "火爆促销", 
                            "unLogBrandMember": false, 
                            "unLogShopVip": false, 
                            "unLogTbvip": false
                        }
                    ], 
                    "sortOrder": 0
                }, 
                "3280089025135": {
                    "areaSold": true, 
                    "onlyShowOnePrice": false, 
                    "price": "168.00", 
                    "promotionList": [
                        {
                            "amountPromLimit": 0, 
                            "amountRestriction": "", 
                            "basePriceType": "IcPrice", 
                            "canBuyCouponNum": 0, 
                            "endTime": 1561651200000, 
                            "extraPromTextType": 0, 
                            "extraPromType": 0, 
                            "limitProm": false, 
                            "postageFree": false, 
                            "price": "68.00", 
                            "promType": "normal", 
                            "start": false, 
                            "startTime": 1546267717000, 
                            "status": 2, 
                            "tfCartSupport": false, 
                            "tmallCartSupport": false, 
                            "type": "火爆促销", 
                            "unLogBrandMember": false, 
                            "unLogShopVip": false, 
                            "unLogTbvip": false
                        }
                    ], 
                    "sortOrder": 0
                }, 
                "3280089025136": {
                    "areaSold": true, 
                    "onlyShowOnePrice": false, 
                    "price": "168.00", 
                    "promotionList": [
                        {
                            "amountPromLimit": 0, 
                            "amountRestriction": "", 
                            "basePriceType": "IcPrice", 
                            "canBuyCouponNum": 0, 
                            "endTime": 1561651200000, 
                            "extraPromTextType": 0, 
                            "extraPromType": 0, 
                            "limitProm": false, 
                            "postageFree": false, 
                            "price": "68.00", 
                            "promType": "normal", 
                            "start": false, 
                            "startTime": 1546267717000, 
                            "status": 2, 
                            "tfCartSupport": false, 
                            "tmallCartSupport": false, 
                            "type": "火爆促销", 
                            "unLogBrandMember": false, 
                            "unLogShopVip": false, 
                            "unLogTbvip": false
                        }
                    ], 
                    "sortOrder": 0
                }
            }, 
            "queryProm": false, 
            "success": true, 
            "successCall": true, 
            "tmallShopProm": [ ]
        }, 
        "memberRightDO": {
            "activityType": 0, 
            "level": 0, 
            "postageFree": false, 
            "shopMember": false, 
            "success": true, 
            "time": 1, 
            "value": 0.5
        }, 
        "miscDO": {
            "bucketId": 15, 
            "city": "北京", 
            "cityId": 110100, 
            "debug": { }, 
            "hasCoupon": false, 
            "region": "东城区", 
            "regionId": 110101, 
            "rn": "fa015e69c6a4ca4bb559805d670557e7", 
            "smartBannerFlag": "top", 
            "success": true, 
            "supportCartRecommend": false, 
            "systemTime": "1555232632711", 
            "town": "东华门街道", 
            "townId": 110101001
        }, 
        "regionalizedData": {
            "success": true
        }, 
        "sellCountDO": {
            "sellCount": "5", 
            "success": true
        }, 
        "servicePromise": {
            "has3CPromise": false, 
            "servicePromiseList": [
                {
                    "description": "商品支持正品保障服务", 
                    "displayText": "正品保证", 
                    "icon": "无", 
                    "link": "//www.tmall.com/wow/portal/act/bzj", 
                    "rank": -1
                }, 
                {
                    "description": "极速退款是为诚信会员提供的退款退货流程的专享特权,额度是根据每个用户当前的信誉评级情况而定", 
                    "displayText": "极速退款", 
                    "icon": "//img.alicdn.com/bao/album/sys/icon/discount.gif", 
                    "link": "//vip.tmall.com/vip/privilege.htm?spm=3.1000588.0.141.2a0ae8&priv=speed", 
                    "rank": -1
                }, 
                {
                    "description": "卖家为您购买的商品投保退货运费险(保单生效以下单显示为准)", 
                    "displayText": "赠运费险", 
                    "icon": "//img.alicdn.com/bao/album/sys/icon/discount.gif", 
                    "link": "//service.tmall.com/support/tmall/knowledge-1121473.htm?spm=0.0.0.0.asbDA1", 
                    "rank": -1
                }, 
                {
                    "description": "七天无理由退换", 
                    "displayText": "七天无理由退换", 
                    "icon": "//img.alicdn.com/tps/i3/T1Vyl6FCBlXXaSQP_X-16-16.png", 
                    "link": "//pages.tmall.com/wow/seller/act/seven-day", 
                    "rank": -1
                }
            ], 
            "show": true, 
            "success": true, 
            "titleInformation": [ ]
        }, 
        "soldAreaDataDO": {
            "currentAreaEnable": true, 
            "success": true, 
            "useNewRegionalSales": true
        }, 
        "tradeResult": {
            "cartEnable": true, 
            "cartType": 2, 
            "miniTmallCartEnable": true, 
            "startTime": 1554812946000, 
            "success": true, 
            "tradeEnable": true
        }, 
        "userInfoDO": {
            "activeStatus": 0, 
            "companyPurchaseUser": false, 
            "loginMember": false, 
            "loginUserType": "buyer", 
            "success": true, 
            "userId": 0
        }
    }, 
    "isSuccess": true
}
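The price URL in the full code further down is requested with callback=setMdskip, and parsePrice cuts out the text between parentheses, which suggests the raw response arrives wrapped as setMdskip({...}) rather than as bare JSON. Under that assumption, the unwrapping step can be sketched as follows; stripJSONP is a hypothetical helper name, and the body in main is a made-up miniature response.

```go
package main

import (
    "encoding/json"
    "errors"
    "fmt"
    "strings"
)

// stripJSONP returns the text between the first "(" and the last ")".
// Using the last ")" keeps parentheses inside JSON string values intact.
func stripJSONP(s string) (string, error) {
    left := strings.Index(s, "(")
    right := strings.LastIndex(s, ")")
    if left < 0 || right <= left {
        return "", errors.New("not a JSONP response")
    }
    return s[left+1 : right], nil
}

func main() {
    // Made-up miniature response for illustration.
    body := `setMdskip({"isSuccess":true,"note":"a (b)"})`
    payload, err := stripJSONP(body)
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    var v map[string]interface{}
    if err := json.Unmarshal([]byte(payload), &v); err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println(v["isSuccess"], v["note"]) // prints true a (b)
}
```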

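The JSON is large, and parsing every field would be tedious. We only need the price information, namely priceInfo, so we want an XPath-like way to query it. JSONPath is used here.

Reference: https://github.com/DarrenChanChenChi/jsonpath

Usage is much the same as XPath.

Then simply extract the values we want.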
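For comparison, the same values can also be reached without JSONPath. The sketch below is a swapped-in alternative, not what this article's code does: it declares structs for only the path down to priceInfo and lets encoding/json ignore everything else. The type and function names are hypothetical, and the input is a trimmed excerpt of the response shown above.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// priceResponse declares only the path down to priceInfo; every other
// field in the response is ignored by encoding/json.
type priceResponse struct {
    DefaultModel struct {
        ItemPriceResultDO struct {
            PriceInfo map[string]skuPrice `json:"priceInfo"`
        } `json:"itemPriceResultDO"`
    } `json:"defaultModel"`
}

// skuPrice holds the list price and the promotion prices of one SKU.
type skuPrice struct {
    Price         string `json:"price"`
    PromotionList []struct {
        Price string `json:"price"`
    } `json:"promotionList"`
}

// parsePriceInfo returns the price entry of every SKU in the response.
func parsePriceInfo(data []byte) (map[string]skuPrice, error) {
    var r priceResponse
    if err := json.Unmarshal(data, &r); err != nil {
        return nil, err
    }
    return r.DefaultModel.ItemPriceResultDO.PriceInfo, nil
}

func main() {
    // Trimmed excerpt of the response shown above.
    data := []byte(`{"defaultModel":{"itemPriceResultDO":{"priceInfo":{
        "6310159781":{"price":"178.00","promotionList":[{"price":"75.00"}]},
        "3280089025135":{"price":"168.00","promotionList":[{"price":"68.00"}]}}}},
        "isSuccess":true}`)
    info, err := parsePriceInfo(data)
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    for sku, p := range info {
        promo := ""
        if len(p.PromotionList) > 0 {
            promo = p.PromotionList[0].Price
        }
        fmt.Println(sku, p.Price, promo)
    }
}
```

The trade-off is that structs give compile-time field names and no per-lookup error handling, while JSONPath keeps the query as a single string that is easy to change.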

Full code

common.go:

package main

import (
    "github.com/djimenez/iconv-go"
    "time"
    "net"
    "net/http"
    "gopkg.in/xmlpath.v2"
    "strings"
    "fmt"
    "math/rand"
)

type Msg struct{
    AdID int64 `json:"ad_id"`
    SourceID int64 `json:"source_id"`
    Source string `json:"source"`
    ItemID int64 `json:"item_id"`
    URL string `json:"url"`
    UID int64 `json:"uid"`
    DID int64 `json:"did"`
}

func convFromGbk(s string) string {
    gbkConvert, _ := iconv.NewConverter("gbk", "utf-8")
    res, _ := gbkConvert.ConvertString(s)
    return res
}

func newHTTPClient() *http.Client {
    client := &http.Client{
        Transport: &http.Transport{
            Dial: func(netw, addr string) (net.Conn, error) {
                return net.DialTimeout(netw, addr, time.Duration(1500*time.Millisecond))
            },
            MaxIdleConnsPerHost: 200,
        },
        Timeout: time.Duration(1500 * time.Millisecond),
    }
    return client
}

// parseNode returns the first non-empty match of xpath.
func parseNode(node *xmlpath.Node, xpath string) string {
    path, err := xmlpath.Compile(xpath)
    if err != nil {
        // fmt.Errorf only builds an error value; print it so the failure is visible.
        fmt.Println("compile xpath failed:", err)
        return ""
    }

    it := path.Iter(node)
    for it.Next() {
        s := strings.TrimSpace(it.Node().String())
        if len(s) != 0 {
            //return convFromGbk(s)
            return s
        }
    }
    return ""
}

// parseNodeForAll returns every non-empty match of xpath.
func parseNodeForAll(node *xmlpath.Node, xpath string) []string {
    path, err := xmlpath.Compile(xpath)
    if err != nil {
        fmt.Println("compile xpath failed:", err)
        return nil
    }

    it := path.Iter(node)
    elements := []string{}
    for it.Next() {
        s := strings.TrimSpace(it.Node().String())
        if len(s) != 0 {
            //return convFromGbk(s)
            elements = append(elements, s)
        }
    }
    return elements
}

// percent returns true with a probability of pct percent
func percent(pct int) bool {
    if pct < 0 || pct > 100 {
        return false
    }
    return pct > rand.Intn(100)
}

ali_spider.go:

package main

import (
    "code.byted.org/gopkg/logs"
    "encoding/json"
    "fmt"
    "github.com/djimenez/iconv-go"
    "github.com/ngaut/logging"
    "github.com/oliveagle/jsonpath"
    "gopkg.in/xmlpath.v2"
    "io/ioutil"
    "math/rand"
    "net/http"
    "strconv"
    "strings"
)

const itemURLPatternAli = "https://detail.tmall.com/item.htm?id=%d"
const priceURLPatternAli = "https://mdskip.taobao.com/core/initItemDetail.htm?isUseInventoryCenter=false&cartEnable=true&service3C=false&isApparel=true&isSecKill=false&tmallBuySupport=true&isAreaSell=false&tryBeforeBuy=false&offlineShop=false&itemId=%d&showShopProm=false&isPurchaseMallPage=false&itemGmtModified=1555201252000&isRegionLevel=false&household=false&sellerPreview=false&queryMemberRight=true&addressLevel=2&isForbidBuyItem=false&callback=setMdskip&timestamp=1555210888509&isg=bBQF1SmIvk4dQ8UGBOCNIZNDTp7T7IRAguWjmN99i_5Qy1Y_p8_OlZkxNev6Vj5RsG8p46-P7M29-etfw&isg2=BPPzr6M1qyiTZGdgYB4puOBagvEXdGgbstRSkqWQUpJJpBNGLPrUOlF1XpTvBN_i"


var ualist = []string{
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
}

type AliSpider struct {
    client *http.Client
}

func NewAliSpider() *AliSpider {
    return &AliSpider{
        client: newHTTPClient(),
    }
}

func (j *AliSpider) loadPage(url string) (*xmlpath.Node, error) {
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
    rsp, err := j.client.Do(req)
    if err != nil {
        return nil, err
    }
    defer rsp.Body.Close()
    // Transcode the body from gb2312 to UTF-8
    utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
    if err != nil {
        return nil, err
    }
    //if body, err := ioutil.ReadAll(utfBody); err == nil {
    //    fmt.Println("HTML content:", string(body))
    //}
    return xmlpath.ParseHTML(utfBody)
}

func (j *AliSpider) parsePrice(itemID int64) (map[string]map[string]float64, error) {
    priceURL := fmt.Sprintf(priceURLPatternAli, itemID)
    req, err := http.NewRequest("GET", priceURL, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
    referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
    req.Header.Set("Referer", referer)
    rsp, err := j.client.Do(req)
    if err != nil {
        return nil, err
    }
    defer rsp.Body.Close()
    priceInfoRaw, err := ioutil.ReadAll(rsp.Body)
    if err != nil {
        return nil, err
    }
    priceInfo := string(priceInfoRaw)
    jsonStr := convFromGbk(priceInfo)
    // Keep only the JSON between the outermost parentheses of the callback wrapper.
    leftIndex := strings.Index(jsonStr, "(") + 1
    rightIndex := strings.LastIndex(jsonStr, ")")
    if leftIndex == 0 || rightIndex < leftIndex {
        return nil, fmt.Errorf("unexpected price response for item %d", itemID)
    }
    var json_data interface{}
    if err := json.Unmarshal([]byte(jsonStr[leftIndex:rightIndex]), &json_data); err != nil {
        return nil, err
    }
    skuQuantity, err := jsonpath.JsonPathLookup(json_data, "$.defaultModel.inventoryDO.skuQuantity")
    if err != nil {
        logs.Info("json path is err, err is %v", err)
        return nil, err
    }
    skuQuantityMap, ok := skuQuantity.(map[string]interface{})
    if !ok {
        return nil, fmt.Errorf("unexpected skuQuantity type %T", skuQuantity)
    }
    itemPriceResultMap := map[string]map[string]float64{}
    for skuQuantityId := range skuQuantityMap {
        //fmt.Println(key, value)
        jpathPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.price", skuQuantityId)
        jpathPromotionPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.promotionList[0].price", skuQuantityId)
        price, err := jsonpath.JsonPathLookup(json_data, jpathPrice)
        if err != nil {
            logs.Info("jpathPrice is err, err is %v", err)
        }
        promotionPrice, err := jsonpath.JsonPathLookup(json_data, jpathPromotionPrice)
        if err != nil {
            logs.Info("jpathPromotionPrice is err, err is %v", err)
        }
        // Comma-ok assertions yield "" (parsed as 0) instead of panicking when a lookup failed.
        priceStr, _ := price.(string)
        promotionPriceStr, _ := promotionPrice.(string)
        // Allocate a fresh map per SKU; one shared map would give every SKU the same prices.
        itemPriceResultDetailMap := map[string]float64{}
        itemPriceResultDetailMap["price"], _ = strconv.ParseFloat(priceStr, 64)
        itemPriceResultDetailMap["promotion_price"], _ = strconv.ParseFloat(promotionPriceStr, 64)
        itemPriceResultMap[skuQuantityId] = itemPriceResultDetailMap
    }
    return itemPriceResultMap, nil
}

func (j *AliSpider) Parse(msg *Msg) (map[string]interface{}, error) {
    defer func() {
        if r := recover(); r != nil {
            logging.Errorf("parse msg %v, error %v", *msg, r)
            return
        }
    }()
    itemURL := fmt.Sprintf(itemURLPatternAli, msg.ItemID)
    node, err := j.loadPage(itemURL)
    if err != nil {
        return nil, err
    }
    //metricsClient.EmitCounter("jd_spider", 1, "", map[string]string{"step": "parse"})
    name := parseNode(node, "//h1[@data-spm]")
    // Item attributes, for example:
    /**
    产品名称:纽曼
    品牌: 纽曼
    型号: EX16
    功能: 睡眠监测 计步 防水
     */
    details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
    detailsMap := make(map[string]string, len(details))
    for _, detail := range details {
        // Split on the first colon only, so values that contain a colon stay whole.
        split := strings.SplitN(detail, ":", 2)
        if len(split) > 1 {
            detailsMap[split[0]] = strings.TrimSpace(split[1])
        }
    }
    shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")
    // Shop scores: description, service, logistics
    shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
    if len(shopinfos) < 3 {
        return nil, fmt.Errorf("invalid html page %s: expected 3 shop scores, got %d", itemURL, len(shopinfos))
    }
    describe, _ := strconv.ParseFloat(shopinfos[0], 64)
    service, _ := strconv.ParseFloat(shopinfos[1], 64)
    logistics, _ := strconv.ParseFloat(shopinfos[2], 64)
    // Prices (one entry per SKU; price is the list price, promotion_price is the promotional price)
    //map[4023134073248:map[price:3299.00 promotion_price:3299.00] 4023134073249:map[price:3299.00 promotion_price:3299.00] 4200326178501:map[promotion_price:3299.00 price:3299.00] 4023134073246:map[price:3299.00 promotion_price:3299.00] 4023134073247:map[price:3299.00 promotion_price:3299.00] 4023134073245:map[price:3299.00 promotion_price:3299.00] 4023134073250:map[price:3299.00 promotion_price:3299.00]]
    itemPriceResultMap, err := j.parsePrice(msg.ItemID)
    if err != nil {
        logging.Errorf("parse price of item %d, error %v", msg.ItemID, err)
    }
    res := map[string]interface{}{}
    res["source"] = "Ali"
    res["source_id"] = msg.SourceID
    res["id"] = msg.ItemID
    res["ad_id"] = msg.AdID
    res["url"] = itemURL
    res["name"] = name
    res["details"] = detailsMap
    res["shopname"] = shopname
    res["describe"] = describe
    res["service"] = service
    res["logistics"] = logistics
    res["uid"] = msg.UID
    res["did"] = msg.DID
    res["item_price"] = itemPriceResultMap
    // Validate a few fields that every valid page must contain
    if res["name"] == "" && res["shopname"] == "" {
        return nil, fmt.Errorf("invalid html page %s", itemURL)
    }
    return res, nil
}

 

ali_spider_test.go:

package main

import (
    "encoding/json"
    "fmt"
    "strconv"
    "strings"
    "testing"
)

func TestName(t *testing.T) {
    //conf, err := ssconf.LoadSsConfFile(confFile)
    //if err != nil {
    //    panic(err)
    //}
    aliSpider := NewAliSpider()
    //554867117919 585758506034
    var itemId int64 = 7664169349
    itemURL := fmt.Sprintf(itemURLPatternAli, itemId)
    node, err := aliSpider.loadPage(itemURL)
    if err != nil {
        // Stop here: parsing a nil node below would panic.
        t.Fatal(err)
    }
    //fmt.Println(node)
    name := parseNode(node, "//h1[@data-spm]")
    // Item attributes, for example:
    /**
    产品名称:纽曼
    品牌: 纽曼
    型号: EX16
    功能: 睡眠监测 计步 防水
     */
    details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
    detailsMap := make(map[string]string, len(details))
    for _, detail := range details {
        // Split on the first colon only, so values that contain a colon stay whole.
        split := strings.SplitN(detail, ":", 2)
        if len(split) > 1 {
            detailsMap[split[0]] = strings.TrimSpace(split[1])
        }
    }

    shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")

    // Shop scores: description, service, logistics
    shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
    if len(shopinfos) < 3 {
        t.Fatalf("expected 3 shop scores, got %d", len(shopinfos))
    }
    describe, _ := strconv.ParseFloat(shopinfos[0], 64)
    service, _ := strconv.ParseFloat(shopinfos[1], 64)
    logistics, _ := strconv.ParseFloat(shopinfos[2], 64)
    // Prices (one entry per SKU; price is the list price, promotion_price is the promotional price)
    //map[4023134073248:map[price:3299.00 promotion_price:3299.00] 4023134073249:map[price:3299.00 promotion_price:3299.00] 4200326178501:map[promotion_price:3299.00 price:3299.00] 4023134073246:map[price:3299.00 promotion_price:3299.00] 4023134073247:map[price:3299.00 promotion_price:3299.00] 4023134073245:map[price:3299.00 promotion_price:3299.00] 4023134073250:map[price:3299.00 promotion_price:3299.00]]
    itemPriceResultMap, err := aliSpider.parsePrice(itemId)
    if err != nil {
        t.Fatal(err)
    }

    res := map[string]interface{}{}
    res["source"] = "Ali"
    res["url"] = itemURL
    res["name"] = name
    res["details"] = detailsMap
    res["shopname"] = shopname
    res["describe"] = describe
    res["service"] = service
    res["logistics"] = logistics
    res["item_price"] = itemPriceResultMap

    bytes, err := json.Marshal(res)
    if err != nil {
        fmt.Println("error is ", err)
    }
    fmt.Println(string(bytes))
}

Output:

{"describe":4.9,"details":{"上市时间":"2014年冬季","乒乓底板材质":"其他","品牌":"Palio/拍里奥","型号":"TNT-1","层数":"9层","拍柄重量":"头沉柄轻","是否商场同款":"是","系列":"拍里奥TNT-1","货号":"TNT-1","颜色分类":"TNT-1直拍(短柄)1只+赠送:1海绵护边【7木+2碳】 TNT-1横拍(长柄)1只+赠送:1海绵护边【7木+2碳】 新TNT直拍(短柄)1只+赠送:1海绵护边【5木+2碳】 新TNT横拍(长柄)1只+赠送:1海绵护边【5木+2碳】"},"item_price":{"3280089025135":{"price":168,"promotion_price":68},"3280089025136":{"price":168,"promotion_price":68},"6310159781":{"price":168,"promotion_price":68},"6310159797":{"price":168,"promotion_price":68}},"logistics":4.8,"name":"正品 拍里奥乒乓球底板新TNT-1碳素快攻弧圈乒乓球拍底板球拍球板","service":4.8,"shopname":"玺源运动专营店","source":"Ali","url":"https://detail.tmall.com/item.htm?id=7664169349"}

 


Original article: https://www.cnblogs.com/DarrenChan/p/10706019.html
