
A Simple Dianping Crawler

Date: 2014-12-12 22:08:27


A very simple crawler that scrapes review information for places around Sun Yat-sen University (中大).
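The search request in the script below hard-codes a percent-encoded keyword. Decoded, it is 中山大学 (Sun Yat-sen University). As a small sketch of how such a URL could be built rather than pasted, assuming the same path prefix the script uses: `build_search_url` and `BASE` are hypothetical names, not part of the original post.

```python
from urllib.parse import quote, unquote

# Path prefix copied from the script's search request
BASE = 'http://www.dianping.com/search/keyword/206/0_'

def build_search_url(keyword, page):
    """Hypothetical helper: percent-encode the keyword as UTF-8 and append the page number."""
    return BASE + quote(keyword) + '/p' + str(page)

# Decode the keyword that appears in the script
print(unquote('%E4%B8%AD%E5%B1%B1%E5%A4%A7%E5%AD%A6'))
# Rebuild the URL for the first result page
print(build_search_url('中山大学', 1))
```

Building the URL this way keeps the keyword readable in the source and makes it easy to search for a different place name.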

# -*- coding: utf-8 -*-
import requests
import re
import time

def placeSplider(name, star, url):
    time.sleep(5)  # pause before each request to go easy on the site
    res = requests.get('http://www.dianping.com' + url)
    text = res.text
    # Full-length review bodies
    longInfo = "<p class=\"desc J-desc\">(.*?)</p>"
    longInfo_re = re.compile(longInfo, re.DOTALL)
    longInfos = longInfo_re.findall(text)

    # (star rating, possibly truncated review text) pairs
    info = "sml-rank-stars sml-str(.*?)\".*?<p class=\"desc\">(.*?)</p>"
    info_re = re.compile(info, re.DOTALL)
    results = info_re.findall(text)
    # print(results)
    # print('%d results' % len(results))
    if len(results) == 0 or len(results[0]) < 2 or results[0][1].count('人点评') > 0:
        print('No reviews\n')
        return
    fOut = open('D:\\%s.txt' % name, 'w', encoding='utf-8')
    fOut.write('place star %s\n' % star)
    for result in results:
        star = result[0]
        info = result[1]
        if info.count('<span') > 0 or info.count('仅售') > 0:  # stop at ads
            print('')
            break
        else:
            if info[-6:] == "......":  # replace a truncated review with its full-length version
                info = info[:-6]
                for i in longInfos:
                    if i.count(info) > 0:
                        info = i
                        break
            info = info.replace("<br/>", '')
            info = info.replace("<br>", '')
            info = info.replace("&nbsp;", '')
            print(star, info)
            fOut.write('%s\n' % star)
            fOut.write('%s\n' % info)
    fOut.close()

# Walk the first five pages of search results for the keyword 中山大学
for page in range(1, 6):
    res = requests.get('http://www.dianping.com/search/keyword/206/0_%E4%B8%AD%E5%B1%B1%E5%A4%A7%E5%AD%A6/p' + str(page))
    text = res.text
    # Captures: place name, detail-page path, star rating
    href = "data-hippo-type=\"shop\" title=\"(.*?)\" target=\"_blank\" href=\"(.*?)\".*?sml-rank-stars sml-str(.*?)\""
    href_re = re.compile(href, re.DOTALL)
    result = href_re.findall(text)
    for place in result:
        name = place[0]
        url = place[1]
        star = place[2]
        print(name, star, url)
        placeSplider(name, star, url)
    time.sleep(5)
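
The matching logic in placeSplider can be tried offline, without touching the site. The following is a minimal sketch, not the author's code: SAMPLE is made-up markup shaped only to fit the script's two patterns, and `extract_reviews` is a hypothetical helper name. It pairs each star value with its review, swaps a review ending in "......" for the full text that contains it, and strips the same three HTML fragments the script strips.

```python
import re

# Same two patterns as in the script
LONG_RE = re.compile(r'<p class="desc J-desc">(.*?)</p>', re.DOTALL)
SHORT_RE = re.compile(
    r'sml-rank-stars sml-str(.*?)".*?<p class="desc">(.*?)</p>', re.DOTALL)

# Hypothetical sample markup (an assumption, not real Dianping HTML)
SAMPLE = (
    '<span class="sml-rank-stars sml-str40"></span>'
    '<p class="desc">Great noodles, short wait......</p>'
    '<p class="desc J-desc">Great noodles, short wait, <br/>friendly staff.&nbsp;</p>'
    '<span class="sml-rank-stars sml-str30"></span>'
    '<p class="desc">Average.</p>'
)

def extract_reviews(text):
    """Return (star, review) pairs, preferring the full review over a truncated one."""
    long_infos = LONG_RE.findall(text)
    reviews = []
    for star, info in SHORT_RE.findall(text):
        if info.endswith('......'):
            info = info[:-6]  # drop the six-dot truncation marker
            for full in long_infos:
                if info in full:
                    info = full  # use the full-length text instead
                    break
        for junk in ('<br/>', '<br>', '&nbsp;'):
            info = info.replace(junk, '')
        reviews.append((star, info))
    return reviews

for star, info in extract_reviews(SAMPLE):
    print(star, info)
```

Because the patterns depend on exact class names and attribute order, they are likely to stop matching whenever the page markup changes. Testing against a saved page like this makes such breakage easier to spot.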

 


Original post: http://www.cnblogs.com/instant7/p/4160448.html
