Scraping Jianshu's Trending Articles with Python

Date: 2015-05-25 10:02:53

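The approach is the same as in the previous chapter on fetching jokes: request the page, then pull the fields out with a regular expression. Only the content being parsed has changed.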

Code:

#-*- coding: utf-8 -*-
import urllib2
import re

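def GetPageContent(page_url,heads):
    # Fetch page_url with the given request headers. Returns the body
    # decoded as UTF-8, or an empty string if the request fails.
    try:
        req = urllib2.Request(page_url,headers=heads)
        resp = urllib2.urlopen(req)
        return resp.read().decode('utf8')
    except Exception, e:
        print "Request [%s] error. -> "%(page_url), e
        return ""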

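def GetTopNotes(cont):
    # Capture groups, in order: author id, article link, article title,
    # comment count, like count.
    strRe = '.*?<li>.*?data-user-slug="(.*?)"'
    strRe += '.*?<h4>.*?<a.*?href="(.*?)".*?>(.*?)</a>'
    strRe += '.*?class="fa fa-comments-o".*?>.*?</i>(.*?)</a>'
    strRe += '.*?<a.*?id="like-note".*?</i>(.*?)</a>'

    # re.S lets '.' match newlines, so one match can span a whole <li>.
    pat = re.compile(strRe, re.S)
    items = pat.findall(cont)

    for item in items:
        for i in item:
            # Remove all whitespace inside each captured field.
            print "".join(i.split())
        print '==================================='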

if __name__ == '__main__':
    url = 'http://www.jianshu.com/trending/now'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent':user_agent}

    cont = GetPageContent(url, headers)
    # Skip everything before the trending list so matching starts there.
    cont = cont[cont.find('<ul class="top-notes ranking">'):]
    GetTopNotes(cont)
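
For readers on Python 3, where urllib2 and the print statement no longer exist, the same fetch-and-parse flow might be sketched roughly as follows. This is a minimal sketch, not the author's code. The lowercase function names, the 10-second timeout, the returned list of tuples, and the SAMPLE_HTML fragment are all assumptions made for illustration. The fragment is hand-written to fit the regex rather than copied from Jianshu, whose markup has very likely changed since 2015.

```python
import re
import urllib.request

# Same pattern as the script above, written as adjacent raw strings.
NOTE_RE = re.compile(
    r'.*?<li>.*?data-user-slug="(.*?)"'
    r'.*?<h4>.*?<a.*?href="(.*?)".*?>(.*?)</a>'
    r'.*?class="fa fa-comments-o".*?>.*?</i>(.*?)</a>'
    r'.*?<a.*?id="like-note".*?</i>(.*?)</a>',
    re.S,
)


def get_page_content(page_url, heads):
    """Return the page body as text, or "" if the request fails."""
    try:
        req = urllib.request.Request(page_url, headers=heads)
        # The timeout is an assumption; the original script sets none.
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read().decode('utf8')
    except Exception as e:
        print("Request [%s] error. -> %s" % (page_url, e))
        return ""


def get_top_notes(cont):
    """Return one 5-tuple per article, with whitespace removed from each field."""
    return [tuple("".join(field.split()) for field in item)
            for item in NOTE_RE.findall(cont)]


# Hypothetical fragment shaped to fit the regex; not real Jianshu markup.
SAMPLE_HTML = '''
<ul class="top-notes ranking">
  <li>
    <a class="avatar" data-user-slug="4c4231dc6796" href="/users/4c4231dc6796"></a>
    <h4><a href="/p/0aabe4120b78" target="_blank">下水道的秘密</a></h4>
    <a href="/p/0aabe4120b78#comments"><i class="fa fa-comments-o"></i> 48</a>
    <a id="like-note"><i class="fa fa-heart-o"></i> 20</a>
  </li>
</ul>
'''

if __name__ == '__main__':
    for note in get_top_notes(SAMPLE_HTML):
        print(note)
```

Returning the tuples instead of printing them keeps the parsing step testable without a network connection. To run against the live site, pass the result of get_page_content(url, headers) to get_top_notes in place of SAMPLE_HTML.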

Output:

C:\Python27\python.exe F:/SrcCode/Python/GetNewlyJokes/JianShuSpider.py
4c4231dc6796
/p/0aabe4120b78
下水道的秘密
48
20
===================================
564d899d4d3c
/p/8af1ad733670
蝉鸣的夏季我想遇见你
117
71
===================================
a36e18ccb59d
/p/f9e60eb98a28
再见，爱过的人
8
46
===================================
bcfca792018f
/p/9fa6b6e58fd0
我们曾相遇，想到就心酸（三十五）
19
27
===================================
2870cb3c6f77
/p/8329df311356
最佳情人
39
288
===================================
dc22650a4033
/p/f7f39b72fdb2
【连载】触不到的女神（10）
31
21
===================================

Each record lists, in order: author id, article link, article title, number of comments, and number of likes received.
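
Because each record is a plain sequence of strings in that fixed order, giving the fields names can make later processing easier to read. The following is a minimal Python 3 sketch under stated assumptions: Note and to_note are hypothetical names introduced here, the base URL is taken from the script, and the sample values are the first record of the output above.

```python
from collections import namedtuple
from urllib.parse import urljoin

BASE_URL = 'http://www.jianshu.com'

# Field names are illustrative; the order follows the printed records.
Note = namedtuple('Note', 'author_id url title comments likes')


def to_note(item):
    """Convert one 5-tuple of strings into a Note with typed fields."""
    author_id, path, title, comments, likes = item
    # Article links are site-relative, so resolve them against the base URL.
    return Note(author_id, urljoin(BASE_URL, path), title,
                int(comments), int(likes))


first = to_note(('4c4231dc6796', '/p/0aabe4120b78', '下水道的秘密', '48', '20'))
print(first.url, first.comments, first.likes)
```

Converting the counts to integers makes it possible to sort or filter the articles by comments or likes.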

Original article: http://blog.csdn.net/arbboter/article/details/45958207
