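The principle is the same as in the previous chapter, where we fetched jokes: request the page, then pull out the fields with a regular expression. Only the content being parsed is different.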
Code:
# -*- coding: utf-8 -*-
import urllib2
import re


def GetPageContent(page_url, heads):
    # Fetch the page and return its body decoded as UTF-8; return "" on failure.
    try:
        req = urllib2.Request(page_url, headers=heads)
        resp = urllib2.urlopen(req)
        return resp.read().decode('utf8')
    except Exception, e:
        print "Request [%s] error. -> " % (page_url), e
        return ""


def GetTopNotes(cont):
    # Each match captures five groups, in order:
    # author id, article link, article title, comment count, like count.
    strRe = '.*?<li>.*?data-user-slug="(.*?)"'
    strRe += '.*?<h4>.*?<a.*?href="(.*?)".*?>(.*?)</a>'
    strRe += '.*?class="fa fa-comments-o".*?>.*?</i>(.*?)</a>'
    strRe += '.*?<a.*?id="like-note".*?</i>(.*?)</a>'
    # re.S lets '.' match newlines, so the pattern can span a multi-line <li> block.
    pat = re.compile(strRe, re.S)
    items = re.findall(pat, cont)
    for item in items:
        for i in item:
            # Remove all whitespace from the captured field before printing it.
            print "".join(i.split())
        print '==================================='


if __name__ == '__main__':
    url = 'http://www.jianshu.com/trending/now'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    cont = GetPageContent(url, headers)
    # Keep only the part of the page from the ranking list onward.
    cont = cont[cont.find('<ul class="top-notes ranking">'):]
    GetTopNotes(cont)
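To see what the pattern captures without hitting the network, here is a minimal Python 3 sketch that runs the same regular expression over a small HTML fragment. The fragment is hypothetical: the article does not show the real page markup, so `SAMPLE_HTML` is only an assumption about what a matching `<li>` block might look like. The function name `parse_top_notes` is also made up for this illustration.

```python
import re

# Hypothetical markup, written only so that the article's pattern matches it.
SAMPLE_HTML = '''
<ul class="top-notes ranking">
  <li>
    <a class="avatar" data-user-slug="4c4231dc6796"></a>
    <h4><a href="/p/0aabe4120b78" target="_blank">下水道的秘密</a></h4>
    <a href="/p/0aabe4120b78#comments"><i class="fa fa-comments-o"></i> 48 </a>
    <a id="like-note"><i class="fa fa-heart-o"></i> 20 </a>
  </li>
  <li>
    <a class="avatar" data-user-slug="564d899d4d3c"></a>
    <h4><a href="/p/8af1ad733670" target="_blank">蝉鸣的夏季我想遇见你</a></h4>
    <a href="/p/8af1ad733670#comments"><i class="fa fa-comments-o"></i> 117 </a>
    <a id="like-note"><i class="fa fa-heart-o"></i> 71 </a>
  </li>
</ul>
'''

# Same pattern as in the article, built piece by piece.
PATTERN = re.compile(
    r'.*?<li>.*?data-user-slug="(.*?)"'
    r'.*?<h4>.*?<a.*?href="(.*?)".*?>(.*?)</a>'
    r'.*?class="fa fa-comments-o".*?>.*?</i>(.*?)</a>'
    r'.*?<a.*?id="like-note".*?</i>(.*?)</a>',
    re.S,  # '.' also matches newlines
)


def parse_top_notes(html):
    """Return one tuple of five whitespace-stripped strings per matched <li>."""
    rows = []
    for groups in PATTERN.findall(html):
        # Same cleanup as the article: drop every whitespace character.
        rows.append(tuple("".join(field.split()) for field in groups))
    return rows


if __name__ == '__main__':
    for row in parse_top_notes(SAMPLE_HTML):
        print(row)
```

Because every quantifier is non-greedy, each match stops at the first `</a>` that closes the like counter, and `findall` then resumes from there to pick up the next `<li>`.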
Output:
C:\Python27\python.exe F:/SrcCode/Python/GetNewlyJokes/JianShuSpider.py
4c4231dc6796
/p/0aabe4120b78
下水道的秘密
48
20
===================================
564d899d4d3c
/p/8af1ad733670
蝉鸣的夏季我想遇见你
117
71
===================================
a36e18ccb59d
/p/f9e60eb98a28
再见,爱过的人
8
46
===================================
bcfca792018f
/p/9fa6b6e58fd0
我们曾相遇,想到就心酸(三十五)
19
27
===================================
2870cb3c6f77
/p/8329df311356
最佳情人
39
288
===================================
dc22650a4033
/p/f7f39b72fdb2
【连载】触不到的女神(10)
31
21
===================================

The fields are, in order: author id, article link, article title, number of comments, and number of likes received.
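Since the meaning of each field depends only on its position, it can help to give the positions names. The following is a small Python 3 sketch under that assumption. `Note`, `to_note`, and `absolute_url` are names invented for this illustration and are not part of the original script. The two sample rows are copied from the output above.

```python
from collections import namedtuple
from urllib.parse import urljoin

# Field names are illustrative; their order follows the list above.
Note = namedtuple('Note', ['author_id', 'link', 'title', 'comments', 'likes'])

# The page the script scrapes, used to resolve the relative article links.
BASE_URL = 'http://www.jianshu.com/trending/now'


def to_note(fields):
    """Convert one five-string tuple into a Note with integer counters."""
    author_id, link, title, comments, likes = fields
    return Note(author_id, link, title, int(comments), int(likes))


def absolute_url(note):
    """Resolve the note's relative link against the scraped page's URL."""
    return urljoin(BASE_URL, note.link)


rows = [
    ('4c4231dc6796', '/p/0aabe4120b78', '下水道的秘密', '48', '20'),
    ('2870cb3c6f77', '/p/8329df311356', '最佳情人', '39', '288'),
]
notes = [to_note(row) for row in rows]

# With integer counters, sorting and comparison behave numerically.
most_liked = max(notes, key=lambda n: n.likes)

if __name__ == '__main__':
    print(most_liked.title, absolute_url(most_liked))
```

Converting the counters to `int` matters for ordering: compared as strings, '8' would sort after '46'.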
Original article: http://blog.csdn.net/arbboter/article/details/45958207