Reference: http://python.jobbole.com/81351/#comment-93968
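This post is mainly based on the Jobbole (伯乐在线) article linked above, but the regular expression in that article's source code appears to be faulty: I tried it several times and never got it to work. Later I noticed in the comments under that article that someone was using BeautifulSoup instead, so I tried it, and BeautifulSoup really is convenient to use.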
# -*- coding:utf-8 -*-

'''
Author:LeonWen
'''

# Python 2 script (urllib2, print statements)
import urllib
import urllib2
# import re
from bs4 import BeautifulSoup

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
# set the headers
user_agent = 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    # name the parser explicitly so bs4 does not have to guess one
    object_bs = BeautifulSoup(response.read(), "html.parser")
    # print object_bs.prettify()
    # items is a list holding the matched post blocks
    items = object_bs.body.find_all("div", {"class": "article block untagged mb15"})
    # print items
    floor = 1
    tag = 0
    for item in items:
        if item.find("div", {"class": "thumb"}) is None:
            # class=thumb marks a post that contains an image; only text posts are printed
            author = item.find("h2")
            upNum = item.find("i", {"class": "number"})
            content = item.find("div", {"class": "content"})
            # print content.prettify()
            # print content.text
            print u"=============== Post", floor, u"======================="
            print u"Author:", author.text
            print u"Upvotes:", upNum.text
            print u"Content:", content.get_text()
            floor += 1
        else:
            tag += 1
    # number of image posts that were skipped
    print u"Number of image posts:", tag
except urllib2.URLError as e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
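For readers on Python 3, the parsing step of the script above can be tried without touching the network. The following is only a minimal sketch: `SAMPLE_HTML` is made-up markup that imitates the class names used in the script (the real page may look different today), and `extract_posts` is a hypothetical helper name, not something from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the class names used in the script.
SAMPLE_HTML = """
<html><body>
<div class="article block untagged mb15">
  <h2>alice</h2>
  <div class="content">A text-only joke.</div>
  <i class="number">120</i>
</div>
<div class="article block untagged mb15">
  <h2>bob</h2>
  <div class="content">A joke with a picture.</div>
  <div class="thumb"><img src="pic.jpg"/></div>
  <i class="number">45</i>
</div>
</body></html>
"""


def extract_posts(html):
    """Return (text_posts, image_count) for the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    image_count = 0
    for item in soup.find_all("div", {"class": "article block untagged mb15"}):
        if item.find("div", {"class": "thumb"}) is not None:
            # Posts with a thumb block contain an image; count and skip them.
            image_count += 1
            continue
        posts.append({
            "author": item.find("h2").get_text(strip=True),
            "votes": item.find("i", {"class": "number"}).get_text(strip=True),
            "content": item.find("div", {"class": "content"}).get_text(strip=True),
        })
    return posts, image_count


if __name__ == "__main__":
    posts, image_count = extract_posts(SAMPLE_HTML)
    for floor, post in enumerate(posts, start=1):
        print("=== Post", floor, "===")
        print("Author:", post["author"])
        print("Upvotes:", post["votes"])
        print("Content:", post["content"])
    print("Number of image posts:", image_count)
```

Keeping the parsing in a function that takes an HTML string makes it easy to check the selectors against a saved copy of the page before wiring in the HTTP request.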
Original article: http://www.cnblogs.com/leonwen/p/5721843.html