BeautifulSoup (bs for short) is a tool for parsing web pages and extracting the useful information from them. Personally I think regular expressions can do just about anything when it comes to pulling data out of a page, but writing every crawler with regexes is too much work, and bs is far more convenient. On well-formed pages bs works very well; on sloppily written pages bs is easy to trip up, and that is where regexes show their strength. Each has its pros and cons, but since most sites are written reasonably well, bs is the one I use more often.
bs is not the only page-analysis tool: the Scrapy framework uses XPath, which you could replace entirely with bs. Which of these tools to use comes down to personal preference.
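To make the regex side of that comparison concrete, here is a minimal sketch (Python 3 syntax; the HTML snippet and pattern are hypothetical, not from this article) of pulling `<img>` `src` attributes out of markup that is too sloppy for a strict parser:

```python
import re

# A tolerant pattern for <img ... src=...> that survives odd casing,
# spacing, and unquoted attribute values.
IMG_SRC = re.compile(r'<img[^>]*?src\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)

markup = """
<div><img src="http://example.com/a.jpg"></div>
<p><IMG  SRC='http://example.com/b.jpg' alt=pic>
<img src=http://example.com/c.jpg>
"""

srcs = IMG_SRC.findall(markup)
print(srcs)
# ['http://example.com/a.jpg', 'http://example.com/b.jpg', 'http://example.com/c.jpg']
```

The trade-off is exactly the one described above: the regex keeps working on the unquoted, mixed-case tags, but it knows nothing about document structure, so scoping the search to one part of the page takes extra work that bs gives you for free.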
Below is a very simple crawler (Python 2) that bs makes noticeably easier to write.
import urllib2  # Python 2 only
from bs4 import BeautifulSoup
import socket
from time import time

start = time()
baseurl = "http://jandan.net/ooxx/page-%s"

def user_agent(url):
    # Fetch url with a browser-like User-Agent and a 20-second timeout.
    req_header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req_timeout = 20
    try:
        req = urllib2.Request(url, None, req_header)
        html = urllib2.urlopen(req, None, req_timeout)
    except urllib2.URLError as e:
        print e.reason
        return None
    except socket.timeout:
        return user_agent(url)  # naive retry; note this can recurse forever on a dead link
    return html

def page_loop(pageid):
    url = baseurl % pageid
    print url
    page = user_agent(url)
    soup = BeautifulSoup(page)
    total_img = 0
    for myimg in soup.find_all('img'):
        jpgUrl = myimg.get('src')
        total_img += 1
        print jpgUrl
        # urllib.urlretrieve(jpgUrl, 'D:/Python/picture/' + jpgUrl[-11:])
        data = urllib2.urlopen(jpgUrl).read()
        with open('D:/Python/picture/' + jpgUrl[-11:], 'wb') as f:
            f.write(data)
    print total_img

page_start = 1000
page_stop = 1005

if __name__ == '__main__':
    for pageid in range(page_start, page_stop):
        page_loop(pageid)
    print 'Elapsed: %s' % (time() - start)
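One fragile spot in the script above is the filename slice jpgUrl[-11:], which assumes every image URL ends in an 11-character name. A sturdier derivation using only the standard library (Python 3 module names shown; the example URL is hypothetical):

```python
import os
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

def filename_from_url(url):
    # Keep only the final path segment, so hostnames and query
    # strings never leak into the saved filename.
    name = os.path.basename(urlparse(url).path)
    return name or 'unnamed.jpg'  # fallback for URLs that end in '/'

print(filename_from_url('http://example.com/pics/abc123.jpg?size=mw600'))
# abc123.jpg
```

This also avoids silently overwriting files whenever two different URLs happen to share their last 11 characters.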
Original post: http://www.cnblogs.com/pylab/p/4621626.html