Notes on Writing a Python Crawler

Posted: 2015-05-26 10:49:11

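I recently wrote a small crawler in Python that automatically collects content from a few websites and saves it to local files. Here is a short summary of what I learned.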

The main libraries are urllib and BeautifulSoup.

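urllib and urllib2 are convenient libraries for fetching web pages. The core idea is to send a customized URL request and get the page content back. The simplest function checks whether a page exists: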

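import urllib2

# Request headers; may be left empty.
headers = {}

def isUrlExists(url):
  # Return True if the URL can be opened, False otherwise.
  req = urllib2.Request(url, headers=headers)
  try:
    urllib2.urlopen(req)
  except Exception:
    # Any failure counts as "does not exist"; Ctrl-C still propagates.
    return False
  return True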

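headers can be customized or left empty. The main reason to customize them is to mimic an ordinary browser's headers, which gets around the crawler blocking that some sites apply.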
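As a minimal sketch of the header customization described above, the following builds a request that carries browser-like headers. It is written for Python 3, where the functionality of urllib2 lives in urllib.request and urllib.error. The names BROWSER_HEADERS, build_request and url_exists, and the header values themselves, are illustrative assumptions and not part of the original crawler.

```python
import urllib.error
import urllib.request

# Hypothetical browser-like headers; any realistic User-Agent string would do.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}


def build_request(url, headers=None):
    """Return a Request for url; headers=None means 'use the browser-like defaults'."""
    chosen = BROWSER_HEADERS if headers is None else headers
    return urllib.request.Request(url, headers=dict(chosen))


def url_exists(url, timeout=10):
    """Return True if url can be opened, False on a malformed URL or any URL/HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout):
            return True
    except (urllib.error.URLError, OSError, ValueError):
        # URLError covers HTTPError; ValueError covers strings that are not URLs.
        return False
```

Passing headers={} sends the request with only the defaults that urllib adds itself, which corresponds to leaving headers empty.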

To fetch a page's content and also report the details of any error that occurs, you can do this:

# Assumes `import urllib2` and a `headers` dict, as in isUrlExists above.
def fetchLink(url):
  req = urllib2.Request(url, headers=headers)
  try:
    response = urllib2.urlopen(req)
  except urllib2.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first.
    print 'Got Http Error while retrieving: ', url, ' with response code: ', e.getcode(), ' and exception: ', e.reason
  except urllib2.URLError as e:
    print 'Got Url Error while retrieving: ', url, ' and the exception is: ', e.reason
  else:
    htmlContent = response.read()
    return htmlContent
  # Reached only after an error was printed.
  return None

The code above returns the page's HTML directly, or None if the request failed.
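For readers on Python 3, a rough equivalent of the fetch-with-error-reporting pattern might look like the sketch below. The function name fetch_link and its timeout parameter are assumptions made for illustration. The point it shows is the ordering of the handlers: HTTPError is a subclass of URLError, so it has to be caught first or its branch is never reached.

```python
import urllib.error
import urllib.request


def fetch_link(url, headers=None, timeout=10):
    """Return the response body as bytes, or None if the request failed."""
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error while retrieving", url, "- status:", e.code, "reason:", e.reason)
    except urllib.error.URLError as e:
        # No usable answer at all: DNS failure, refused connection, missing file.
        print("URL error while retrieving", url, "- reason:", e.reason)
    return None
```

The body comes back as bytes, which is convenient when the next step is a parser that does its own decoding.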

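BeautifulSoup (documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ ) is a concise HTML parsing library. Once you have the HTML, you could extract the information you want with Python's built-in regular expressions, but that gets tedious. BeautifulSoup parses the HTML into a standard tree structure, much like working with parsed JSON, so you can access elements directly. It also supports searching for elements, among other things.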

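  import bs4

  # content holds the raw HTML bytes, e.g. the return value of fetchLink(url).
  content = bs4.BeautifulSoup(content, 'html.parser', from_encoding='GB18030')
  posts = content.find_all(class_='post-content')
  for post in posts:
    # Text of the first entry-title element inside this post block.
    postText = post.find(class_='entry-title').get_text()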

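In this example, content is first converted into a bs4 object. The code then finds every block with class=post-content and extracts the text of the class=entry-title element inside each one. Note that the BeautifulSoup constructor lets you specify the encoding when parsing; GB18030, a Simplified Chinese encoding, is used here.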
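To make the search steps concrete, here is a small self-contained sketch in Python 3 that runs the same find_all / find / get_text sequence on an inline sample page. The sample HTML, the function name extract_titles, and the choice of the built-in "html.parser" backend are assumptions for illustration. It requires the third-party bs4 package.

```python
import bs4

# Made-up sample page, encoded as GB18030 bytes the way a server might send it.
SAMPLE_HTML = """
<html><body>
  <div class="post-content"><h2 class="entry-title">第一篇</h2><p>正文</p></div>
  <div class="post-content"><h2 class="entry-title">Second post</h2></div>
  <div class="post-content"><p>No title here</p></div>
  <div class="sidebar"><h2 class="entry-title">Not a post</h2></div>
</body></html>
""".encode("gb18030")


def extract_titles(html_bytes, encoding="gb18030"):
    """Return the entry-title text of every post-content block."""
    soup = bs4.BeautifulSoup(html_bytes, "html.parser", from_encoding=encoding)
    titles = []
    for post in soup.find_all(class_="post-content"):
        title = post.find(class_="entry-title")
        if title is not None:  # find() returns None when nothing matches
            titles.append(title.get_text(strip=True))
    return titles
```

The None check matters in practice: calling get_text() directly on the result of find() raises AttributeError for any block that lacks the element.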

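Everything above deals with fetching HTML text. If the page contains images, the fetched HTML still links to them at their original locations. If you need to download anything, you can use the urlretrieve method. I won't go into detail here.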
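For completeness, a minimal sketch of downloading images with urlretrieve follows, again in Python 3, where the function is urllib.request.urlretrieve. The helper name download_images and its parameters are hypothetical. Because image links in a page are often relative, the sketch resolves each one against the page URL with urllib.parse.urljoin before downloading.

```python
import os
import urllib.parse
import urllib.request


def download_images(page_url, image_srcs, dest_dir):
    """Download each image referenced by a page and return the saved file paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for src in image_srcs:
        # src may be relative ("img/a.png"); resolve it against the page URL.
        image_url = urllib.parse.urljoin(page_url, src)
        # Use the last path component as the local file name.
        name = os.path.basename(urllib.parse.urlparse(image_url).path) or "image"
        target = os.path.join(dest_dir, name)
        urllib.request.urlretrieve(image_url, target)
        saved.append(target)
    return saved
```

The image_srcs list could come from BeautifulSoup, for example by collecting the src attribute of every img tag on the page. Two images with the same file name would overwrite each other in this sketch.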

Tags: python tools

Original article: http://blog.csdn.net/cykic/article/details/46003389
