Scraping Web Page Content with Python

Date: 2014-10-02 19:45:23

Tags: python, web scraping

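I recently wanted to scrape some data from the web for research, and since I know a little Python, let's walk through a fairly simple way to do it.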

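Say I want to collect Obama's weekly addresses from http://www.putclub.com/html/radio/VOA/presidentspeech/index.html. Doing it by hand means opening each page one by one, then copying and saving the text, which is very tedious.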

Is there a way to do it all in one go? With a language as capable as Python, it can be done quickly.

First, let's look at the source of this page.

In the source, the information we need sits in a short URL inside each link of the listing.

More concretely, we need to visit every URL of the form http://www.putclub.com/html/radio/VOA/presidentspeech/2014/0928/91326.html, and those URLs have to be extracted from the index page above.

OK, let's start writing code.

First, open the index page and store its HTML in content.

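# Python 2 code: urllib.urlopen became urllib.request.urlopen in Python 3.
import urllib

url = "http://www.putclub.com/html/radio/VOA/presidentspeech/index.html"
wp = urllib.urlopen(url)   # open the index page
print "start download..."
content = wp.read()        # raw HTML of the index page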
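For readers on Python 3, where urllib.urlopen no longer exists, the download step might look like the sketch below. The helper name fetch, the 10-second timeout, and the gbk default encoding are assumptions made for illustration, not details from the original post; check the page's actual charset before relying on them.

```python
import urllib.request


def fetch(url, encoding="gbk", timeout=10):
    """Download url and return its body as text.

    encoding and timeout are illustrative defaults (assumptions).
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
    # errors="replace" keeps going if a few bytes do not decode cleanly
    return raw.decode(encoding, errors="replace")


# Usage (performs a real network request, so it is left commented out):
# content = fetch("http://www.putclub.com/html/radio/VOA/presidentspeech/index.html")
```

Wrapping the request in a with block closes the connection even if reading fails, and decoding up front means the later string searches work on text rather than bytes.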

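Next, we extract the link to each speech.

The idea is to search for "center_box" and then take whatever lies between the following "href=" and "target". To see why those two markers are the right ones, look at the page source.

What we get is the relative URL of an article; prefixing it with http://www.putclub.com/ gives the article's full address.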
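As a side note, the standard library can build the full address instead of concatenating strings by hand. A small Python 3 sketch, using the relative path from the example URL mentioned earlier in this post; the variable names are only illustrative:

```python
from urllib.parse import urljoin

BASE = "http://www.putclub.com/"

# Relative link as it might appear in an href attribute on the index page
relative = "/html/radio/VOA/presidentspeech/2014/0928/91326.html"

# urljoin handles the path with or without a leading slash
full_url = urljoin(BASE, relative)
print(full_url)
# prints http://www.putclub.com/html/radio/VOA/presidentspeech/2014/0928/91326.html
```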

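print content.count("center_box")    # how many times the marker occurs
# Drop everything before the first "center_box".
content = content[content.find("center_box") + 1:]
# Take the text between href= and target. This assumes the link looks like
# href="/html/..." target=...: +7 skips href=" and the leading "/",
# -2 drops the closing quote and the space before target.
content = content[content.find("href=") + 7:content.find("target") - 2]
filename = content                   # relative path, reused later for the file name
url = "http://www.putclub.com/" + content
print content
print url
wp = urllib.urlopen(url)             # open the article page (first link only)
print "start download..."
content = wp.read()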
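The code above picks up only the first link after "center_box". To visit every speech, the same find-based search can be repeated in a loop. The following Python 3 sketch shows one way to do that; the function name extract_links, its must_contain filter, and the sample HTML (including the second, made-up path) are assumptions for illustration, not the real page.

```python
def extract_links(html, start_marker="center_box", must_contain="presidentspeech"):
    """Collect href values that appear after start_marker.

    Only links whose path contains must_contain are kept, so that
    unrelated navigation links further down the page are skipped.
    """
    links = []
    pos = html.find(start_marker)
    if pos == -1:
        return links
    opener = 'href="'
    while True:
        start = html.find(opener, pos)
        if start == -1:
            break
        start += len(opener)
        end = html.find('"', start)   # closing quote of the attribute
        if end == -1:
            break
        link = html[start:end]
        if must_contain in link:
            links.append(link)
        pos = end + 1
    return links


# Made-up sample imitating the structure described in the post
sample = (
    '<div class="center_box">'
    '<a href="/html/radio/VOA/presidentspeech/2014/0928/91326.html" target="_blank">A</a>'
    '<a href="/html/radio/VOA/presidentspeech/2014/0921/00000.html" target="_blank">B</a>'
    '<a href="/about.html" target="_blank">About</a>'
    '</div>'
)
for link in extract_links(sample):
    print("http://www.putclub.com" + link)
```

Searching for the closing quote instead of the word "target" avoids depending on the attribute order inside the tag.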

With the article's URL in hand, the same approach filters out the body text.

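print content.count("<div class=\"content\"")
# Alternative start marker, left disabled:
#content = content[content.find("<div class=\"content\""):]
# Keep the text from the "info end" comment up to the pagination block.
content = content[content.find("<!--info end------->"):]
content = content[:content.find("<div class=\"dede_pages\"")-1]
# Strip the directory prefix, leaving e.g. 2014/0928/91326.html
filename = filename[filename.find("presidentspeech")+len("presidentspeech/"):]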
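One thing to watch with find-based slicing is that find returns -1 when a marker is missing, which silently produces a wrong slice. The Python 3 sketch below wraps the idea in a small helper that reports a missing marker instead. The helper name between and the sample page are assumptions for illustration; unlike the code above, it also excludes the start marker from the result.

```python
def between(text, start_marker, end_marker):
    """Return the text after start_marker and before end_marker.

    Returns None if either marker cannot be found.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)   # search only after the start marker
    if end == -1:
        return None
    return text[start:end].strip()


# Made-up page fragment using the two markers from the post
page = (
    '<div class="content"><!--info end------->\n'
    '<p>Hello, everybody.</p>\n'
    '<div class="dede_pages"></div></div>'
)
body = between(page, "<!--info end------->", '<div class="dede_pages"')
print(body)
# prints <p>Hello, everybody.</p>
```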

Finally, save the text to a file and print it.

# Turn the remaining slashes into dashes so the path works as a file name.
filename = filename.replace('/', "-")
fp = open(filename, "w+")
fp.write(content)
fp.close()
print content
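The saving step can be sketched in Python 3 with a with block, which closes the file even if writing fails. The function name save_article, the directory parameter, and the utf-8 output encoding are assumptions for illustration; the naming rule (drop everything up to "presidentspeech/", then replace slashes with dashes) follows the code above.

```python
import os
import tempfile


def save_article(relative_path, text, directory="."):
    """Write text to a file named after the article path and return the file path."""
    marker = "presidentspeech/"
    idx = relative_path.find(marker)
    # Drop the directory prefix if the marker is present
    name = relative_path[idx + len(marker):] if idx != -1 else relative_path
    # 2014/0928/91326.html -> 2014-0928-91326.html
    name = name.strip("/").replace("/", "-")
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(text)
    return path


# Demo: write into a temporary directory rather than the current one
out_dir = tempfile.mkdtemp()
saved = save_article(
    "html/radio/VOA/presidentspeech/2014/0928/91326.html",
    "Hello, everybody.",
    out_dir,
)
print(os.path.basename(saved))
# prints 2014-0928-91326.html
```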

OK, all done! Save the script as a .pyw file, and from then on a double-click is all it takes to save Obama's weekly address.


Original post: http://blog.csdn.net/zjccoder/article/details/39736875
