A Web Crawler Written in Python (Very Simple)

Posted: 2014-11-27 22:09:33


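This is a small web crawler that a classmate passed along to me. I found it interesting and wanted to share it. One caveat: it has to be run with Python 2 (the version named here is Python 2.3). Under Python 3.4 it runs into problems, because the script relies on the print statement and on urllib.urlopen, neither of which exists in that form in Python 3.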
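As a rough illustration of the version difference, the sketch below imports urlopen from whichever location the running interpreter provides. The helper name fetch is hypothetical and does not appear in the original program.

```python
# Pick whichever urlopen the running interpreter provides.
try:
    from urllib.request import urlopen  # Python 3 location
except ImportError:
    from urllib import urlopen  # Python 2 location


def fetch(url):
    """Return the raw bytes of the page at `url` (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()
```

In Python 3 the result is a bytes object, so it has to be decoded before any string pattern is applied to it.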

The Python program is as follows:

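import re, urllib

strTxt = ""
x = 1  # counter used to name the output files: 1.txt, 2.txt, ...
ff = open("wangzhi.txt", "r")  # input file, one URL per line

for line in ff.readlines():
    url = line.strip()  # remove the trailing newline before requesting
    if not url:
        continue  # skip blank lines
    f = open(str(x) + ".txt", "w+")
    print url
    # collect the contents of every <p>...</p> pair on the page
    n = re.findall(r"<p>(.*?)</p>", urllib.urlopen(url).read(), re.M)
    for i in n:
        if len(i) != 0:
            i = i.replace(" ", "")
            i = i.replace("<strong>", "")
            i = i.replace("</strong>", "")
            strTxt = strTxt + i
            # strip link and span markup left inside the paragraph
            strTxt = re.sub(r"<a href=(.*?)>", r"", strTxt)
            strTxt = re.sub(r"<a(.*?)>", r"", strTxt)
            strTxt = re.sub(r"<span>(.*?)</span>", r"", strTxt)
            strTxt = re.sub(r"</[Aa]>", r"", strTxt)
        # write the cleaned paragraph, then reset the buffer for the next one
        f.write(strTxt)
        strTxt = ""
    f.close()  # close the output file for this URL
    x = x + 1
ff.close()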
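For readers who only have Python 3, the following is a possible port, offered as a sketch rather than as the author's program. The function names extract_text and crawl are hypothetical, and the default encoding "gbk" is an assumption about the charset of the target pages; change it if the site uses something else.

```python
import re
from urllib.request import urlopen


def extract_text(html):
    """Apply the same regex clean-up as the original script to one page."""
    pieces = []
    # Contents of every <p>...</p> pair that sits on a single line.
    for para in re.findall(r"<p>(.*?)</p>", html, re.M):
        if not para:
            continue
        para = para.replace(" ", "")
        para = para.replace("<strong>", "").replace("</strong>", "")
        para = re.sub(r"<a href=(.*?)>", "", para)
        para = re.sub(r"<a(.*?)>", "", para)            # opening <a...> tags
        para = re.sub(r"<span>(.*?)</span>", "", para)  # spans, content included
        para = re.sub(r"</[Aa]>", "", para)             # closing </a> tags
        pieces.append(para)
    return "".join(pieces)


def crawl(list_path="wangzhi.txt", encoding="gbk"):
    """Fetch each URL in `list_path` and save its text to 1.txt, 2.txt, ...

    `encoding` is an assumed page charset, not something the original states.
    """
    with open(list_path, "r") as url_file:
        urls = [line.strip() for line in url_file if line.strip()]
    for index, url in enumerate(urls, start=1):
        print(url)
        with urlopen(url) as response:
            html = response.read().decode(encoding, errors="replace")
        with open(str(index) + ".txt", "w", encoding="utf-8") as out:
            out.write(extract_text(html))
```

Calling crawl() from the directory that contains wangzhi.txt would then write one numbered text file per URL. Keeping the clean-up in extract_text makes it possible to try the regexes on a string without touching the network.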

The contents of wangzhi.txt are as follows:

http://sports.163.com/14/1126/22/AC0TVK4E00052UUC.html
http://sports.163.com/14/1126/22/AC0TGD4700052UUC.html
http://sports.163.com/14/1126/22/AC0TAHNK00052UUC.html
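Lines read from a list file like this one carry a trailing newline, and a blank line would otherwise be handed to urlopen as an empty URL. Below is a small sketch of a loader that strips whitespace and keeps only http/https entries; the function name read_urls is hypothetical.

```python
from urllib.parse import urlparse


def read_urls(path):
    """Return the http(s) URLs listed one per line in `path`, in file order."""
    urls = []
    with open(path, "r") as handle:
        for line in handle:
            candidate = line.strip()  # drop the newline and surrounding spaces
            if not candidate:
                continue              # skip blank lines
            parts = urlparse(candidate)
            # Keep only entries that look like absolute web addresses.
            if parts.scheme in ("http", "https") and parts.netloc:
                urls.append(candidate)
    return urls
```

The returned list can be looped over in place of the raw file lines, so a stray blank or malformed line no longer consumes an output file number.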

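Result analysis:

Running the program produces three output files (1.txt, 2.txt and 3.txt), one for each of the three URLs, each containing the paragraph text extracted from the corresponding web page.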
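What ends up in each file depends on the pattern <p>(.*?)</p>. It matches only <p> tags without attributes, and because "." does not match a newline unless re.S is given (re.M changes only the behaviour of ^ and $), a paragraph that spans several lines in the HTML source is skipped. The sketch below shows the difference on a made-up fragment; the sample string is not taken from the pages above.

```python
import re

# Made-up sample: one single-line paragraph and one spanning two lines.
sample = "<p>first</p>\n<p>second\nhalf</p>"

pattern = r"<p>(.*?)</p>"

# re.M only affects ^ and $, so "." still stops at newlines.
with_m = re.findall(pattern, sample, re.M)

# re.S lets "." match newlines, so the two-line paragraph is found as well.
with_s = re.findall(pattern, sample, re.S)

print(with_m)  # ['first']
print(with_s)  # ['first', 'second\nhalf']
```

If the output files turn out shorter than the article on the page, switching the flag to re.S is one thing to try.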


Original article: http://blog.csdn.net/sxhlovehmm/article/details/41553705
