Scraping Sina Blog with Python 2.7

Posted: 2015-10-04 12:25:58

#coding=utf-8
# Sina Blog scraper (Python 2.7)
import urllib
import re
import os
url=['']*1500    # URL of each blog post
title=['']*1500  # title of each blog post
page=1   # current page of the article list
count=1  # running article count
while page<=9:
	con=urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1193491727_0_'+str(page)+'.html').read()
	i=0
	hrefstart=con.find(r'href="http://blog.sina.com.cn/s/blog_')
	print hrefstart
	hrefend=con.find(r'.html',hrefstart)
	print hrefend
	titlestart=con.find(r'>',hrefend)
	print titlestart
	titleend=con.find(r'</a>',titlestart)
	print titleend

	# stop as soon as any marker is missing (find returns -1)
	while i<=50 and hrefstart!=-1 and hrefend!=-1 and titleend!=-1:
		url[i]=con[hrefstart+6:hrefend+5]    # skip 'href="', keep '.html'
		title[i]=con[titlestart+1:titleend]  # skip the '>' that closes the <a ...> tag
		print page,i,count,title[i]
		print url[i]
		# locate the next link on this list page
		hrefstart=con.find(r'href="http://blog.sina.com.cn/s/blog_',titleend)
		hrefend=con.find(r'.html',hrefstart)
		titlestart=con.find(r'>',hrefend)
		titleend=con.find(r'</a>',titlestart)
		# download the current article and save it under ./1
		content=urllib.urlopen(url[i]).read()
		filename=url[i][-26:]  # last 26 characters of the URL: blog_<id>.html
		print filename
		if not os.path.isdir('1'):
			os.mkdir('1')
		with open('1/'+filename,'wb') as target:  # the with block closes the file
			target.write(content)
		i=i+1
		count=count+1
	else:
		print page,'reached the end of this page'
	page=page+1
else:
	print 'task finished'

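This script uses Python 2.7 to scrape Wang Shi's articles from Sina Blog.

It walks through multiple pages of the article list and downloads each article to a local directory.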
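
The list-page parsing above relies on chained `find` calls. As a rough alternative, the same (URL, title) pairs could be pulled out with one regular expression. The sketch below is only an illustration: `LINK_RE`, `extract_links`, and the sample markup are made up for this example, and the pattern assumes each link looks like `href="...blog_<id>.html"` followed by the title and `</a>`, which should be checked against the real page source.

```python
import re

# Assumed link shape: href="http://blog.sina.com.cn/s/blog_<id>.html",
# optional further attributes, then the title text and a closing </a>.
LINK_RE = re.compile(
    r'href="(http://blog\.sina\.com\.cn/s/blog_\w+\.html)"[^>]*>(.*?)</a>',
    re.S)

def extract_links(html):
    """Return a list of (url, title) tuples found in a list page."""
    return [(u, t.strip()) for u, t in LINK_RE.findall(html)]

# Made-up sample markup, for illustration only.
sample = ('<a title="" target="_blank" '
          'href="http://blog.sina.com.cn/s/blog_0123456789abcdef.html">First post</a>'
          '<a href="http://blog.sina.com.cn/s/blog_fedcba9876543210.html"> Second post </a>')
links = extract_links(sample)
```

Because `findall` returns every match at once, the manual bookkeeping of `hrefstart`, `hrefend`, `titlestart`, and `titleend` is no longer needed.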

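This is just a practice project. If you have better code, please share it with me.

Feedback is welcome.

A few things are still to be done:

1. Use regular expressions to extract the article content from each page.

2. Name the download directory automatically after the download time.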
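
One possible way to approach these two to-do items is sketched below. It is not part of the original script. The helper names `dated_dir` and `extract_body` are made up here, and the class name `article-body` is a placeholder: the real container on a Sina Blog article page has to be looked up in the page source. The regular expression also assumes the container holds no nested `<div>`, otherwise it stops at the first inner `</div>`.

```python
import os
import re
import time

def dated_dir(base=".", when=None):
    """Create (if needed) and return a directory named after the download time."""
    # when is a time.struct_time; default to the current local time
    name = time.strftime("%Y%m%d_%H%M%S", when or time.localtime())
    path = os.path.join(base, name)
    if not os.path.isdir(path):
        os.makedirs(path)
    return path

# Placeholder container: replace "article-body" with the class used by the real page.
BODY_RE = re.compile(r'<div[^>]*class="article-body"[^>]*>(.*?)</div>', re.S)

def extract_body(html):
    """Return the inner HTML of the first matching container, or None."""
    m = BODY_RE.search(html)
    return m.group(1).strip() if m else None
```

In the script above, `dated_dir()` could replace the hard-coded `1` directory, and `extract_body(content)` could be applied before writing each file.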

Original article: http://www.cnblogs.com/doyonevertodo/p/4854359.html
