Scraping Blog Data

Date: 2016-04-24 11:10:29

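# coding: utf-8
# Downloads every article linked from the first 7 article-list pages of a Sina blog.

import os
import time
import urllib.request

os.makedirs('hanhan', exist_ok=True)  # make sure the output directory exists

url = [''] * 350
page = 1
link = 1
while page <= 7:
    # Fetch one list page as bytes; the markers below are searched as bytes too
    con = urllib.request.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    i = 0
    title = con.find(b'<a title=')
    href = con.find(b'href=', title)
    html = con.find(b'.html', href)

    while title != -1 and href != -1 and html != -1 and i < 50:
        # href + 6 skips 'href="'; html + 5 keeps the '.html' suffix
        url[i] = con[href + 6 : html + 5].decode('utf-8')
        print(link, url[i])
        content = urllib.request.urlopen(url[i]).read()
        # The last 26 characters of the URL are used as the local file name
        with open('hanhan/' + url[i][-26:], 'wb') as f:
            f.write(content)
        print('downloading', url[i])
        time.sleep(1)  # pause between requests
        # Continue searching after the link that was just handled
        title = con.find(b'<a title=', html)
        href = con.find(b'href=', title)
        html = con.find(b'.html', href)
        i = i + 1
        link = link + 1
    else:
        print(page, 'find end!')
    page = page + 1
else:
    print('all find end')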
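
The link extraction in the script above depends on fixed string offsets, which are easier to check in isolation. The following is a minimal sketch, not part of the original post: the function name `extract_article_links` and the sample markup are hypothetical, made up for illustration, and the sample only assumes that each list entry looks like `<a title=... href="...html">`. It is written for Python 3 and works on text rather than on a downloaded page.

```python
def extract_article_links(page_html, limit=50):
    """Return the URLs that follow each '<a title=' marker in page_html."""
    links = []
    title = page_html.find('<a title=')
    while title != -1 and len(links) < limit:
        href = page_html.find('href=', title)
        if href == -1:
            break
        html = page_html.find('.html', href)
        if html == -1:
            break
        # href + 6 skips the six characters of 'href="';
        # html + 5 keeps the five characters of '.html'
        links.append(page_html[href + 6 : html + 5])
        # Resume the search after the link that was just extracted
        title = page_html.find('<a title=', html)
    return links


# Hypothetical sample markup (an assumption, not copied from a real page)
sample = (
    '<a title="First" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_aaaa.html">First</a>'
    '<a title="Second" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_bbbb.html">Second</a>'
)
links = extract_article_links(sample)
print(links)
```

Keeping the extraction in a function of its own makes it possible to test the offsets without any network access, and to swap in a real HTML parser later if the page layout changes.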

Original post: http://www.cnblogs.com/XDJjy/p/5426510.html
