
Crawling Blog Data

Date: 2016-04-24 11:10:29


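#coding:utf-8
# Python 2 script: downloads every article linked from the first 7 list pages
# of a Sina blog into the local directory hanhan/ (which must already exist).

import urllib
import time

url = [''] * 350   # URLs of the articles found on the current list page
page = 1           # current list page number
link = 1           # running count of articles found
while page <= 7:
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    i = 0
    # Locate the first article link of the form <a title=... href="....html">
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)

    while title != -1 and href != -1 and html != -1 and i < 50:
        # Skip the 6 characters of href=" and keep the 5 characters of .html
        url[i] = con[href + 6 : html + 5]
        print link, url[i]
        content = urllib.urlopen(url[i]).read()
        # The last 26 characters of the URL serve as the local file name
        open(r'hanhan/' + url[i][-26:], 'w+').write(content)
        print 'downloading', url[i]
        time.sleep(1)  # pause between requests
        # Continue scanning after the link that was just processed
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        i = i + 1
        link = link + 1
    else:
        print page, 'find end!'
    page = page + 1
else:
    print 'all find end'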
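
The link-scanning step in the script can be tried without any network access. The following is a minimal Python 3 sketch of the same find-based scan, not the author's code: the function names extract_article_links and local_name, the limit parameter, and the two article URLs in sample are made up for illustration, and it assumes each link appears as <a title=... href="....html"> with a double quote after href=.

```python
def extract_article_links(page_html, limit=50):
    """Collect article URLs from one list page with a str.find scan."""
    links = []
    pos = 0
    while len(links) < limit:
        # Find the next anchor, then its href, then the end of the URL.
        title = page_html.find('<a title=', pos)
        if title == -1:
            break
        href = page_html.find('href=', title)
        if href == -1:
            break
        end = page_html.find('.html', href)
        if end == -1:
            break
        # Skip the 6 characters of href=" and keep the 5 characters of .html
        links.append(page_html[href + 6:end + 5])
        pos = end  # resume scanning after this link
    return links


def local_name(url, length=26):
    """Use the last `length` characters of the URL as a file name."""
    return url[-length:]


# Hypothetical list-page fragment; the article IDs are invented.
sample = (
    '<a title="" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_4701280b0102e0p3.html">A</a>'
    '<a title="" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_4701280b0102e0kv.html">B</a>'
)

links = extract_article_links(sample)
for number, link in enumerate(links, start=1):
    print(number, link, local_name(link))
```

Checking each find result before using it avoids passing -1 as a start offset, which str.find would interpret as "start at the last character". A real HTML parser would be more robust than this scan if the page markup changes.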


Original source: http://www.cnblogs.com/XDJjy/p/5426510.html
