Crawler Practice: Downloading Han Han's Blog

Published: 2016-07-21 22:02:50

1. Open the article list page of Han Han's blog

http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html

The goal is to collect the hyperlinks to all of the articles.
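The article list is paginated, and the page number is the last number in the URL. A minimal sketch of building the address of each list page follows. It assumes seven list pages, the count used by the script in section 3, and the helper name list_page_url is made up for illustration.

```python
# Hypothetical helper: build the URL of the n-th article-list page.
BASE = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_'

def list_page_url(page):
    return BASE + str(page) + '.html'

# Assumption: the list spans 7 pages, as in the script in section 3.
list_pages = [list_page_url(p) for p in range(1, 8)]
for u in list_pages:
    print(u)
```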

2. Characteristics of Han Han's article list
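Judging from the script in section 3, each article entry on a list page is an anchor that starts with <a title, carries the address in an href="..." attribute, and has an address ending in .html. Here is a sketch of locating one such link with str.find. The sample line is made up, and its article id and title are placeholders rather than real data.

```python
# Made-up sample of one list entry; the id "0000000000000000" is a placeholder.
sample = ('<a title="" target="_blank" '
          'href="http://blog.sina.com.cn/s/blog_0000000000000000.html">'
          'Sample title</a>')

title = sample.find('<a title')      # start of the anchor tag
href = sample.find('href=', title)   # the href attribute that follows it
html = sample.find('.html', href)    # end of the address

# href=" is 6 characters long and .html is 5 characters long
link = sample[href + 6:html + 5]
print(link)
```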

3. Key techniques

· The string method find

· List slicing: list[-x:-y]

· Reading and writing files
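Slicing and file handling can be tried in isolation before reading the full script. This is a small sketch under stated assumptions: the link and its 16-character id are placeholders, and a temporary directory is used so nothing is left behind. Strings slice the same way lists do.

```python
import os
import tempfile

# Negative indices count from the end of a sequence.
parts = ['a', 'b', 'c', 'd', 'e']
middle = parts[-3:-1]          # from 3 before the end up to 1 before the end

# Placeholder link; the id is made up.
link = 'http://blog.sina.com.cn/s/blog_0000000000000000.html'
file_name = link[-26:]         # last 26 characters
extension = link[-5:]          # last 5 characters
article_id = link[-21:-5]      # between those two positions
print(middle, file_name, extension, article_id)

# Write some text to a file, then read it back.
folder = tempfile.mkdtemp()
path = os.path.join(folder, file_name)
with open(path, 'w') as f:
    f.write('<html>placeholder</html>')
with open(path) as f:
    saved = f.read()
print(saved)
```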

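# coding: utf-8
# Collect the article links from every list page, then download each article (Python 3).
import os
import time
import urllib.request

urls = []   # article links gathered from all list pages
page = 1
while page <= 7:
    # Decode leniently; the markers searched for below are plain ASCII.
    con = urllib.request.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read().decode('utf-8', 'ignore')
    title = con.find('<a title')
    href = con.find('href=', title)
    html = con.find('.html', href)
    i = 0
    while title != -1 and href != -1 and html != -1 and i < 80:
        # Skip the 6 characters of href=" and keep the 5 characters of .html
        urls.append(con[href + 6:html + 5])
        print(len(urls), '   ', urls[-1])
        i = i + 1
        title = con.find('<a title', html)
        href = con.find('href=', title)
        html = con.find('.html', href)
    else:
        print(page, 'find end!')
    page = page + 1

os.makedirs('blog', exist_ok=True)   # make sure the output folder exists
j = 0
while j < len(urls):
    content = urllib.request.urlopen(urls[j]).read()
    # The last 26 characters of the link serve as the file name.
    with open('blog/' + urls[j][-26:], 'wb') as f:
        f.write(content)
    j = j + 1
    time.sleep(1)   # pause between requests to go easy on the server
else:
    print('download article finished!')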

· while loops
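The script above pairs each while loop with an else clause. In Python the else block runs once the loop condition becomes false and is skipped only when the loop is left through break. A small sketch with made-up data:

```python
# Count how many times 'ab' occurs, scanning with find() inside a while loop.
text = 'ab--ab--ab'
pos = text.find('ab')
count = 0
finished = False
while pos != -1:
    count = count + 1
    pos = text.find('ab', pos + 1)
else:
    # Reached because the condition became false; a break would have skipped it.
    finished = True
    print('search finished')
print(count)
```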

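4. Implementation steps

· Confirm that the first page of Han Han's article list opens in a browser

· Extract the article links from that first page

· Extract the article links from every article-list page

· Download the HTML file behind every link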
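These steps can also be arranged as small functions. The sketch below is not the author's script. The function names are invented, and the page fetcher is passed in as a parameter so the flow can be tried offline with canned HTML instead of the real site. All addresses and ids in it are placeholders.

```python
def extract_links(html_text):
    """Collect every address that follows '<a title' and ends in '.html'."""
    links = []
    title = html_text.find('<a title')
    while title != -1:
        href = html_text.find('href=', title)
        end = html_text.find('.html', href)
        if href == -1 or end == -1:
            break
        links.append(html_text[href + 6:end + 5])
        title = html_text.find('<a title', end)
    return links

def crawl(fetch, list_urls):
    """Gather links from each list page, then fetch each linked article."""
    articles = {}
    for list_url in list_urls:
        for link in extract_links(fetch(list_url)):
            articles[link[-26:]] = fetch(link)   # key: trailing file name
    return articles

# Canned pages standing in for the real site.
fake_site = {
    'list_1': '<a title="" href="http://example.com/blog_000000000000000a.html">A</a>'
              '<a title="" href="http://example.com/blog_000000000000000b.html">B</a>',
    'http://example.com/blog_000000000000000a.html': 'article A',
    'http://example.com/blog_000000000000000b.html': 'article B',
}

articles = crawl(fake_site.get, ['list_1'])
for name in sorted(articles):
    print(name, '->', articles[name])
```

To run against the real site, fetch could be a function that wraps urllib.request.urlopen(url).read() and decodes the result, ideally with a short time.sleep between requests as in the script in section 3.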


Original article: http://www.cnblogs.com/fjl-vxee/p/5693201.html
