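After studying for a while, I have finally got the basics down.
To practice, I'll start by crawling a CSDN blog.
The overall approach is to first work out how many pages the blog has,
then build the URL of each list page from its page number,
and finally scrape the title and URL of every post on each page.
BeautifulSoup is used here to parse the pages.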
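The page-counting step can be tried out without touching the network. The following is only a minimal sketch in Python 3 (the script in this post is Python 2): `collect_items`, `fetch_page` and `fake_fetch` are hypothetical names made up for illustration, and the page size of 15 is simply the value the script below assumes for CSDN list pages.

```python
def collect_items(fetch_page, page_size=15):
    """Request pages 1, 2, 3, ... until one returns fewer than page_size items.

    fetch_page is a hypothetical callable: page number -> list of items.
    Returns (number of pages requested, all items collected).
    """
    items = []
    page = 1
    while True:
        batch = fetch_page(page)
        items.extend(batch)
        if len(batch) < page_size:
            # A short (or empty) page means there is nothing after it.
            return page, items
        page += 1


# A fake blog with 32 posts, served 15 per page, so no network is needed.
posts = ["post-%d" % n for n in range(32)]


def fake_fetch(page):
    start = (page - 1) * 15
    return posts[start:start + 15]


pages_requested, found = collect_items(fake_fetch)
print(pages_requested, len(found))  # prints: 3 32
```

Note that when the number of posts is an exact multiple of 15, this loop requests one extra, empty page before it stops, so the count it returns is the number of pages requested rather than the number of non-empty pages.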
#coding=utf-8
import urllib2
from bs4 import BeautifulSoup
import sys

# Python 2 only: make utf-8 the default encoding so str() works on Chinese titles
reload(sys)
sys.setdefaultencoding('utf-8')


def query_item(html, cla=None):
    '''
    Return the <div> elements of the page, optionally filtered by class
    '''
    soup = BeautifulSoup(html, "html.parser")
    if cla is None:
        return soup.find_all('div')
    return soup.find_all('div', class_=cla)


# List pages look like http://blog.csdn.net/zhaoyl03/article/list/1
host_url = 'http://blog.csdn.net'
list_url = host_url + '/zhaoyl03/article/list/'
req_header = {
    'Host': "blog.csdn.net",
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
    'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    'Accept-Language': "zh-CN,zh;q=0.8",
    'Connection': "keep-alive",
    "Cache-Control": "max-age=0",
    "Referer": "http://blog.csdn.net"}

blog_art = []
i = 1
# This loop finds the last page number and puts the posts of every page into a list.
# A page with fewer than 15 posts is treated as the last one.
while True:
    req = urllib2.Request(list_url + str(i), None, req_header)
    result = urllib2.urlopen(req, None)
    article_items = query_item(result.read(), 'list_item article_item')
    blog_art.extend(article_items)
    if len(article_items) < 15:
        break
    i += 1

# Now i is the number of valid pages and blog_art holds all the posts
query_result = {}
for x in blog_art:
    # Only look at the <a> inside the title span, skipping any text nodes
    for y in x.find('span', 'link_title').find_all('a'):
        # Get the title and url of every post
        query_result[str(y.get_text())] = str(host_url + y.get('href'))

print len(query_result)
for x, y in query_result.items():
    print x + ':' + y
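For anyone on Python 3, where `urllib2` no longer exists, the request-building part of the script might look roughly like the sketch below. It only builds the request and never sends it; `build_request` and `absolute_url` are hypothetical helper names, the header set is trimmed to two entries for brevity, and using `urljoin` instead of plain string concatenation is my own substitution, not what the script above does.

```python
from urllib.parse import urljoin
from urllib.request import Request

HOST_URL = "http://blog.csdn.net"
LIST_URL = HOST_URL + "/zhaoyl03/article/list/"

# Trimmed copy of the headers used in the script (assumption: these two are enough to illustrate).
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
    "Referer": HOST_URL,
}


def build_request(page):
    """Build (but do not send) the request for one list page."""
    return Request(LIST_URL + str(page), headers=HEADERS)


def absolute_url(href):
    """Turn a site-relative href such as /zhaoyl03/article/details/... into a full URL."""
    return urljoin(HOST_URL, href)


req = build_request(2)
print(req.full_url)
print(absolute_url("/zhaoyl03/article/details/8179441"))
```

Passing the result of `build_request` to `urllib.request.urlopen` would then play the role of `urllib2.urlopen(req, None)` in the script.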
The result is as follows:
Open source robotics toolkits: use virtual arenas to test your robotics algorithms :http://blog.csdn.net/zhaoyl03/article/details/8179441
设计模式: 观察者模式 :http://blog.csdn.net/zhaoyl03/article/details/40223067
霍布斯:人对人像狼一样 :http://blog.csdn.net/zhaoyl03/article/details/8158739
ChiMerge 算法: 以鸢尾花数据集为例 :http://blog.csdn.net/zhaoyl03/article/details/8689440
用python 写爬虫,去爬csdn的内容,完美解决 403 Forbidden :http://blog.csdn.net/zhaoyl03/article/details/8631897
使用 Matlab 的 bvp4c 求解边值问题 :http://blog.csdn.net/zhaoyl03/article/details/8153140
牛顿下山法 :http://blog.csdn.net/zhaoyl03/article/details/8228732
小玩意系列:增强Windows运行栏的功能 (一) :http://blog.csdn.net/zhaoyl03/article/details/8887157
UltraEdit中使用正则表达式替换 :http://blog.csdn.net/zhaoyl03/article/details/8432129
苏格拉底:自知其无知 :http://blog.csdn.net/zhaoyl03/article/details/8158793
如何从一个文件中删除另一个文件的重复项 :http://blog.csdn.net/zhaoyl03/article/details/8188264
向 PPT 表格中添加行或列 :http://blog.csdn.net/zhaoyl03/article/details/8156308
编程之美“字符串移位包含的问题”的另一种解法 :http://blog.csdn.net/zhaoyl03/article/details/8656755
C++ 文件结束符 :http://blog.csdn.net/zhaoyl03/article/details/8165989
MATLAB中的一些小技巧 :http://blog.csdn.net/zhaoyl03/article/details/8155941
Mathematica中清除一系列符号定义的函数 :http://blog.csdn.net/zhaoyl03/article/details/8205689
Talking about the Computational Future at SXSW 2013 :http://blog.csdn.net/zhaoyl03/article/details/8822284
Bruno Buchberger: A life devoted to symbolic computation :http://blog.csdn.net/zhaoyl03/article/details/8612627
妙用Windows“运行” :http://blog.csdn.net/zhaoyl03/article/details/8874937
MATLAB 函数句柄的用法 :http://blog.csdn.net/zhaoyl03/article/details/8215588
Clenshaw–Curtis quadrature :http://blog.csdn.net/zhaoyl03/article/details/8500408
有关cin.fail,cin.clear,cin.sync的应用 :http://blog.csdn.net/zhaoyl03/article/details/8167049
Visual Studio Command Window :http://blog.csdn.net/zhaoyl03/article/details/8144816
学习札记:cin.clear(istream::failbit) :http://blog.csdn.net/zhaoyl03/article/details/8197649
BloomFilter(布隆过滤器) :http://blog.csdn.net/zhaoyl03/article/details/8653391
海量数据处理(一) :http://blog.csdn.net/zhaoyl03/article/details/8684006
数据库和数据仓库的区别 :http://blog.csdn.net/zhaoyl03/article/details/8655596
设计模式:单例模式 :http://blog.csdn.net/zhaoyl03/article/details/40264363
C++ typedef用法详解 :http://blog.csdn.net/zhaoyl03/article/details/8195621
Python写爬虫——抓取网页并解析HTML :http://blog.csdn.net/zhaoyl03/article/details/8631645
Ubuntu上搭建Hadoop环境(单机模式+伪分布模式) :http://blog.csdn.net/zhaoyl03/article/details/8657104
Tex中的正则表达式替换 :http://blog.csdn.net/zhaoyl03/article/details/8686915
常用DOS命令大全 :http://blog.csdn.net/zhaoyl03/article/details/8144856
Java的第一个程序 :http://blog.csdn.net/zhaoyl03/article/details/8457074
卢梭:人无往不在枷锁之中 :http://blog.csdn.net/zhaoyl03/article/details/8158752
Mathematica 函数调用发生异常时停止计算 :http://blog.csdn.net/zhaoyl03/article/details/8191083
阿达(Ada Lovelace) :http://blog.csdn.net/zhaoyl03/article/details/8279768
学习札记: C++指向字符数组的指针 :http://blog.csdn.net/zhaoyl03/article/details/8274575
小玩意系列:增强Windows运行栏的功能 (二) :http://blog.csdn.net/zhaoyl03/article/details/8887724
Python 排序 :http://blog.csdn.net/zhaoyl03/article/details/8683091
使用Python实现Hadoop MapReduce程序 :http://blog.csdn.net/zhaoyl03/article/details/8657031
数学之美番外篇:平凡而又神奇的贝叶斯方法 :http://blog.csdn.net/zhaoyl03/article/details/8655464
制作网页访问者的地图 :http://blog.csdn.net/zhaoyl03/article/details/8531409
C++的atof() :http://blog.csdn.net/zhaoyl03/article/details/8176387
数据挖掘学习札记:KNN算法(二) :http://blog.csdn.net/zhaoyl03/article/details/8679256
数据挖掘学习札记:ID3算法(一) :http://blog.csdn.net/zhaoyl03/article/details/8665663
C/C++编译器-cl.exe的命令选项 :http://blog.csdn.net/zhaoyl03/article/details/8144675
Excel表格乘法函数公式 :http://blog.csdn.net/zhaoyl03/article/details/8208537
使用 windbg 分析 minidump :http://blog.csdn.net/zhaoyl03/article/details/8217337
优秀asp.net程序员修炼之路 :http://blog.csdn.net/zhaoyl03/article/details/8456466
在 CSDN 网页上插入数学公式 :http://blog.csdn.net/zhaoyl03/article/details/8153608
使用Chebfun求解Blasius方程(二) :http://blog.csdn.net/zhaoyl03/article/details/8266419
Python与简单网络爬虫的编写 :http://blog.csdn.net/zhaoyl03/article/details/8631928
[学者笔谈]史占中:大国崛起:从中国制造到中国智造 :http://blog.csdn.net/zhaoyl03/article/details/8177741
关于Mathematica系统通讯机制MathLink的研究 :http://blog.csdn.net/zhaoyl03/article/details/8181690
Physicists Discover a Whopping 13 New Solutions to Three-Body Problem :http://blog.csdn.net/zhaoyl03/article/details/8822310
如何利用Mathematica调用C编写的函数 :http://blog.csdn.net/zhaoyl03/article/details/8181706
小玩意系列:Python调用Google翻译 :http://blog.csdn.net/zhaoyl03/article/details/8830806
初窥Applet :http://blog.csdn.net/zhaoyl03/article/details/8810940
查尔斯·巴贝奇——计算机先驱者之父 :http://blog.csdn.net/zhaoyl03/article/details/8279940
Lobatto quadrature :http://blog.csdn.net/zhaoyl03/article/details/8155438
Matlab 中输入希腊字母 :http://blog.csdn.net/zhaoyl03/article/details/8147696
EXCEL如何设置打印区域 :http://blog.csdn.net/zhaoyl03/article/details/8144595
批处理for命令详解 :http://blog.csdn.net/zhaoyl03/article/details/8886067
sizeof :http://blog.csdn.net/zhaoyl03/article/details/9090639
cin.get,cin.clear以及cin.sync :http://blog.csdn.net/zhaoyl03/article/details/8167024
数据挖掘学习札记:KNN算法(一) :http://blog.csdn.net/zhaoyl03/article/details/8666906
Chebyshev 展开 :http://blog.csdn.net/zhaoyl03/article/details/8494474
Python yield :http://blog.csdn.net/zhaoyl03/article/details/8683936
苏格拉底:“认识你自己” :http://blog.csdn.net/zhaoyl03/article/details/8158812
cin.get()、流和缓冲区 :http://blog.csdn.net/zhaoyl03/article/details/8165889
C++使用system带参数调用exe :http://blog.csdn.net/zhaoyl03/article/details/8176699
数据挖掘学习札记:KNN算法(三) :http://blog.csdn.net/zhaoyl03/article/details/8679378
学习札记: C++指向函数的指针 :http://blog.csdn.net/zhaoyl03/article/details/8195922
OpenCL开发案例学习 :http://blog.csdn.net/zhaoyl03/article/details/8517369
使用Chebfun求解Blasius方程(一) :http://blog.csdn.net/zhaoyl03/article/details/8263627
在网页上嵌入搜索和访问计数器 :http://blog.csdn.net/zhaoyl03/article/details/8524693
Shanks transformation :http://blog.csdn.net/zhaoyl03/article/details/8607019
[Finished in 2.5s]
Original article: http://www.cnblogs.com/csy2994/p/4737305.html