Python Web Scraping: beautifulsoup4 Series, Part 1

Date: 2017-05-27 22:27:32

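Preface

Using cnblogs (博客园) as an example, this post scrapes the publication date, title, and summary of each post on my blog's home page. It is a quick first taste of what beautifulsoup4 can do; later posts cover its features in detail.

I. Installation

1. Open cmd and install beautifulsoup4 online with pip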

>pip install beautifulsoup4
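
To confirm that the install worked, a quick check along these lines can help. It is a minimal sketch: the package installed as beautifulsoup4 is imported under the name bs4, and the one-line HTML snippet is made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed version; a 4.x value means the install succeeded.
print(bs4.__version__)

# Parse a tiny, made-up snippet to confirm the library works end to end.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
greeting = soup.p.string
print(greeting)
```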

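II. Parsers

1. We mainly use the first one, html.parser, which is part of the Python standard library and works out of the box. The other parsers each require installing the corresponding package.

The table below lists the main parsers along with their advantages and disadvantages:

[Table image: main parsers and their advantages and disadvantages]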
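
As an illustration of how the parser choice is passed to BeautifulSoup, here is a minimal sketch. The helper name make_soup and the sample HTML are hypothetical; the sketch assumes you may want to try a third-party parser such as lxml first and fall back to the built-in html.parser when it is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up HTML used only for this illustration.
html = "<html><body><p class='intro'>Hello, parser</p></body></html>"


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html)
intro = soup.find("p", class_="intro").string
print(intro)
```

Naming the parser explicitly, as in the second argument above, keeps results consistent across machines that have different parsers installed.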

III. Printing the post dates on the home page

1. The date is hard to locate directly, so first locate its parent element: class="dayTitle"

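2. Use the get method from requests to open the blog home page. r.content returns the full HTML of the page as raw bytes (a byte string, str, in Python 2)

3. Find all Tag objects whose class attribute is dayTitle

4. For each Tag, get the string value of its a tag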
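
The steps above can be sketched without touching the network. The HTML below is made up and only imitates the assumed structure (an a tag inside an element with class="dayTitle"); the real page may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the assumed structure of the home page.
html = """
<div class="day">
  <div class="dayTitle"><a href="#">2017-05-27</a></div>
</div>
<div class="day">
  <div class="dayTitle"><a href="#">2017-05-26</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every Tag whose class attribute includes "dayTitle".
times = soup.find_all(class_="dayTitle")

# .a is the first <a> under each Tag; .string is that tag's text.
dates = [tag.a.string for tag in times]
for d in dates:
    print(d)
```

The keyword is class_ with a trailing underscore because class is a reserved word in Python.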

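IV. Printing the summary

1. Titles are fetched the same way as above. The summary is slightly different: the parent element <div class="c_b_p_desc"> holds the summary text plus an extra child a element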

2. First get the div Tag. A tag's .contents attribute returns the tag's child nodes as a list

3. Since the summary text is the first child node, indexing with [0] reads it
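
To see why index [0] picks out the summary, here is a minimal sketch on made-up HTML. The summary text and link label are invented; only the class name c_b_p_desc comes from the post.

```python
from bs4 import BeautifulSoup

# Made-up HTML: summary text followed by a child <a>, as described above.
html = (
    '<div class="postCon">'
    '<div class="c_b_p_desc">A short, made-up summary.'
    '<a href="#">Read more</a></div>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
desc = soup.find(class_="c_b_p_desc")

# .contents lists the direct children: the text node first, then the <a> Tag.
children = desc.contents
summary = children[0]
print(summary)
```

If the real markup has whitespace or a line break before the text, the first child includes it, so calling .strip() on the result may be needed.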

V. Reference code

# coding:utf-8
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.cnblogs.com/yoyoketang/")
# After requesting the home page, take the whole HTML page (raw bytes)
blog = r.content
# print(blog)
# Parse the HTML with html.parser
soup = BeautifulSoup(blog, "html.parser")
# Find every element whose class attribute is dayTitle; each result is a Tag
times = soup.find_all(class_="dayTitle")
# for i in times:
#     print(i.a.string)  # text of the a tag

# Find every element whose class attribute is postTitle
title = soup.find_all(class_="postTitle")
# for i in title:
#     print(i.a.string)

# Read the summary content
descs = soup.find_all(class_="postCon")
# for i in descs:
#     # a tag's .contents attribute returns its child nodes as a list
#     c = i.div.contents[0]  # take the first child
#     print(c)

# Print date, title, and summary for each post, separated by a blank line
for i, j, k in zip(times, title, descs):
    print(i.a.string)
    print(j.a.string)
    print(k.div.contents[0])
    print("")

Original article: http://www.cnblogs.com/yoyoketang/p/6901443.html