[Python 3 Web Scraping] Using the Beautiful Soup Library

Posted: 2018-03-28 20:27:14


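I studied regular expressions earlier, but writing a web scraper with regular expressions alone turns out to be quite complicated. That is where Beautiful Soup comes in.

Put simply, Beautiful Soup is a Python library whose main job is extracting data from web pages.

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape. Because it is simple, a complete application takes very little code.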
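As a rough illustration of the difference, the sketch below pulls link URLs out of a small snippet twice, once with a regular expression and once with Beautiful Soup. The sample markup and variable names are made up for this example, and find_all() is a Beautiful Soup method that the rest of this post does not cover.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
sample = (
    '<p><a href="http://example.com/a" class="x">A</a> '
    '<a class="y" href="http://example.com/b">B</a></p>'
)

# Regex approach: the pattern has to anticipate attribute order, quoting and spacing.
regex_links = re.findall(r'<a\s[^>]*?href="([^"]*)"', sample)

# Beautiful Soup approach: parse once, then read the href of every <a> tag.
soup = BeautifulSoup(sample, "html.parser")
soup_links = [a["href"] for a in soup.find_all("a")]

print(regex_links)
print(soup_links)
```

Both lists hold the same two URLs here, but the regular expression only works as long as the markup keeps the shape the pattern expects.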

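Installing Beautiful Soup

Install it with pip:

pip install beautifulsoup4

When pip reports that beautifulsoup4 was successfully installed, the installation is complete.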

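Using Beautiful Soup

1. Import BeautifulSoup from the bs4 package

from bs4 import BeautifulSoup

2. Define the HTML content (used by the examples that follow)

The HTML below is reused throughout the examples. It is an excerpt from Alice in Wonderland (referred to from here on as the "Alice" document):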

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

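3. Create a BeautifulSoup object

# Create the BeautifulSoup object. Naming the parser explicitly
# ("html.parser" ships with Python) avoids a warning about an unspecified parser.
soup = BeautifulSoup(html, "html.parser")
# If the HTML is stored in a file named a.html, create the object like this instead:
# with open("a.html") as f:
#     soup = BeautifulSoup(f, "html.parser")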

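4. Pretty-print the document

# Pretty-print the parse tree
print(soup.prettify())

This prints the document re-indented, with each tag and each string on its own line.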
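To see what prettify() does without the full Alice document, here is a minimal sketch on a one-line snippet. The snippet and the variable names are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document, chosen only to keep the output short.
snippet = '<p class="title"><b>Hi</b></p>'
soup = BeautifulSoup(snippet, "html.parser")

# prettify() returns a string; nested tags and strings are placed on
# separate lines and indented according to their depth in the tree.
pretty = soup.prettify()
print(pretty)

# str(soup) gives the compact form back, without the added line breaks.
compact = str(soup)
print(compact)
```

prettify() is meant for reading the structure by eye. The whitespace it adds changes the strings in the output, so it is better not to feed its result back into a parser.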

5. Beautiful Soup turns a complex HTML document into a complex tree structure

Every node in the tree is a Python object, and the objects fall into four types:

Tag
NavigableString
BeautifulSoup
Comment

(1) Tags

A Tag corresponds to a tag in the HTML, for example:

<a></a>

<p></p>

…

These are all tags.

Here is how Beautiful Soup makes it easy to get Tags:

# Get tags by name
print(soup.title)
# Output: <title>The Dormouse's story</title>
print(soup.head)
# Output: <head><title>The Dormouse's story</title></head>
print(soup.a)
# Output: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.p)
# Output: <p class="title"><b>The Dormouse's story</b></p>

Note that this form returns only the first matching tag in the whole document, as the output for the <a> tag shows.
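The first-match behaviour can be sketched as follows. find_all() is not covered elsewhere in this post; it is shown here only as the usual way to get every match. The shortened HTML is an assumption made to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Shortened version of the Alice document: just the three links.
html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a                  # only the first <a> in the document
all_links = soup.find_all("a")  # every <a>, in document order
missing = soup.table            # no such tag, so the result is None

print(first["id"])                   # link1
print([a["id"] for a in all_links])  # ['link1', 'link2', 'link3']
print(missing)                       # None
```

Because a missing tag yields None rather than an error, chaining such as soup.table.name fails with AttributeError, so it is worth checking for None first.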

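type() confirms the type of these objects:

# Check the type of a retrieved tag
print(type(soup.title))
# Output: <class 'bs4.element.Tag'>

A Tag also has two important attributes, name and attrs:

# Inspect the name and attrs attributes of a Tag
print(soup.a.name)
# Output: a
print(soup.a.attrs)
# Output: {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

As the output shows, the attrs attribute of the <a> tag is a dictionary. To read a specific value from it:

p = soup.a.attrs
print(p['class'])
# print(p.get('class')) returns the same value here; unlike p['class'], get() returns None instead of raising KeyError when the key is missing
# Output: ['sister']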
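A Tag can also be indexed directly, without going through attrs. The sketch below shows the equivalent forms. The extra "big" class is an assumption added to show why class comes back as a list.

```python
from bs4 import BeautifulSoup

# One link from the Alice document, with a second class added for illustration.
soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>',
    "html.parser",
)
tag = soup.a

href = tag["href"]            # shorthand for tag.attrs["href"]
same_href = tag.attrs["href"]
classes = tag["class"]        # class can hold several values, so it is a list
title = tag.get("title")      # absent attribute: get() returns None

print(href)     # http://example.com/elsie
print(classes)  # ['sister', 'big']
print(title)    # None
```

Indexing an absent attribute with tag["title"] raises KeyError, so get() is the safer choice when an attribute may be missing.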

(2) NavigableString

Now that we have the Tags, how do we get the text inside them?

# Get the text inside a tag (a NavigableString)
print(soup.a.string)
# Output: Elsie

Again, type() shows what kind of object this is:

print(type(soup.a.string))
# Output: <class 'bs4.element.NavigableString'>
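One caveat: .string only works when a tag has a single child. The sketch below, on a shortened paragraph invented for this example, shows what happens otherwise and uses get_text() (not covered in this post) as one way to collect all the text.

```python
from bs4 import BeautifulSoup

# A paragraph with mixed content: plain text plus two nested tags.
soup = BeautifulSoup(
    '<p class="story">Names: <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>',
    "html.parser",
)

single = soup.a.string        # <a> has exactly one child, a string
mixed = soup.p.string         # <p> has several children, so this is None
all_text = soup.p.get_text()  # concatenates every string under <p>

print(single)                   # Elsie
print(mixed)                    # None
print(all_text)                 # Names: Elsie and Lacie
print(isinstance(single, str))  # True: NavigableString subclasses str
```

Since a NavigableString is a str subclass, ordinary string methods work on it. Calling str() on it gives a plain string that no longer refers to the parse tree.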

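(3) BeautifulSoup

The soup object itself also has the name and attrs attributes, although their values are special:

# Inspect the attributes of the BeautifulSoup object
print(soup.name)
# Output: [document]
print(soup.attrs)
# Output: {}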
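The reason soup has these attributes is that the BeautifulSoup class is itself a kind of Tag, representing the whole document. A minimal check, with a throwaway document assumed for the example:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<html><head><title>Demo</title></head></html>", "html.parser")

# The document object passes an isinstance check against Tag.
is_tag = isinstance(soup, Tag)
doc_name = soup.name          # special placeholder name for the document
title_name = soup.title.name  # an ordinary tag reports its real name

print(is_tag)      # True
print(doc_name)    # [document]
print(title_name)  # title
```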

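(4) Comment

Change the first link in the HTML above to the following (the text inside the <a></a> tag becomes a comment):

<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>

After rebuilding soup from the modified HTML, .string still returns the content, with the comment markers stripped:

# Get the text inside the tag
print(soup.a.string)
# Output: Elsie

Its type, however, is Comment rather than NavigableString:

print(type(soup.a.string))
# Output: <class 'bs4.element.Comment'>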
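Because .string prints a comment exactly like ordinary text, it is easy to scrape commented-out content by accident. One possible guard is an isinstance check against Comment, sketched below on a two-link snippet made up for this example.

```python
from bs4 import BeautifulSoup, Comment

# First link holds a comment, second link holds ordinary text.
html = '<a id="link1"><!--Elsie--></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

visible = []
hidden = []
for a in soup.find_all("a"):
    text = a.string
    if isinstance(text, Comment):
        hidden.append(str(text))   # commented-out content
    else:
        visible.append(str(text))  # text a browser would display

print(visible)  # ['Lacie']
print(hidden)   # ['Elsie']
```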

Original article: https://www.cnblogs.com/OliverQin/p/8665448.html
