Python Web Scraping: The Beautiful Soup Parsing Library

Posted: 2018-07-11 16:34:33

Working with Beautiful Soup 4

Why use Beautiful Soup

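Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document.

It looks elements up by tag, somewhat like jQuery, and that saves time: in scraper development we often fall back on regular expressions to search and filter, and writing those by hand is extremely time-consuming.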
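To make the comparison with regular expressions concrete, here is a minimal sketch. The two-link snippet and the regex pattern are made up for illustration, and "html.parser" is used only because it ships with Python. The hand-written pattern silently misses the second link because its attributes appear in a different order, while the tag-based lookup finds both.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> lists its attributes in a different order
html = (
    '<p><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> '
    '<a id="link2" href="http://example.com/lacie">Lacie</a></p>'
)

# Hand-written regex: tied to one exact attribute order
regex_links = re.findall(r'<a class="sister" href="([^"]+)"', html)

# Beautiful Soup: look the tags up by name, then read the attribute
soup = BeautifulSoup(html, "html.parser")
soup_links = [a["href"] for a in soup.find_all("a")]

print(regex_links)  # only the first link
print(soup_links)   # both links
```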

Beautiful Soup example (taken from the official documentation)

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

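First, a quick note on how Beautiful Soup looks things up: it treats the document as a tree of tags.

To use it, you instantiate an object that represents the whole HTML file. Tags are exposed as attributes of that object, so you reach them with ".".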
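As a small illustration of that dot-style navigation, the sketch below parses a shortened copy of the sample document from a string and walks down the tree one tag at a time. It assumes the built-in "html.parser" so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Each "." step descends to the first matching tag below the current one
title_tag = soup.head.title
bold_text = soup.body.p.b.string

print(title_tag.name)  # title
print(bold_text)       # The Dormouse's story
```

Each step returns only the first matching tag, so this style suits documents where the element you want comes first.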

Now for some hands-on operations:

Basic operations

from bs4 import BeautifulSoup

# html_doc.html is assumed to hold the sample document shown above
with open("html_doc.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "lxml")

# Basic operations
# Print the document's <title> tag
print(soup.title)
# <title>The Dormouse's story</title>

# Print the tag's name
print(soup.title.name)
# title

# Print the tag's text content
print(soup.title.string)
# The Dormouse's story

# Print the <p> tag in soup; attribute access returns only the first match
print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

# Print the class of the first <p> tag
print(soup.p['class'], type(soup.p['class']))
# ['title'] <class 'list'>  # the value is a list

# Print the <a> tag in soup; again only the first match
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# Print every <a> tag
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Print the tag whose id is link3
print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# Collect the link of every <a> tag in the document:
for link in soup.find_all('a'):
    print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

# Extract all the text in the document:
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
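The snippet above reads from a file on disk. For readers who just want to try the search calls, here is a self-contained sketch that parses a trimmed copy of the sample document from a string. Using "html.parser" instead of "lxml" is an assumption made to avoid an extra install; the lookups themselves are unchanged.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document, parsed straight from a string
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every match; find returns the first match or None
hrefs = [link.get("href") for link in soup.find_all("a")]
names = [link.get_text() for link in soup.find_all("a")]
third = soup.find(id="link3")

print(hrefs)
print(names)
print(third)
```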

Tag

soup1 = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
tag = soup1.b
print(type(tag))
# <class 'bs4.element.Tag'>

The Tag name attribute

print(tag.name)
# b
# Changing a tag's name affects all HTML markup generated from the current Beautiful Soup object:

tag.name = "blockquote"
print(tag)
# <blockquote class="boldest">Extremely bold</blockquote>
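To show that the rename really propagates to the whole document and not just to the tag variable, here is a minimal sketch. The wrapping <p> element is added for illustration, and "html.parser" is assumed so that no <html>/<body> wrapper appears in the output.

```python
from bs4 import BeautifulSoup

# The <b> tag sits inside a <p> so the effect on the parent is visible
soup1 = BeautifulSoup('<p><b class="boldest">Extremely bold</b></p>', "html.parser")
tag = soup1.b

# Rename the tag in place
tag.name = "blockquote"

# The soup object now serializes with the new name
rendered = str(soup1)
print(rendered)
# <p><blockquote class="boldest">Extremely bold</blockquote></p>

# Lookups by the old name no longer match
print(soup1.b)  # None
```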

Tag attributes

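A tag can have any number of attributes. The tag <b class="boldest"> has one attribute, "class", whose value is "boldest". You work with a tag's attributes the same way you work with a dictionary:
print(tag['class'])
# ['boldest']

# The whole attribute dictionary is also available directly as .attrs:
print(tag.attrs)
# {'class': ['boldest']}
# soup is the object built from html_doc.html in the basic operations above
print(soup.a.attrs['class'])
# ['sister']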

# Attributes can be added, removed, or modified. Again, this works just like a dictionary.

tag['class'] = 'verybold'
tag['id'] = 1
print(tag)
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
print(tag)
# <blockquote>Extremely bold</blockquote>

# Indexing a missing attribute raises an error, so the line stays commented out:
# tag['class']
# KeyError: 'class'
# .get() returns None instead
print(tag.get('class'))
# None
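The dictionary-style access above has a few details worth seeing side by side. The sketch below uses a made-up <a> tag with two class names and assumes "html.parser". It shows that class comes back as a list while id is a plain string, and that get() with a default avoids the KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a two-valued class attribute
soup = BeautifulSoup(
    '<a class="sister big" href="http://example.com/elsie" id="link1">Elsie</a>',
    "html.parser",
)
tag = soup.a

classes = tag["class"]             # multi-valued attribute: a list
tag_id = tag["id"]                 # single-valued attribute: a string
missing = tag.get("title", "n/a")  # default instead of KeyError
has_href = tag.has_attr("href")    # membership test

print(classes, tag_id, missing, has_href)
```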

Working with child nodes:

The .contents attribute

# .contents
# A tag's .contents attribute returns the tag's children as a list:
print(soup)
print(soup.contents)  # a list holding the whole <html> tag
print("________")
print(soup.head.contents)  # the list of children of <head>; the repeated '\n' entries can be filtered out if not needed
# ['\n', <meta charset="utf-8"/>, '\n', <title>The Dormouse's story</title>, '\n']
print(len(soup.head.contents))
# 5
print(soup.head.contents[1].name)
# meta
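The '\n' entries in that list are whitespace text nodes sitting between the tags. One way to skip them, sketched here under the assumption that only element children are of interest, is to keep just the children that are Tag instances. The <head> markup is reconstructed from the output above, and "html.parser" is assumed.

```python
from bs4 import BeautifulSoup, Tag

# <head> markup reconstructed from the printed .contents output
html = """<head>
<meta charset="utf-8"/>
<title>The Dormouse's story</title>
</head>"""

soup = BeautifulSoup(html, "html.parser")
head = soup.head

# .contents is a list that includes the newline text nodes
all_children = head.contents

# .children iterates over the same nodes; keep only real tags
tag_children = [child for child in head.children if isinstance(child, Tag)]
names = [child.name for child in tag_children]

print(len(all_children))  # 5
print(names)              # ['meta', 'title']
```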

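Parsers:

Parser	Typical usage	Advantages	Disadvantages
Python standard library (html.parser)	`BeautifulSoup(markup, "html.parser")`	Built into Python; decent speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, "lxml-xml")` `BeautifulSoup(markup, "xml")`	Very fast; the only XML parser supported	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses documents the way a browser does; produces valid HTML5	Very slow; external Python dependency (a separate install, though no C extension is needed)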
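The difference in leniency is easiest to see on broken markup. The sketch below feeds the same invalid fragment to each HTML parser in the table and skips any parser that is not installed. The fragment "<a></p>" is just an illustrative input, and only the html.parser result is annotated because it is the one parser guaranteed to be present.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid fragment: a closing </p> with no opening tag
markup = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        results[parser] = None

for parser, rendered in results.items():
    print(f"{parser:12} {rendered if rendered is not None else '(not installed)'}")

# html.parser simply drops the stray </p>: <a></a>
```

Because the parsers repair bad markup differently, it helps to name the parser explicitly in scraper code so that results stay the same across machines.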


Original article: https://www.cnblogs.com/taozizainali/p/9295117.html
