Python Web Crawling and Information Extraction (2): BeautifulSoup

Posted: 2017-09-30 20:54:43


Beautiful Soup

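The official Beautiful Soup description:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document.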

Official website: https://www.crummy.com/software/BeautifulSoup/

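1. Installation

In "C:\Windows\System32", find "cmd.exe", run it as administrator, and enter "pip install beautifulsoup4" at the command line.

If pip reports that its own version is out of date, upgrade it with python -m pip install --upgrade pip.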

C:\Windows\system32>pip install beautifulsoup4
Requirement already satisfied (use --upgrade to upgrade): beautifulsoup4 in c:\users\lei\appdata\local\programs\python\python35\lib\site-packages\beautifulsoup4-4.5.0-py3.5.egg
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
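
The installed version can also be confirmed from Python itself. The lines below are a minimal sketch that assumes only that the beautifulsoup4 package is importable as bs4; the variable name is chosen for illustration.

```python
# Confirm that Beautiful Soup is importable and report its version.
import bs4

installed_version = bs4.__version__  # version string of the installed package
print("beautifulsoup4 version:", installed_version)
```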

Testing the Beautiful Soup installation:

Demo HTML page: http://www.cnblogs.com/yan-lei


>>> import requests
>>> r = requests.get("http://www.cnblogs.com/yan-lei/")
>>> r.text
'\r\n\r\n\r\n\r\n\r\n\r\nPython学习者 - 博客园\r\n......'
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup

Python学习者 - 博客园......

from bs4 import BeautifulSoup
soup = BeautifulSoup('data', 'html.parser')
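
The installation can also be tested without a network connection. The sketch below parses a small hard-coded document instead of a live page; the markup string and the variable names are made up for illustration and are not taken from the demo page.

```python
from bs4 import BeautifulSoup

# A tiny, hard-coded document (illustrative only, not the demo page).
markup = "<html><head><title>Test page</title></head><body><p>data</p></body></html>"

soup = BeautifulSoup(markup, "html.parser")

# If the library is installed correctly, the parsed tree can be queried.
title_text = soup.title.string   # text inside <title>
first_paragraph = soup.p.string  # text inside the first <p>
print(title_text, first_paragraph)
```

If both values print without an ImportError, the library and the built-in parser are working.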

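2. Using the Beautiful Soup library

Take HTML as an example: every HTML file is organized as a set of "<>" pairs, that is, tags. The tags nest inside one another in parent-child relationships, and together they form a tag tree. Beautiful Soup is a library for parsing, traversing, and maintaining this tag tree.

<p>..</p>: a tag (Tag)

Name: the tag's name, which normally appears as an opening and closing pair
Attributes: zero or more per tag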
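
As a rough illustration of the tag tree, the sketch below parses a small made-up fragment and moves one level up and one level down from a tag. The markup is an assumption chosen for this example; .name, .attrs, .parent and .children are standard Beautiful Soup attributes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: a <p> tag with one attribute, nested inside a <div>.
markup = '<div><p class="title">Hello <b>world</b></p></div>'
soup = BeautifulSoup(markup, "html.parser")

p_tag = soup.p
tag_name = p_tag.name            # the tag's name
tag_attrs = p_tag.attrs          # the tag's attributes as a dictionary
parent_name = p_tag.parent.name  # one level up the tree

# One level down: keep only child tags, skipping plain text nodes.
child_tag_names = [child.name for child in p_tag.children if isinstance(child, Tag)]

print(tag_name, tag_attrs, parent_name, child_tag_names)
```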

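Importing the Beautiful Soup library

The Beautiful Soup library is also known as beautifulsoup4 or bs4. By convention it is imported as shown here; the BeautifulSoup class is what you will mainly use.

from bs4 import BeautifulSoup
import bs4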

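The BeautifulSoup class

Parsing converts the tag tree into a BeautifulSoup object. From that point on, the HTML document, the tag tree, and the BeautifulSoup object can be treated as equivalent.

from bs4 import BeautifulSoup
# Parse markup passed in as a string
soup1 = BeautifulSoup("data", "html.parser")
# Parse markup read from an open file; the parser name is the second argument to BeautifulSoup, not to open()
soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")

A BeautifulSoup object corresponds to the entire content of one HTML/XML document.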
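
The two ways of building a BeautifulSoup object, from a string and from an open file, can be sketched as follows. To stay runnable anywhere, the example writes a throwaway file first; the file name demo.html inside a temporary directory and its contents are placeholders, not the author's D://demo.html.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html_text = "<html><body><p>data</p></body></html>"

# 1) Build the object directly from a string.
soup_from_string = BeautifulSoup(html_text, "html.parser")

# 2) Build it from a file handle. A temporary file stands in for a real page.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_text)
    # The with-block closes the file once parsing is done.
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

# Both objects represent the same document.
same_document = str(soup_from_string) == str(soup_from_file)
print(same_document)
```

Opening the file in a with-block is a design choice of this sketch: it avoids leaving the file handle open after the document has been parsed.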

Beautiful Soup parsers

soup = BeautifulSoup('data', 'html.parser')

Parser	Usage	Requirement
HTML parser used by bs4 (Python's built-in html.parser)	BeautifulSoup(mk, 'html.parser')	install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
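
Because the lxml parsers need a separate install, code that prefers lxml may want a fallback. The sketch below is one possible approach, not something the original post prescribes: make_soup is a hypothetical helper name, and it relies on Beautiful Soup raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    Hypothetical helper written for this example.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser (for example lxml) is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>data</p>")
paragraph_text = soup.p.string
print(paragraph_text)
```

Note that different parsers can build slightly different trees from the same fragment (lxml wraps it in html and body tags), so the example only reads the p tag, which exists either way.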

newsoup = BeautifulSoup("This is not a comment", "html.parser")

Basic elements of the BeautifulSoup class

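Basic element	Description
Tag	A tag, the most basic unit of information organization; it opens with <> and closes with </>
Name	The tag's name; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the string in <>...</>. Format: <tag>.string
Comment	The comment part of a string inside a tag; a special Comment type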
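
The five basic elements can be seen together in a short session. This is a sketch under stated assumptions: the markup string, including the comment inside the b tag and the class attribute, is invented for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: one tag holding a comment, one holding ordinary text.
markup = '<b><!--This is a comment--></b><p class="note">This is not a comment</p>'
newsoup = BeautifulSoup(markup, "html.parser")

p_tag = newsoup.p            # Tag
p_name = p_tag.name          # Name
p_attrs = p_tag.attrs        # Attributes, as a dictionary
p_string = p_tag.string      # NavigableString
b_string = newsoup.b.string  # Comment; the <!-- --> markers are not included

print(type(p_tag).__name__, p_name, p_attrs)
print(type(p_string).__name__, p_string)
print(type(b_string).__name__, b_string)
```

Since .string returns a comment's text without its markers, checking the type (for example with isinstance(value, Comment)) is the way to tell a comment from ordinary text.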

Original article: http://www.cnblogs.com/yan-lei/p/7615902.html
