Learning Python: A Small BeautifulSoup Example

Date: 2017-03-18 18:16:55

Tags: ons, urlopen, fetching web pages, page, title, tag, body, read, specify

The most commonly used object in the BeautifulSoup library is, fittingly, the BeautifulSoup object itself.

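from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and hand its HTML to BeautifulSoup.
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
bsObj = BeautifulSoup(html.read(), "html.parser")
# Attribute access returns the first <h1> tag in the document.
print(bsObj.h1)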

Note that bsObj.tagname returns only the first tag named tagname on the page.

The output is:

<h1>An Interesting Title</h1>
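To make the "first tag only" behavior concrete, here is a minimal sketch that runs without a network connection. The inline HTML string and its contents are assumptions made up for illustration; they are not the contents of page1.html. It contrasts attribute access with find_all(), which returns every match, and shows that asking for a tag that does not exist yields None.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a downloaded page (an assumption).
sample_html = "<html><body><h1>First</h1><p>text</p><h1>Second</h1></body></html>"
bsObj = BeautifulSoup(sample_html, "html.parser")

first_h1 = bsObj.h1              # attribute access: the first <h1> only
all_h1 = bsObj.find_all("h1")    # every <h1>, in document order
missing = bsObj.h2               # there is no <h2>, so this is None

print(first_h1)
print(len(all_h1))
print(missing)
```

Because a missing tag comes back as None, it is worth checking the result before calling methods on it.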

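As in the previous example, we import urlopen and call html.read() to get the page's HTML content. That HTML is then passed to the BeautifulSoup object, which converts it into the following structure: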

html → <html><head>...</head><body>...</body></html>
    - head → <head><title>A Useful Page</title></head>
        - title → <title>A Useful Page</title>
    - body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
        - h1 → <h1>An Interesting Title</h1>
        - div → <div>Lorem Ipsum dolor...</div>
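The nesting shown in this outline can be reproduced by walking the parsed tree. The sketch below is only an illustration: page_html is an abridged stand-in for the page (an assumption, not the real download), and outline() is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Abridged stand-in for the page's HTML (an assumption for illustration).
page_html = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1>"
    "<div>Lorem ipsum dolor...</div></body></html>"
)
bsObj = BeautifulSoup(page_html, "html.parser")

def outline(node, depth=0):
    """Return (depth, tag name) pairs for every tag nested under node."""
    rows = []
    for child in node.children:
        if isinstance(child, Tag):  # skip plain text nodes
            rows.append((depth, child.name))
            rows.extend(outline(child, depth + 1))
    return rows

tree = outline(bsObj)
for depth, name in tree:
    print("    " * depth + name)
```

Printing the pairs with indentation gives the same html / head / title / body / h1 / div hierarchy as the outline.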

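As you can see, the <h1> tag we extracted from the page is nested two levels deep inside the BeautifulSoup object bsObj (html → body → h1). Even so, when we pull the h1 tag out of the object, we can refer to it directly: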

bsObj.h1

In fact, each of the following expressions produces the same result:

bsObj.html.body.h1

bsObj.body.h1

bsObj.html.h1
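A quick way to check that these expressions really agree is to compare what they return. This is a minimal sketch using an abridged stand-in for the page (page_html is an assumption, not the real download). Each attribute lookup searches the descendants of the tag it is called on, so all four paths arrive at the same <h1> node.

```python
from bs4 import BeautifulSoup

# Abridged stand-in for the page's HTML (an assumption for illustration).
page_html = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1>"
    "<div>Lorem ipsum dolor...</div></body></html>"
)
bsObj = BeautifulSoup(page_html, "html.parser")

# The four access paths from the text.
paths = [bsObj.h1, bsObj.html.body.h1, bsObj.body.h1, bsObj.html.h1]

# True when every path returns the very same node in the parsed tree.
same_node = all(tag is paths[0] for tag in paths)
rendered = {str(tag) for tag in paths}

print(same_node)
print(rendered)
```

The shorter forms are convenient, but the fully qualified path (bsObj.html.body.h1) makes the expected location explicit and can be easier to read in longer scripts.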

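Hopefully this example shows how powerful and simple the BeautifulSoup library is. In fact, information can be extracted from any node of any HTML (or XML) file, as long as the target information has some markup around or near it.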

Original article: http://www.cnblogs.com/yintingting/p/6574891.html