
Python Series (1): Using BeautifulSoup


It has been a while since I last updated this blog. I plan to write a series on Python web scraping and data analysis. (I shouldn't plant a flag too casually, though, lest it come back to embarrass me.)

With Python scraping, fetching the content is the process, analyzing the data is the result, and drawing conclusions is the real goal. A Python crawler usually fetches content from web pages, so how do we extract the information we want from an HTML page? We need to parse it. The commonly used tools are BeautifulSoup, PyQuery, XPath, and regular expressions. Regular expressions are error-prone (and have always been a weak point of mine), so I will only cover the other three, starting today with BeautifulSoup.

1. Introduction

BeautifulSoup translates literally as "beautiful soup". It is a Python library for extracting data from HTML and XML files, and it provides idiomatic ways to navigate, search, and modify the parse tree using the parser of your choice.

2. Installation

 pip install beautifulsoup4
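BeautifulSoup delegates the actual parsing to a parser. Python's built-in html.parser works with no extra installation, while the lxml parser used in some of the examples below is a third-party package (pip install lxml). A minimal sketch of choosing between the two, assuming lxml has been installed:

from bs4 import BeautifulSoup

# the standard-library parser, available everywhere
soup = BeautifulSoup("<p>hello</p>", features="html.parser")
# the faster lxml parser; assumes you have run: pip install lxml
soup = BeautifulSoup("<p>hello</p>", features="lxml")
print(soup.p.string)
#hello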

3. The Test Document

Here is an excerpt from Alice's Adventures in Wonderland (referred to below as the "Alice" document):

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>

We will use the document above for all of the examples that follow.

4. Basic Usage

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, features="html.parser")
#print(soup.prettify())

print(soup.title)               # the first <title> tag
#<title>The Dormouse's story</title>
print(soup.title.name)          # the tag's name
#title
print(soup.title.string)        # the text inside the tag
#The Dormouse's story
print(soup.title.parent.name)   # the name of the enclosing tag
#head

print(soup.p)                   # the first <p> tag
#<p class="title"><b>The Dormouse's story</b></p>
print(soup.p["class"])          # attributes are accessed with dictionary-style keys
#['title']

print(soup.a)                   # the first <a> tag
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.find_all("a"))       # every <a> tag in the document
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.find(id="link3"))    # the tag whose id attribute is "link3"
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all("a"):
    print(link.get("href"))
#http://example.com/elsie
#http://example.com/lacie
#http://example.com/tillie

print(soup.get_text())          # all of the text in the document
#The Dormouse's story

#The Dormouse's story
#Once upon a time there were three little sisters; and their names were
#Elsie,
#Lacie and
#Tillie;
#and they lived at the bottom of a well.
#...

In all of the code above, each comment shows the output of the line immediately before it.
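Besides tag names, find_all() also accepts attribute filters. A short sketch, reusing the soup object from above (the regular-expression filter is just one illustration; class_ has a trailing underscore because class is a reserved word in Python):

import re

print(soup.find_all("a", class_="sister"))      # all three sister links
print(soup.find_all(id="link2"))                # match on any attribute
#[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(soup.find_all(href=re.compile("tillie"))) # match attribute values with a regex
#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]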

 

5. BeautifulSoup Accepts a String or a File Handle

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', features="lxml")
tag = soup.b
print(tag)
#<b class="boldest">Extremely bold</b>
tag.name = "blockquote"
print(tag)
#<blockquote class="boldest">Extremely bold</blockquote>
print(tag["class"])
#['boldest']
print(tag.attrs)                # all attributes as a dict
#{'class': ['boldest']}
tag["id"] = "stylebs"           # add an attribute
print(tag)
#<blockquote class="boldest" id="stylebs">Extremely bold</blockquote>
del tag["id"]                   # remove an attribute
print(tag)
#<blockquote class="boldest">Extremely bold</blockquote>
        
css_soup = BeautifulSoup('<p class="body strikeout"></p>', features="lxml")
print(css_soup.p["class"])      # class is a multi-valued attribute
#['body', 'strikeout']

id_soup = BeautifulSoup('<p id="my id"></p>', features="lxml")
print(id_soup.p["id"])          # id is not multi-valued, so it stays a single string
#my id

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', features="lxml")
print(rel_soup.a["rel"])
#['index']
rel_soup.a["rel"] = ["index", "contents"]
print(rel_soup.p)
#<p>Back to the <a rel="index contents">homepage</a></p>
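The examples above all pass a string, but as the section title says, a file handle works just as well. A minimal sketch, assuming the Alice document has been saved locally as alice.html (a hypothetical file name):

from bs4 import BeautifulSoup

# pass an open file object instead of a string; alice.html is a hypothetical local copy
with open("alice.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, features="html.parser")
print(soup.title)
#<title>The Dormouse's story</title>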

 

 

Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id40

 


Original article: https://www.cnblogs.com/kumufengchun/p/11699687.html
