Tags: query, aci, extract, features, common, regular expressions, for, int, pre
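It has been a long time since I last updated this blog. I plan to write a series on Python web scraping and data analysis. I shouldn't set flags like that lightly, though, or I may end up eating my words.
Scraping content with Python is the process, analyzing the data is the result, and reaching a conclusion is the real goal. A Python crawler usually gets its content from web pages, so how do we pull the information we want out of an HTML page? That takes parsing. The tools in common use today are BeautifulSoup, PyQuery, XPath, and regular expressions. Regular expressions are error-prone and have always been a weak spot of mine, so I will cover the other three. Today we start with BeautifulSoup.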
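1. Introduction
BeautifulSoup translates literally as "beautiful soup". It is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.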
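As a rough illustration of those three operations, the sketch below navigates, searches, and modifies a tiny document. The HTML snippet and the variable names are made up for this example, and it uses Python's built-in html.parser so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# A tiny made-up document, used only for this illustration.
snippet = '<div><p id="intro">Hello, <b>soup</b>!</p><p>Bye</p></div>'
soup = BeautifulSoup(snippet, features="html.parser")

# Navigate: walk down the tree by tag name.
bold_text = soup.div.p.b.string

# Search: collect every <p> tag in the document.
paragraphs = soup.find_all("p")

# Modify: change an attribute of the first <p> in place.
soup.p["id"] = "greeting"

print(bold_text)
print(len(paragraphs))
print(soup.p["id"])
```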
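2. Installation
pip install beautifulsoup4
Some of the later examples pass features="lxml", which needs the separate lxml package: pip install lxml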
3. Preparing the test document
This is a passage from Alice in Wonderland (referred to below as the "Alice" document).
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
We will use the document above for the tests that follow.
4. Usage
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Parse the document with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, features="html.parser")
# print(soup.prettify())

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.title.name)
# title

print(soup.title.string)
# The Dormouse's story

print(soup.title.parent.name)
# head

print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

print(soup.p['class'])
# ['title']

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.find(id='link3'))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

print(soup.get_text())
# (blank lines omitted)
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
Each comment above shows the output of the statement just before it.
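In real scraping work the links usually need to end up in a plain Python structure rather than being printed. Here is a minimal sketch of that, assuming the same three sister links as in the Alice document. The helper name extract_links is hypothetical, and filtering with class_ is only one of several ways to narrow the search.

```python
from bs4 import BeautifulSoup

# A shortened copy of the Alice document, containing only the three links.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""


def extract_links(html, css_class):
    """Return (text, href) pairs for every <a> tag with the given class."""
    soup = BeautifulSoup(html, features="html.parser")
    # class_ has a trailing underscore because "class" is a Python keyword.
    anchors = soup.find_all("a", class_=css_class)
    return [(a.get_text(), a.get("href")) for a in anchors]


links = extract_links(html_doc, "sister")
for text, href in links:
    print(text, href)
```

A class that does not appear in the document simply gives back an empty list, so the caller does not need a special case for "nothing found".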
5. BeautifulSoup accepts a string or a file handle
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', features="lxml")
tag = soup.b
print(tag)
# <b class="boldest">Extremely bold</b>

# Renaming a tag changes the markup it produces.
tag.name = "blockquote"
print(tag)
# <blockquote class="boldest">Extremely bold</blockquote>

print(tag['class'])
# ['boldest']

print(tag.attrs)
# {'class': ['boldest']}

# Attributes can be added and deleted like dictionary keys.
tag['id'] = "stylebs"
print(tag)
# <blockquote class="boldest" id="stylebs">Extremely bold</blockquote>

del tag['id']
print(tag)
# <blockquote class="boldest">Extremely bold</blockquote>

# class is a multi-valued attribute, so it comes back as a list.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', features="lxml")
print(css_soup.p['class'])
# ['body', 'strikeout']

# id is not multi-valued, so it stays a single string.
id_soup = BeautifulSoup('<p id="my id"></p>', features="lxml")
print(id_soup.p['id'])
# my id

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', features="lxml")
print(rel_soup.a['rel'])
# ['index']

# A list assigned to a multi-valued attribute is joined with spaces on output.
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>
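The examples above all pass strings. To illustrate the file-handle case named in the heading, here is a minimal sketch that writes a small page to a temporary file and then hands the open file object to BeautifulSoup. The file contents and variable names are invented for this example, and html.parser is used so that lxml is not required.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small made-up page to a temporary file so there is something to open.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(
        "<html><head><title>From a file</title></head>"
        "<body><p>Hi</p></body></html>"
    )
    path = tmp.name

# Pass the open file handle directly; BeautifulSoup reads the content itself.
with open(path, encoding="utf-8") as fp:
    file_soup = BeautifulSoup(fp, features="html.parser")

os.remove(path)  # clean up the temporary file

print(file_soup.title.string)
print(file_soup.p.get_text())
```

The document is fully parsed inside the with block, so the resulting soup remains usable after the file has been closed and removed.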
Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id40
Original article: https://www.cnblogs.com/kumufengchun/p/11699687.html