BeautifulSoup4's find_all() and select(): learning to write a simple web scraper

Date: 2019-11-03 14:54:59

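Combining regular expressions with BeautifulSoup makes it much easier to pull specific content out of a web page.

Let's practice on the Baidu Tieba home page: https://tieba.baidu.com/index.html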

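1. find_all(): searches all descendants of the current node (children, grandchildren, and so on) and returns every match.

The example below uses find_all() to match the links in Tieba's category module whose href contains the word "娱乐" (entertainment).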

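from urllib.request import urlopen
import re

from bs4 import BeautifulSoup

# Download the Tieba home page and parse it with the built-in HTML parser.
html = urlopen('https://tieba.baidu.com/index.html').read()
soup = BeautifulSoup(html, 'html.parser')

# Match every <a> tag whose href contains "娱乐".
for link in soup.find_all('a', href=re.compile('娱乐')):
    # An f-string avoids a TypeError when a link has no title attribute.
    print(f"{link.get('title')}:{link.get('href')}")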

Output:
娱乐明星:/f/index/forumpark?pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
港台东南亚明星:/f/index/forumpark?cn=港台东南亚明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
内地明星:/f/index/forumpark?cn=内地明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
韩国明星:/f/index/forumpark?cn=韩国明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
日本明星:/f/index/forumpark?cn=日本明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
时尚人物:/f/index/forumpark?cn=时尚人物&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
欧美明星:/f/index/forumpark?cn=欧美明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
主持人:/f/index/forumpark?cn=主持人&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
其他娱乐明星:/f/index/forumpark?cn=其他娱乐明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
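
The href values in the output above are site-relative paths, so they usually need to be resolved against the page URL before they can be requested. The following is a minimal sketch of one way to do that with the standard library's urljoin. SAMPLE_HTML, BASE_URL and absolute_links are hypothetical names introduced only for this illustration, and the markup is a stand-in for the real page so the sketch runs offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL: the page scraped in the example above.
BASE_URL = "https://tieba.baidu.com/index.html"

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<div>
  <a title="内地明星" href="/f/index/forumpark?cn=内地明星&amp;pn=1">内地明星</a>
  <a href="/f/index/forumpark?cn=无标题&amp;pn=1">no title attribute</a>
  <a title="no href">missing href</a>
</div>
"""


def absolute_links(html, base_url):
    """Return (title, absolute_url) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # href=True keeps only tags that actually carry an href attribute.
    for link in soup.find_all("a", href=True):
        # Fall back to the visible link text when there is no title attribute.
        title = link.get("title") or link.get_text(strip=True)
        pairs.append((title, urljoin(base_url, link["href"])))
    return pairs


links = absolute_links(SAMPLE_HTML, BASE_URL)
for title, url in links:
    print(f"{title}: {url}")
```

Filtering with href=True and falling back to the link text are choices made for this sketch. The original example simply prints whatever get() returns.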

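soup.find_all('a', href=re.compile('娱乐')) is equivalent to soup('a', href=re.compile('娱乐')).
Calling a BeautifulSoup object (or a tag) directly is shorthand for find_all(), so the example above can also be written as soup(...).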
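
To check this equivalence without hitting the network, here is a small sketch that runs against an inline HTML fragment. SAMPLE_HTML is hypothetical markup that only imitates the nested category links on the real page. It also shows that find_all() reaches nested descendants, while recursive=False limits the search to direct children.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the category module of the real page.
SAMPLE_HTML = """
<ul>
  <li><a title="娱乐明星" href="/f/index/forumpark?pcn=娱乐明星&amp;pn=1">娱乐明星</a>
    <ul>
      <li><a title="内地明星" href="/f/index/forumpark?cn=内地明星&amp;pcn=娱乐明星&amp;pn=1">内地明星</a></li>
    </ul>
  </li>
  <li><a title="体育" href="/f/index/forumpark?pcn=体育&amp;pn=1">体育</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pattern = re.compile("娱乐")

# find_all() walks every descendant, so the nested <a> is found too.
# A compiled regex is applied with search(), i.e. it may match anywhere in the href.
by_find_all = soup.find_all("a", href=pattern)

# Calling the soup object directly is shorthand for find_all().
by_call = soup("a", href=pattern)

# recursive=False only looks at direct children of soup, none of which is an <a>.
direct_only = soup.find_all("a", href=pattern, recursive=False)

titles = [a["title"] for a in by_find_all]
print(titles)
print(by_find_all == by_call)
print(direct_only)
```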

2. select(): loop over the elements matched by a CSS selector:

** Find the <a> tags in the page whose href starts with "/f/index":

for link2 in soup.select('a[href^="/f/index"]'):
    print(f"{link2.get('title')}:{link2.get('href')}")


** Find the <a> tags in the page whose href ends with "&pn=1":

for link2 in soup.select('a[href$="&pn=1"]'):
    print(f"{link2.get('title')}:{link2.get('href')}")


** Find the <a> tags in the page whose href contains "娱乐":

for link3 in soup.select('a[href*="娱乐"]'):
    print(f"{link3.get('title')}:{link3.get('href')}")
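
The three attribute selectors used above (^= for prefix, $= for suffix, *= for substring) can be compared side by side on a small offline fragment. This is only a sketch: SAMPLE_HTML and the helper titles_for are hypothetical names made up for the illustration, not part of the real page or of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the href patterns imitate the real page.
SAMPLE_HTML = """
<div>
  <a title="娱乐明星" href="/f/index/forumpark?pcn=娱乐明星&amp;pn=1">娱乐明星</a>
  <a title="体育" href="/f/index/forumpark?pcn=体育&amp;pn=2">体育</a>
  <a title="Help" href="/help/娱乐">Help</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def titles_for(selector):
    """Return the title attribute of every element matched by a CSS selector."""
    return [a.get("title") for a in soup.select(selector)]


starts = titles_for('a[href^="/f/index"]')   # href begins with /f/index
ends = titles_for('a[href$="&pn=1"]')        # href ends with &pn=1
contains = titles_for('a[href*="娱乐"]')      # href contains 娱乐 anywhere

print(starts)
print(ends)
print(contains)
```

The *= selector plays the same role here as href=re.compile('娱乐') did with find_all() in section 1: both match the substring anywhere in the attribute value.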


Original article: https://www.cnblogs.com/suancaipaofan/p/11786046.html
