Getting all the links on a web page with Python

Date: 2020-05-07 23:09:21

Tags: selenium none beautiful txt network port hao123 request www

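Note: the script needs the third-party beautifulsoup4 package (imported as bs4), so install it first, for example with pip install beautifulsoup4. Selenium is not required.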

Version: Python 3

from bs4 import BeautifulSoup
from urllib import request

# URL of the page to fetch
url = 'https://www.hao123.com/'

# Request the URL and get the page's HTML
html = request.urlopen(url)

# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')

# Find all <a> tags, since links live in <a> tags
data = soup.find_all('a')

# Open a file to persist the results; the with block closes it automatically
with open('D:/link.txt', mode='w', encoding='utf-8') as file:
    # Iterate over the <a> tags and read each one's href value and text
    for item in data:
        href = item.get('href')  # None when the tag has no href attribute
        if item.string is not None and href not in (None, 'javascript:;', '#'):
            print(item.string, href)
            file.write(f'{item.string} {href}\n')
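
The script above writes each href exactly as it appears in the page, so relative paths, repeated links and non-web schemes can end up in the file. The following is a minimal sketch of one way to tidy the (text, href) pairs before writing them, using only the standard library's urllib.parse. The function name normalize_links and its parameters are hypothetical and not part of the original script.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(pairs, base_url):
    """Return unique absolute http(s) links from (text, href) pairs.

    Hypothetical helper: `pairs` is any iterable of (text, href) tuples,
    and `base_url` is the address of the page the links came from.
    """
    seen = set()
    result = []
    for text, href in pairs:
        # Skip missing hrefs and same-page anchors such as "#" or "#top".
        if not href or href.strip().startswith("#"):
            continue
        # Resolve relative paths like "/news" against the page URL.
        absolute = urljoin(base_url, href.strip())
        # Drop javascript:, mailto: and other non-web schemes.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        # Ignore the fragment so "page#a" and "page" count as one link.
        absolute = absolute.split("#", 1)[0]
        if absolute in seen:
            continue
        seen.add(absolute)
        result.append((text.strip(), absolute))
    return result
```

With the script above, the pairs could be built as [(item.string, item.get('href')) for item in data if item.string is not None] and passed in together with url. Whether to drop fragments or keep duplicate links is a design choice that depends on what the link list is for.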

Original article: https://www.cnblogs.com/li1234567980/p/12846077.html
