Python Daily Practice (2): Finding All Links in an HTML File (XPath and Regex Versions)

Posted: 2016-01-20 12:36:36


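To extract specific content from an HTML file, first look at what that content is and where it sits in the markup. Only then can you write something that finds it.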

Assume the HTML file is named "1.html" and that every href attribute appears inside an a tag.

Regex version:

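#coding:utf-8
import re

# Read the whole HTML file into one string.
with open('1.html', 'r') as f:
    data = f.read()

# Capture everything between href=" and the next double quote.
result = re.findall(r'href="(.*?)"', data)
for each in result:
    print(each)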
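As a quick check of the pattern, here is a minimal sketch that runs the same regex against an inline string instead of 1.html. The sample markup and the names `sample` and `links` are made up for illustration. It suggests two limits of the pattern: it matches href on any tag, not only a, and it only sees double-quoted values.

```python
import re

# Hypothetical sample markup, standing in for the contents of 1.html.
sample = '''
<link href="style.css" rel="stylesheet">
<a href="http://example.com/">home</a>
<a class="nav" href="/about">about</a>
<a href='/single'>single-quoted</a>
'''

# Same pattern as the regex version: a non-greedy capture
# between href=" and the next double quote.
links = re.findall(r'href="(.*?)"', sample)
for link in links:
    print(link)
```

With this sample the link tag's href is picked up and the single-quoted href is missed, which is why the assumption that every href lives in an a tag matters for the regex version.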

XPath version:

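#coding:utf-8
from lxml import etree

# Read the whole HTML file into one string.
with open('1.html', 'r') as f:
    data = f.read()

# Parse the string into an element tree using lxml's HTML parser.
selector = etree.HTML(data)

# Select the href attribute of every a element in the document.
result = selector.xpath('//a/@href')
for each in result:
    print(each)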
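The expression `//a/@href` only looks at a elements, so it does not depend on how the attribute is quoted or on where href appears in the tag. For readers without lxml installed, the same idea can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the XPath code above, and it assumes Python 3. The class name `AnchorHrefCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class AnchorHrefCollector(HTMLParser):
    """Collect href values from a tags only (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


# Hypothetical sample markup, standing in for the contents of 1.html.
sample = '''
<link href="style.css" rel="stylesheet">
<a href="http://example.com/">home</a>
<a class="nav" href="/about">about</a>
<a href='/single'>single-quoted</a>
'''

collector = AnchorHrefCollector()
collector.feed(sample)
for href in collector.hrefs:
    print(href)
```

On this sample the link tag is skipped and the single-quoted href is kept, which is the behavior one would expect from a tag-aware approach rather than a text pattern.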

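The XPath version is one line longer than the regex version. The HTML file itself seems to be too long to post: pasting it here returned a 502 error. Any help is appreciated.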

The editor does not seem to have anywhere to upload attachments?


Original article: http://www.cnblogs.com/sxcmos/p/5144542.html
