Parsing HTML with lxml in Python

Date: 2014-12-29 06:29:21

lxml is the most feature-rich and easy-to-use library for processing XML and HTML in Python. For details, see http://lxml.de/index.html.

On Windows, lxml can be installed with the easy_install tool or directly from a binary installer. For convenience, I chose the binary installer.

Download page for the binaries: https://pypi.python.org/pypi/lxml/3.4.1

Choose the build that matches your environment. My system is 64-bit Windows 7 with Python 2.7, so I picked the lxml build for that combination.
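One quick way to confirm that the install worked is to import the package and print its version. The snippet below is a minimal sketch, not part of the original post; it relies only on the `LXML_VERSION` and `LIBXML_VERSION` attributes that `lxml.etree` exposes.

```python
from lxml import etree

# Both attributes are tuples of integers describing the installed versions.
lxml_version = etree.LXML_VERSION
libxml_version = etree.LIBXML_VERSION

print("lxml version: %s" % ".".join(str(n) for n in lxml_version))
print("libxml2 version: %s" % ".".join(str(n) for n in libxml_version))
```

If the import fails, the installed build most likely does not match the Python version or architecture.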


Once the installation is complete, you can start writing Python code:

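from lxml import etree

# Read the file as bytes so that lxml, not the platform's default codec,
# handles the decoding (the file declares utf-8 in its <meta> tag).
with open('d:\\GitHub\\python27\\simple.html', 'rb') as f:
    tree = etree.HTML(f.read())

# Select the <div> whose id attribute is "name" and print its text.
nodes = tree.xpath("//div[@id='name']")
print(nodes[0].text)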

The HTML file used:

<!DOCTYPE html>
<html>
<head>
<title>This is a simple html file</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
<div id="container">
    <div id="name" class="item">勇者面码</div>
    <div id="sex">女</div>
    <div id="borth">9.18</div>
</div>
</body>
</html>
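To try the same XPath approach without a file on disk, the sketch below embeds a copy of the sample document as a string and collects every field under the `container` div into a dictionary. The names `SIMPLE_HTML` and `fields` are illustrative, and passing an explicit `HTMLParser(encoding="utf-8")` is an assumption made here to keep the decoding unambiguous; it is not something the original script does.

```python
from lxml import etree

# Inline copy of simple.html, encoded to bytes as it would be on disk.
SIMPLE_HTML = u"""<!DOCTYPE html>
<html>
<head>
<title>This is a simple html file</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
<div id="container">
    <div id="name" class="item">勇者面码</div>
    <div id="sex">女</div>
    <div id="borth">9.18</div>
</div>
</body>
</html>
""".encode("utf-8")

# Tell the parser which encoding the bytes use.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.HTML(SIMPLE_HTML, parser=parser)

# Map the id of each direct child <div> of #container to its text.
fields = {
    div.get("id"): div.text
    for div in tree.xpath("//div[@id='container']/div")
}

print(fields["borth"])
```

`div.get("id")` reads an attribute and `div.text` returns the text directly inside the element, so the dictionary ends up with one entry per child div.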

When parsing with lxml, a lowercase document header does not cause the parse to fail.
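The tolerance of the HTML parser can be checked with a small experiment. The input below is a hypothetical fragment (not the file from this post) with a lowercase doctype, an unclosed `<p>`, and no closing `</body>` or `</html>`; the sketch assumes only `etree.HTML`, `xpath`, and `etree.tostring`.

```python
from lxml import etree

# Hypothetical malformed input: lowercase doctype and missing closing tags.
broken = '<!doctype html><html><body><div id="name">menma</div><p>unclosed paragraph'

# etree.HTML uses a recovering HTML parser and returns the root <html> element.
tree = etree.HTML(broken)

names = tree.xpath("//div[@id='name']")
paragraphs = tree.xpath("//p")

print(names[0].text)
print(paragraphs[0].text)

# Serializing the tree shows the closing tags the parser filled in.
serialized = etree.tostring(tree).decode("ascii")
print(serialized)
```

The queries still find both elements, and the serialized output contains the end tags that were missing from the input.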


Original article: http://www.cnblogs.com/menma/p/4190919.html
