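Python has many libraries for parsing XML, but lxml is implemented in C under the hood, which gives it a clear speed advantage. Besides being fast, lxml is also very easy to use. This post walks through basic lxml usage, using the XML data below as the example.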
Example: dblp.xml (a fragment of the DBLP data)
<?xml version='1.0' encoding='utf-8'?>
<dblp>
    <article mdate="2012-11-28" key="journals/entropy/BellucciFMY08">
        <author>Stefano Bellucci</author>
        <author>Sergio Ferrara</author>
        <author>Alessio Marrani</author>
        <author>Armen Yeranyan</author>
        <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
        <pages>507-555</pages>
        <year>2008</year>
        <volume>10</volume>
        <journal>Entropy</journal>
        <number>4</number>
        <ee>http://dx.doi.org/10.3390/e10040507</ee>
        <url>db/journals/entropy/entropy10.html#BellucciFMY08</url>
    </article>
    <article mdate="2013-03-04" key="journals/entropy/Knuth13">
        <author>Kevin H. Knuth</author>
        <title><i>Entropy</i> Best Paper Award 2013.</title>
        <pages>698-699</pages>
        <year>2013</year>
        <volume>15</volume>
        <journal>Entropy</journal>
        <number>2</number>
        <ee>http://dx.doi.org/10.3390/e15020698</ee>
        <url>db/journals/entropy/entropy15.html#Knuth13</url>
    </article>
</dblp>
1. Parse the XML into a tree and get the root of the tree
Parsing the file and getting the root element takes only a few lines:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import etree  # import lxml's etree module

tree = etree.parse("dblp.xml")  # parse the XML file into a tree
root = tree.getroot()  # get the root element of the tree
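When the XML is already in memory rather than in a file, `etree.fromstring` can be used instead. The sketch below is only an illustration: the sample bytes in `xml_bytes` are made up in the shape of dblp.xml. Note that `fromstring` returns the root element directly, so no `getroot()` call is needed.

```python
from lxml import etree

# Hypothetical in-memory sample, shaped like the dblp.xml fragment.
xml_bytes = b"""<?xml version='1.0' encoding='utf-8'?>
<dblp>
    <article mdate="2013-03-04" key="journals/entropy/Knuth13">
        <author>Kevin H. Knuth</author>
        <year>2013</year>
    </article>
</dblp>"""

# fromstring() parses the bytes and returns the root element itself.
root = etree.fromstring(xml_bytes)
print(root.tag)   # name of the root element
print(len(root))  # number of direct children of the root
```

Bytes are used here because lxml refuses a `str` that carries an encoding declaration.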
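Also, if the XML file contains a DTD declaration (as in the example below) and depends on it, for instance through entities that are defined only in the DTD, lxml has to be told to load the DTD when it parses the file.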
Example of an XML file that contains a DTD declaration:
hadoop@hadoop:~/20130722dblpxml$ head -15 dblp.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
<journal>IBM Research Report, San Jose, California</journal>
<volume>RJ909</volume>
<month>August</month>
<year>1971</year>
<cdrom>ibmTR/rj909.pdf</cdrom>
<ee>db/labs/ibm/RJ909.html</ee>
</article>
</dblp>
In this case, to parse the XML into a tree and get its root, pass in a parser that loads the DTD:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import etree  # import lxml's etree module

# Create a parser that loads the DTD (dblp.dtd must be in the same directory as the XML file)
parser = etree.XMLParser(load_dtd=True)
tree = etree.parse("dblp.xml", parser)  # parse the XML with that parser
root = tree.getroot()  # get the root element of the tree
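The main reason loading the DTD matters is entity resolution. The sketch below is a self-contained illustration under stated assumptions: `mini.dtd`, `mini.xml` and the author name are made up, and the tiny DTD declares only one entity. It writes both files to a temporary directory and parses them with `load_dtd=True`.

```python
import os
import tempfile

from lxml import etree

# Hypothetical miniature DTD: it only declares the entity used below.
DTD_TEXT = '<!ENTITY uuml "&#252;">\n'
XML_TEXT = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<!DOCTYPE dblp SYSTEM "mini.dtd">\n'
    '<dblp><article><author>J&uuml;rgen Example</author></article></dblp>\n'
)

with tempfile.TemporaryDirectory() as workdir:
    # The DTD sits next to the XML file, as the relative SYSTEM id expects.
    with open(os.path.join(workdir, "mini.dtd"), "w", encoding="ascii") as f:
        f.write(DTD_TEXT)
    xml_path = os.path.join(workdir, "mini.xml")
    with open(xml_path, "w", encoding="ascii") as f:
        f.write(XML_TEXT)

    # load_dtd=True makes the parser read mini.dtd, so &uuml; can be expanded.
    parser = etree.XMLParser(load_dtd=True)
    root = etree.parse(xml_path, parser).getroot()
    author = root[0][0].text

print(author)
```

The parsed tree lives in memory, so `author` is still usable after the temporary directory has been removed.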
2. Traverse the tree and read each element's attributes and child elements
for article in root:  # iterate over the children of the root (here, the article elements)
    print("Element name:", article.tag)  # .tag gives the element's name
    for field in article:  # iterate over the children of article (author, title, volume, year, ...)
        print(field.tag, ":", field.text)  # .tag is the element's name, .text is its text content
    mdate = article.get("mdate")  # .get("attribute name") returns the value of that attribute
    key = article.get("key")
    print("mdate:", mdate)
    print("key:", key)
    print()  # blank line between article elements
With that, simple XML data can already be parsed.
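Printing is fine for a quick look, but the same traversal can also collect the data into plain Python structures. The following is a minimal sketch, assuming a made-up one-article sample; the `records` layout is a hypothetical choice, not something lxml prescribes.

```python
from lxml import etree

# Hypothetical one-article sample in the same shape as dblp.xml.
root = etree.fromstring(
    '<dblp>'
    '<article mdate="2013-03-04" key="journals/entropy/Knuth13">'
    '<author>Kevin H. Knuth</author><year>2013</year>'
    '</article>'
    '</dblp>'
)

records = []
for article in root:
    # dict(element.attrib) copies the element's attributes into a plain dict.
    record = dict(article.attrib)
    # Each child contributes a (tag, text) pair; a list keeps repeated tags
    # such as several <author> elements, which a dict would overwrite.
    record["fields"] = [(field.tag, field.text) for field in article]
    records.append(record)

print(records)
```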
3. A complete parsing example
The code below parses the dblp.xml data shown at the beginning of this post.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import etree  # import lxml's etree module

tree = etree.parse("dblp.xml")  # parse the XML file into a tree
root = tree.getroot()  # get the root element of the tree

for article in root:  # iterate over the children of the root (here, the article elements)
    print("Element name:", article.tag)  # .tag gives the element's name
    for field in article:  # iterate over the children of article (author, title, volume, year, ...)
        print(field.tag, ":", field.text)  # .tag is the element's name, .text is its text content
    mdate = article.get("mdate")  # .get("attribute name") returns the value of that attribute
    key = article.get("key")
    print("mdate:", mdate)
    print("key:", key)
    print()  # blank line between article elements
It produces the following output:
Element name: article
author : Stefano Bellucci
author : Sergio Ferrara
author : Alessio Marrani
author : Armen Yeranyan
title : ES
pages : 507-555
year : 2008
volume : 10
journal : Entropy
number : 4
ee : http://dx.doi.org/10.3390/e10040507
url : db/journals/entropy/entropy10.html#BellucciFMY08
mdate: 2012-11-28
key: journals/entropy/BellucciFMY08

Element name: article
author : Kevin H. Knuth
title : None
pages : 698-699
year : 2013
volume : 15
journal : Entropy
number : 2
ee : http://dx.doi.org/10.3390/e15020698
url : db/journals/entropy/entropy15.html#Knuth13
mdate: 2013-03-04
key: journals/entropy/Knuth13
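4. Elements that contain both sub-elements and text
In the output above, the content of the title elements is wrong. Each title contains both sub-elements and text (see below), and in that case .text alone does not return the full content: it returns only the text that precedes the first sub-element. That is why the first article's title came out as just ES, while the second article's title came out as None.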
<title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
<title><i>Entropy</i> Best Paper Award 2013.</title>
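To see where the rest of the title goes, it helps to look at how lxml stores mixed content: text before the first child is the parent's `.text`, and text after a child's closing tag is that child's `.tail`. A small sketch using the two titles above, parsed on their own:

```python
from lxml import etree

title = etree.fromstring(
    "<title>ES<sup>2</sup>: A cloud data storage system for "
    "supporting both OLTP and OLAP.</title>"
)
sup = title[0]

print(repr(title.text))  # text before the first child element
print(repr(sup.text))    # text inside <sup>
print(repr(sup.tail))    # text after </sup>, stored on the child as .tail

second = etree.fromstring("<title><i>Entropy</i> Best Paper Award 2013.</title>")
print(repr(second.text))     # nothing precedes <i>, so .text is None
print(repr(second[0].tail))  # the rest of the title is the tail of <i>
```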
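Because the sub-elements in this example are simple, a quick workaround is to print the element's text together with its sub-elements, markup included. The code: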
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import etree  # import lxml's etree module

tree = etree.parse("dblp.xml")  # parse the XML file into a tree
root = tree.getroot()  # get the root element of the tree

for article in root:  # iterate over the children of the root (here, the article elements)
    print("Element name:", article.tag)  # .tag gives the element's name
    for field in article:  # iterate over the children of article (author, title, volume, year, ...)
        if field.tag == "title":
            # serialize the title so its text and sub-elements are printed together
            print(field.tag, ":", etree.tostring(field, encoding="unicode", pretty_print=False))
        else:
            print(field.tag, ":", field.text)  # .tag is the element's name, .text is its text content
    mdate = article.get("mdate")  # .get("attribute name") returns the value of that attribute
    key = article.get("key")
    print("mdate:", mdate)
    print("key:", key)
    print()  # blank line between article elements
The output:
Element name: article
author : Stefano Bellucci
author : Sergio Ferrara
author : Alessio Marrani
author : Armen Yeranyan
title : <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>

pages : 507-555
year : 2008
volume : 10
journal : Entropy
number : 4
ee : http://dx.doi.org/10.3390/e10040507
url : db/journals/entropy/entropy10.html#BellucciFMY08
mdate: 2012-11-28
key: journals/entropy/BellucciFMY08

Element name: article
author : Kevin H. Knuth
title : <title><i>Entropy</i> Best Paper Award 2013.</title>

pages : 698-699
year : 2013
volume : 15
journal : Entropy
number : 2
ee : http://dx.doi.org/10.3390/e15020698
url : db/journals/entropy/entropy15.html#Knuth13
mdate: 2013-03-04
key: journals/entropy/Knuth13
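Clearly, this is a clumsy fix: the tags and other unwanted parts still have to be stripped from the title afterwards with string processing. A better solution is a simple way to get both an element's own text and the text of its sub-elements. The next section shows how to do that.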
5. Handling sub-elements and text more elegantly
Suppose the XML file paper.xml has the following content:
<?xml version="1.0" encoding="ISO-8859-1"?>
<dblp>
    <article mdate="2002-01-03" key="persons/Codd71a">
        <author>E. F. Codd</author>
        <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
        <journal>IBM Research Report, San Jose, California</journal>
        <volume>RJ909</volume>
        <month>August</month>
        <year>1971</year>
    </article>
    <article mdate="2002-01-03" key="persons/Codd71a">
        <author>E. F. Codd</author>
        <title><i>Entropy</i> Best Paper Award 2013.</title>
        <journal>IBM Research Report, San Jose, California</journal>
        <volume>RJ909</volume>
        <month>August</month>
        <year>1971</year>
        <cdrom>ibmTR/rj909.pdf</cdrom>
        <ee>db/labs/ibm/RJ909.html</ee>
    </article>
</dblp>
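In the file above, the title field contains both text and nested child elements. To get the field's own text together with the text inside its child elements, a dedicated function is needed. Two implementations follow.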
5.1 v1.0
The first idea was to read each element's text recursively and join the pieces together:
from lxml import etree  # paper2.py


def node_text(node):
    # First attempt: reads only .text, so text that follows a child element (.tail) is missed
    result = node.text.strip() if node.text else ''
    for child in node:
        child_text = node_text(child)
        if child_text:
            result = result + ' %s' % child_text if result else child_text
    return result


if __name__ == '__main__':
    parser = etree.XMLParser()
    root = etree.parse('paper.xml', parser).getroot()
    for element in root:
        for attribute in element:
            if attribute.tag == "title":
                print("title:", node_text(attribute))
            else:
                print(attribute.tag + ":", attribute.text.strip())
        print()
Running it gives:
$ python paper2.py
author: E. F. Codd
title: ES 2
journal: IBM Research Report, San Jose, California
volume: RJ909
month: August
year: 1971

author: E. F. Codd
title: Entropy
journal: IBM Research Report, San Jose, California
volume: RJ909
month: August
year: 1971
cdrom: ibmTR/rj909.pdf
ee: db/labs/ibm/RJ909.html
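Clearly, this approach only collects the .text of each element and joins those pieces. The text that follows a sub-element is lost, so this is not what we want. I have no idea what I was thinking when I put it into use like that. Looking back now: too young, too simple, always naive.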
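What v1.0 is missing is the `.tail` of each child. As an illustration only, a recursive variant that keeps it might look like the sketch below; the name `node_text_recursive` is hypothetical and not part of the original code.

```python
from lxml import etree


def node_text_recursive(node):
    """Concatenate node.text, each child's full text, and each child's tail."""
    parts = [node.text or ""]
    for child in node:
        parts.append(node_text_recursive(child))
        # The tail is text of the parent that follows the child's closing tag.
        parts.append(child.tail or "")
    return "".join(parts)


title = etree.fromstring(
    "<title>ES<sup>2</sup>: A cloud data storage system for "
    "supporting both OLTP and OLAP.</title>"
)
print(node_text_recursive(title))
```

Unlike v1.0, this variant does not strip whitespace or insert spaces between pieces, so the text comes out exactly as it appears in the document.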
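5.2 v2.0
The data had been live for almost a year before I noticed this problem; I could hardly feel more stupid. So the function that collects all the text inside an XML node had to be rewritten (in hindsight, putting that logic into its own function was a sound decision). Here is the current version: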
from lxml import etree  # paper.py


def node_text(node):
    # itertext() yields the text of the node and of all its descendants,
    # including the text that follows each sub-element, in document order
    return "".join(node.itertext())


if __name__ == '__main__':
    parser = etree.XMLParser()
    root = etree.parse('paper.xml', parser).getroot()
    for element in root:
        for attribute in element:
            if attribute.tag == "title":
                print("title:", node_text(attribute))
            else:
                print(attribute.tag + ":", attribute.text.strip())
        print()
Running it gives the following result:
$ python paper.py
author: E. F. Codd
title: ES2: A cloud data storage system for supporting both OLTP and OLAP.
journal: IBM Research Report, San Jose, California
volume: RJ909
month: August
year: 1971

author: E. F. Codd
title: Entropy Best Paper Award 2013.
journal: IBM Research Report, San Jose, California
volume: RJ909
month: August
year: 1971
cdrom: ibmTR/rj909.pdf
ee: db/labs/ibm/RJ909.html
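For comparison, lxml offers at least two other calls that should produce the same string as joining `itertext()`: the XPath `string()` function and `etree.tostring` with `method="text"`. These are alternatives, not what the script above uses, and the sketch parses one title on its own as a made-up sample.

```python
from lxml import etree

title = etree.fromstring("<title><i>Entropy</i> Best Paper Award 2013.</title>")

# Reference result: join everything itertext() yields.
joined = "".join(title.itertext())
# XPath string() returns the string value of the element.
via_xpath = title.xpath("string()")
# method="text" serializes only text content; with_tail=False skips the element's own tail.
via_tostring = etree.tostring(title, method="text", encoding="unicode", with_tail=False)

print(joined)
print(via_xpath == joined, via_tostring == joined)
```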
With that, the problem is finally solved. What remains is correcting the data that is already live, but that is a separate problem.
Original article: http://www.cnblogs.com/Yiutto/p/5387021.html