码迷,mamicode.com

A Simple Look at HTMLParser in Python

Date: 2016-06-25 22:51:59


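Pick a web page, for example https://www.python.org/events/python-events/, view its source in a browser and save a copy (the listing below reads it from s.html). Then try parsing the HTML to print the time, name, and location of each conference published on the official Python site.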

 1 from html.parser import HTMLParser
 2 from html.entities import name2codepoint
 3 
 4 class MyHTMLParser(HTMLParser):
 5 
 6     in_title = False   # inside <h3 class="event-title">
 7     in_loca = False    # inside <span class="event-location">
 8     in_time = False    # inside <time>
 9 
10     def handle_starttag(self, tag, attrs):
11         if ('class', 'event-title') in attrs:
12             self.in_title = True
13         elif ('class', 'event-location') in attrs:
14             self.in_loca = True
15         elif tag == 'time':
16             self.in_time = True
17             self.times = []   # collect every text chunk inside <time>
18 
19     def handle_data(self, data):
20         if self.in_title:
21             print('-' * 50)
22             print('Title:' + data.strip())
23         if self.in_loca:
24             print('Location:' + data.strip())
25         if self.in_time:
26             self.times.append(data)
27     def handle_endtag(self, tag):
28         if tag == 'h3': self.in_title = False
29         if tag == 'span': self.in_loca = False
30         if tag == 'time':
31             self.in_time = False
32             print('Time:' + '-'.join(self.times))
33 parser = MyHTMLParser()
34 with open('s.html', encoding='utf-8') as html:   # assumes the copied source was saved as UTF-8
35     parser.feed(html.read())
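To experiment without a saved s.html, the same idea can be sketched against an inline string. This is a hypothetical variant, not the numbered listing: the names `EventParser`, `SAMPLE`, `END_TAG`, and `events` are made up for illustration, the snippet assumes the markup shape quoted in this post, and it collects the results in a list instead of printing them.

```python
from html.parser import HTMLParser

# Illustrative markup in the shape quoted from the python.org events page.
SAMPLE = (
    '<h3 class="event-title"><a href="/events/python-events/401/">PyOhio 2016</a></h3>'
    '<span class="event-location">The Ohio Union at The Ohio State University. '
    '1739 N. High Street, Columbus, OH 43210, USA</span>'
    '<time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. '
    '<span class="say-no-more"> 2016</span></time>'
)

# Which closing tag ends each field (an assumption about the page layout).
END_TAG = {'title': 'h3', 'location': 'span', 'time': 'time'}


class EventParser(HTMLParser):
    """Hypothetical variant that stores each event as a dict."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []     # finished records
        self._field = None   # 'title', 'location', 'time', or None
        self._chunks = []    # text pieces of the field being read

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            return  # nested tag inside a field: keep collecting text
        classes = (dict(attrs).get('class') or '').split()
        if 'event-title' in classes:
            self._field = 'title'
        elif 'event-location' in classes:
            self._field = 'location'
        elif tag == 'time':
            self._field = 'time'

    def handle_data(self, data):
        if self._field is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if self._field is None or tag != END_TAG[self._field]:
            return
        # Join all chunks and collapse runs of whitespace.
        text = ' '.join(''.join(self._chunks).split())
        if self._field == 'title' or not self.events:
            self.events.append({})  # a title starts a new record
        self.events[-1][self._field] = text
        self._field, self._chunks = None, []


parser = EventParser()
parser.feed(SAMPLE)
parser.close()
print(parser.events)
```

Joining the chunks only when the field's closing tag arrives is the same trick the numbered listing uses for `<time>`; here it is applied to all three fields so that nested tags do not cut the text short.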

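Pay particular attention to lines 15-17 and 30-32 of the numbered listing. When Python's HTMLParser processes the text of a page, it does not hand over an element's content in one piece; it calls handle_data once for each separate string it finds. Take this markup from the page:

  <h3 class="event-title"><a href="/events/python-events/401/">PyOhio 2016</a></h3>

  <span class="event-location">The Ohio Union at The Ohio State University. 1739 N. High Street, Columbus, OH 43210, USA</span>

  <time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. <span class="say-no-more"> 2016</span></time>

Inside <time> the text is interrupted in two places. The nested <span> always splits it, so the year 2016 arrives in its own handle_data call. The entity &ndash; can split it as well: when character references are not converted (convert_charrefs=False, the default before Python 3.5), the entity is reported through handle_entityref instead of handle_data, so the text before and after it arrives as two strings. With convert_charrefs=True (the default since Python 3.5) the entity is decoded and stays in the text. Lines 15-17 and 30-32 work together: every chunk seen between <time> and </time> is appended to self.times and joined only at the closing tag, which is how the year 2016 inside the span is captured.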
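The chunking described above can be observed directly with a small tracing parser. This is a minimal sketch: `TraceParser`, `trace`, and `TIME_HTML` are hypothetical names introduced only for this illustration, while `handle_data`, `handle_entityref`, and the `convert_charrefs` argument are part of the standard `html.parser` module.

```python
from html.parser import HTMLParser

# The <time> element quoted in this post.
TIME_HTML = (
    '<time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. '
    '<span class="say-no-more"> 2016</span></time>'
)


class TraceParser(HTMLParser):
    """Hypothetical helper that records every text callback it receives."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.calls = []

    def handle_data(self, data):
        self.calls.append(('data', data))

    def handle_entityref(self, name):
        # Only reached when convert_charrefs is False.
        self.calls.append(('entityref', name))


def trace(convert_charrefs):
    """Parse TIME_HTML and return the recorded callbacks in order."""
    parser = TraceParser(convert_charrefs)
    parser.feed(TIME_HTML)
    parser.close()
    return parser.calls


old_style = trace(False)  # entity reported separately
new_style = trace(True)   # entity decoded into the text

for call in old_style:
    print('convert_charrefs=False:', call)
for call in new_style:
    print('convert_charrefs=True: ', call)
```

In both modes the text ` 2016` arrives as a separate chunk because of the nested span, which is why collecting chunks until `</time>` is needed regardless of how the entity is handled.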

 


Original article: http://www.cnblogs.com/dongzhuangdian/p/5616948.html
