A Simple Look at HTMLParser in Python

Date: 2016-06-25 22:51:59

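Pick a web page, for example https://www.python.org/events/python-events/, view its source in a browser and copy it into a local file (s.html in the code below). Then try parsing the HTML to print the date, name, and location of each conference listed on the Python website.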

 1 from html.parser import HTMLParser
 2 from html.entities import name2codepoint
 3 
 4 class MyHTMLParser(HTMLParser):
 5 
 6     in_title = False  # inside an element with class "event-title"
 7     in_loca = False   # inside an element with class "event-location"
 8     in_time = False   # inside a <time> element
 9 
10     def handle_starttag(self, tag, attrs):
11         if ('class', 'event-title') in attrs:
12             self.in_title = True
13         elif ('class', 'event-location') in attrs:
14             self.in_loca = True
15         elif tag == 'time':
16             self.in_time = True
17             self.times = []  # collects every text chunk inside <time>
18 
19     def handle_data(self, data):
20         if self.in_title:
21             print('-' * 50)
22             print('Title:' + data.strip())
23         if self.in_loca:
24             print('Location:' + data.strip())
25         if self.in_time:
26             self.times.append(data)
27     def handle_endtag(self, tag):
28         if tag == 'h3': self.in_title = False
29         if tag == 'span': self.in_loca = False
30         if tag == 'time':
31             self.in_time = False
32             print('Time:' + '-'.join(self.times))  # join the collected chunks
33 parser = MyHTMLParser()
34 with open('s.html', encoding='utf-8') as html:  # the saved page source
35     parser.feed(html.read())

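Pay particular attention to lines 15-17 and 30-32 (the `time` branches of handle_starttag and handle_endtag). When Python's HTMLParser parses the text in a page, it hands it to handle_data one string at a time. For elements like the following, the text arrives as a single string: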

　　<h3 class="event-title"><a href="/events/python-events/401/">PyOhio 2016</a></h3>

　　<span class="event-location">The Ohio Union at The Ohio State University. 1739 N. High Street, Columbus, OH 43210, USA</span>

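When the text contains a special sequence such as the entity reference &ndash; (rendered as –), the parser leaves it out of the data and passes the text before and after it as two separate strings. This is the behavior when convert_charrefs is False, which was the default before Python 3.5. From 3.5 on the default is True, so the reference is converted to the character and stays in the surrounding text, but a nested tag still splits the text into separate strings. Lines 15-17 and 30-32 work together to collect all of these pieces, including the year 2016, which sits in a span inside the time element.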
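The chunking described above can be observed directly with a small sketch. The fragment in `SAMPLE` is a made-up snippet modeled on the events page, not copied from it, and `ChunkLogger` and `collect` are hypothetical helper names introduced only for this illustration.

```python
from html.parser import HTMLParser


class ChunkLogger(HTMLParser):
    """Hypothetical helper: records each string passed to handle_data."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []    # text pieces, in the order they arrive
        self.entities = []  # entity names seen via handle_entityref

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Only called when convert_charrefs is False.
        self.entities.append(name)


# Assumed markup: a <time> element with an entity and a nested <span>.
SAMPLE = (
    '<time datetime="2016-07-30">30 July &ndash; 31 July '
    '<span class="say-no-more"> 2016</span></time>'
)


def collect(convert_charrefs):
    """Parse SAMPLE and return (text chunks, entity names)."""
    parser = ChunkLogger(convert_charrefs)
    parser.feed(SAMPLE)
    parser.close()
    return parser.chunks, parser.entities


print(collect(False))  # entity reported separately, text split around it
print(collect(True))   # entity converted in place, text split only by <span>
```

With `convert_charrefs=False` the text should arrive as three strings plus one `ndash` entity; with `convert_charrefs=True` it should arrive as two strings, the first containing the en dash. Either way the year comes in its own string, which is why the parser above accumulates the pieces in a list and joins them when the closing time tag is reached.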

Original article: http://www.cnblogs.com/dongzhuangdian/p/5616948.html
