Tags: def 5.0 gecko ext print .text 6.2 title window
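This uses urllib2 and lxml under Python 2.x, and it should also be faster than BeautifulSoup4.
(That said, why does everyone reach for BS4? Wouldn't a single XPath expression do the job?)
If lxml isn't installed yet, install it with pip: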
pip install lxml
Demo in the Python shell:
>>> from lxml import etree
>>> import urllib2
>>> page = etree.HTML(urllib2.urlopen('https://blog.csdn.net/z690798364/article/details/79960358').read().decode('utf-8'))
>>> print page.xpath(u"/html/head/title")[0].text
Lxml 解析网页用法笔记 - z690798364的专栏 - CSDN博客
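One detail is worth seeing in isolation: xpath() returns a list of matching elements, and that list is simply empty when nothing matches. The sketch below parses an inline HTML string (made up for illustration, so no network access is needed) and is written to behave the same under Python 2 and Python 3.

```python
from lxml import etree

# Hypothetical inline document, used here instead of a downloaded page.
SAMPLE_HTML = u"<html><head><title>Example page</title></head><body><p>Hi</p></body></html>"

page = etree.HTML(SAMPLE_HTML)

# A matching path yields a list of elements.
titles = page.xpath("/html/head/title")
title_text = titles[0].text

# A path that matches nothing yields an empty list, not None.
missing = page.xpath("/html/head/meta")

print(title_text)
print(missing)
```

So any check for a missing title has to test for an empty list rather than compare against None.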
Wrapped up as a function:
from lxml import etree
import urllib2
#...
def get_site_title(link):
    send_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Connection': 'keep-alive'
    }  # Spoof browser-like headers to avoid a 403 response
    # Fetch the page, decode it as UTF-8, and select the <title> element
    title = etree.HTML(urllib2.urlopen(urllib2.Request(link, headers=send_headers)).read().decode('utf-8')).xpath("/html/head/title")
    # xpath() returns a list, which is empty (never None) when nothing matches
    if not title:
        raise ValueError('target miss')
    return title[0].text
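Python 2 and urllib2 are no longer maintained, so here is a rough Python 3 sketch of the same idea, assuming urllib.request as the stand-in for urllib2. The names extract_title and get_site_title_py3 are made up for this sketch; the split lets the parsing step be tried on a literal string without touching the network.

```python
from urllib.request import Request, urlopen

from lxml import etree

# Same browser-like headers as the function above.
SEND_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Connection": "keep-alive",
}


def extract_title(html_text):
    """Return the text of /html/head/title, or raise ValueError if absent."""
    matches = etree.HTML(html_text).xpath("/html/head/title")
    if not matches:  # xpath() gives an empty list when nothing matches
        raise ValueError("target miss")
    return matches[0].text


def get_site_title_py3(link):
    """Fetch link with spoofed headers and return its <title> text."""
    request = Request(link, headers=SEND_HEADERS)
    with urlopen(request) as response:
        # Assumes the page is UTF-8 encoded, as the original code does.
        html_text = response.read().decode("utf-8")
    return extract_title(html_text)


# Offline check of the parsing step on a made-up document.
sample_title = extract_title("<html><head><title>Demo</title></head><body></body></html>")
print(sample_title)
```

Calling get_site_title_py3 with a real URL performs the download; pages that are not UTF-8 would need a different decode step.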
Original source: https://www.cnblogs.com/santiego/p/10328428.html