Getting a web page title with Python 2

Posted: 2019-01-28 10:55:04



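This approach uses urllib2 and lxml on Python 2.x, and should be faster than BeautifulSoup4. (That said, why does everyone reach for BS4 when a single XPath expression does the job?)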

If lxml is not installed yet, install it with pip:

pip install lxml

Interactive shell demo:

>>> from lxml import etree
>>> import urllib2
>>> page = etree.HTML(urllib2.urlopen('https://blog.csdn.net/z690798364/article/details/79960358').read().decode('utf-8'))
>>> print page.xpath(u"/html/head/title")[0].text
Lxml 解析网页用法笔记 - z690798364的专栏 - CSDN博客
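
The same XPath lookup can be tried without network access by parsing an inline HTML string. This is only a minimal sketch: the sample markup and the variable names are made up for illustration, and the snippet is written so that it should run unchanged on both Python 2 and Python 3.

```python
from __future__ import print_function
from lxml import etree

# Hypothetical sample page, not taken from the article.
sample_html = u"<html><head><title>Example page</title></head><body><p>hi</p></body></html>"

page = etree.HTML(sample_html)
# xpath() always returns a list; the list is empty when nothing matches.
titles = page.xpath("/html/head/title")
title_text = titles[0].text if titles else None
print(title_text)
```

Because xpath() returns a list, indexing [0] directly, as in the shell demo, raises IndexError on a page without a title element.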

Wrapped up as a function:

from lxml import etree
import urllib2
#...
def get_site_title(link):
    # Send browser-like headers so the server is less likely to answer 403.
    send_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Connection': 'keep-alive'
    }
    request = urllib2.Request(link, headers=send_headers)
    # Assumes the page is UTF-8 encoded.
    html = urllib2.urlopen(request).read().decode('utf-8')
    title = etree.HTML(html).xpath("/html/head/title")
    # xpath() returns a list, so a missing <title> is an empty list, never None.
    if not title:
        raise ValueError('target miss')
    return title[0].text
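
One possible refinement is to separate the parsing step from the download so that it can be exercised without a network connection. The sketch below is a variant under stated assumptions, not the author's code: extract_title is a hypothetical helper name, it expects HTML that the caller has already decoded to text, and it returns None instead of raising when no usable title is found.

```python
from __future__ import print_function
from lxml import etree


def extract_title(html_text):
    """Return the stripped <title> text of an HTML document, or None.

    Hypothetical helper for illustration; html_text is the already
    decoded page source.
    """
    if not html_text.strip():
        return None  # nothing to parse
    page = etree.HTML(html_text)
    if page is None:
        return None
    titles = page.xpath("/html/head/title")
    # Guard against both a missing <title> and an empty one (text is None).
    if not titles or titles[0].text is None:
        return None
    return titles[0].text.strip() or None


print(extract_title(u"<html><head><title> Hello </title></head><body></body></html>"))
```

With a helper like this, the download function would only need to fetch and decode the page, then pass the resulting text to extract_title.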


Original article: https://www.cnblogs.com/santiego/p/10328428.html
